Multimodal Structured Document Understanding and MLLM-Generated Document Detection

Date:

Recorded session link
Slides here


In this talk, I will give a comprehensive introduction to research progress in MLLM-driven scientific document understanding.


Multimodal large language models (MLLMs) have shown strong capabilities on tasks involving visual text, visual speech, and visual reasoning. In practical applications, however, data is often diverse and multimodal, spanning text, images, video, audio, and more. How to effectively leverage such multimodal data and learn a unified structured representation that bridges the differences between modalities has become a crucial challenge in current research on large models.


This talk first reviews existing studies of multimodal large models on content-recognition and understanding tasks. It then addresses two problems: multimodal large models struggle with complex modalities such as tables, charts, and geometric figures, and they fail to capture the logical relationships within scientific documents. A unified structured representation is proposed which, combined with theoretical models, enables multimodal large models to perform verifiable and traceable reasoning, helping to mitigate hallucinations that arise when the models carry out reasoning tasks. Furthermore, as the generation capabilities of multimodal large models grow, they can produce highly realistic images, text, and video that are difficult to distinguish from real content, which poses risks in areas such as academic writing and fake-news production. Therefore, this talk also reviews detection methods for synthetically generated visual and textual content, with a focus on zero-shot detection methods, the latest research results, and future directions for exploration.
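
As a concrete illustration of the zero-shot detection idea mentioned above, here is a minimal sketch of one common family of zero-shot text detectors: scoring a passage by its average token log-likelihood under an off-the-shelf proxy language model, on the intuition that machine-generated text tends to look unusually "predictable" to such a model. The proxy model ("gpt2"), the threshold value, and the function names are illustrative assumptions, not methods presented in the talk.

```python
# Minimal sketch of a zero-shot machine-text detector based on
# average token log-likelihood under a proxy language model.
# The model choice ("gpt2") and the threshold are assumptions
# for illustration, not the talk's actual method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_log_likelihood(text: str) -> float:
    """Mean per-token log-likelihood of `text` under the proxy LM."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # `out.loss` is the mean negative log-likelihood per token.
    return -out.loss.item()

def looks_machine_generated(text: str, threshold: float = -3.0) -> bool:
    # Higher (less negative) average log-likelihood means the text is
    # more predictable to the proxy LM, which zero-shot detectors
    # treat as a signal of machine generation.
    return avg_log_likelihood(text) > threshold

if __name__ == "__main__":
    sample = "Large language models can produce fluent scientific prose."
    print(avg_log_likelihood(sample), looks_machine_generated(sample))
```

In practice, stronger zero-shot detectors refine this raw likelihood score, for example by comparing it against scores of perturbed versions of the same passage, but the scoring step above is the shared starting point.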