DocGenome: An Open Large-scale Scientific Document Benchmark for Training Next-generation Large Models

Published in 500K scientific documents including 6.5M pages with 13 attributes of component units, 6 logical relationships between units, and 7 document-oriented tasks, 2024

Overall

This blog will introduce our proposed large-scale scientific dataset, DocGenome, which is constructed using our custom auto-labeling pipeline, DocParser.

DocParser is a cutting-edge auto-labeling pipeline that generates both the attribute information of component units and the logical relationships between units by auto-annotating and structuring a large number of unlabeled arXiv papers. It has four stages:

  • 1) data preprocessing,
  • 2) unit segmentation,
  • 3) attribute assignment and relation retrieval,
  • 4) color rendering.

Using DocParser, we construct DocGenome by auto-annotating and structuring unlabeled arXiv papers. The dataset has four characteristics:

  • 1) Completeness: It is the first dataset to structure data from all modalities, including 13 layout attributes along with their LaTeX source code.
  • 2) Logicality: It provides 6 logical relationships between different entities within each scientific document.
  • 3) Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA and multi-page QA.
  • 4) Correctness: It undergoes rigorous quality control checks conducted by a specialized team.

In addition, we conduct extensive experiments on DocGenome to demonstrate its advantages and to objectively evaluate the performance of current large models on our benchmark.

 


Release

 

DocGenome Benchmark Introduction

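| Datasets | # Discipline | # Category of Units | # Pages in Train-set | # Pages in Test-set | # Task | # Used Metric | Publication Period | Entity Relations |
|---|---|---|---|---|---|---|---|---|
| DocVQA | - | N/A | 11K | 1K | 1 | 2 | 1960-2000 | ❎ |
| DocLayNet | - | 11 | 80K | 8K | 1 | 1 | - | ❎ |
| DocBank | - | 13 | 0.45M | 50K | 3 | 1 | 2014-2018 | ❎ |
| PubLayNet | - | 5 | 0.34M | 12K | 1 | 1 | - | ❎ |
| VRDU | - | 10 | 7K | 3K | 3 | 1 | - | ❎ |
| DUDE | - | N/A | 20K | 6K | 3 | 3 | 1860-2022 | ❎ |
| D^4LA | - | 27 | 8K | 2K | 1 | 3 | - | ❎ |
| Fox Benchmark | - | 5 | N/A (No train-set) | 0.2K | 3 | 5 | - | ❎ |
| ArXivCap | 32 | N/A | 6.4M* | N/A | 4 | 3 | - | ❎ |
| DocGenome (ours) | 153 | 13 | 6.8M | 9K | 7 | 7 | 2007-2022 | ✅ |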

 

👇🏻 DocGenome-train Download

We provide 8 subsets of DocGenome-train for downloading:

Data Download

- [docgenome-train-000.tar.gz]()
- [docgenome-train-001.tar.gz]()
- [docgenome-train-002.tar.gz]()
- [docgenome-train-003.tar.gz]()
- [docgenome-train-004.tar.gz]()
- [docgenome-train-005.tar.gz]()
- [docgenome-train-006.tar.gz]()
- [docgenome-train-007.tar.gz]()
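Once some or all of the archives have been downloaded, they can be unpacked with Python's standard `tarfile` module. The following is a minimal sketch, not an official loader: the function name `extract_subsets` and the directory arguments are hypothetical, and the internal layout of the archives is not assumed.

```python
import tarfile
from pathlib import Path

# The eight subset archive names listed above.
SUBSETS = [f"docgenome-train-{i:03d}.tar.gz" for i in range(8)]


def extract_subsets(download_dir, out_dir):
    """Extract every subset archive present in download_dir into out_dir.

    Returns the names of the archives that were found and extracted.
    Archives that have not been downloaded yet are skipped.
    """
    download_dir, out_dir = Path(download_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for name in SUBSETS:
        archive = download_dir / name
        if not archive.exists():
            continue  # this subset has not been downloaded
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(out_dir)
        extracted.append(name)
    return extracted
```

Calling `extract_subsets("downloads", "docgenome-train")` would unpack whichever of the eight archives are present in the (hypothetical) `downloads` folder.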

Definition of relationships between component units

DocGenome contains four level relation types and two cite relation types, as shown in the following table:

| Name | Description | Example |
|---|---|---|
| Identical | Two blocks share the same source code. | Cross-column text; Cross-page text. |
| Title adjacent | The two titles are adjacent. | (\section{introduction}, \section{method}) |
| Subordinate | One block is a subclass of another block. | (\section{introduction}, paragraph within Introduction) |
| Non-title adjacent | The two text or equation blocks are adjacent. | (Paragraph 1, Paragraph 2) |
| Explicitly-referred | One block refers to another block via footnote, reference, etc. | (As shown in \ref{Fig: 5} …, Figure 5) |
| Implicitly-referred | The caption block refers to the corresponding float environment. | (Table Caption 1, Table 1) |
</details>
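To make the six relation types concrete, here is a minimal sketch of how they could be represented in code. The names `RelationType`, `Relation`, and `CITE_RELATIONS` are illustrative and are not taken from the released annotation format. Treating the first four types as level relations and the last two as cite relations is an assumption inferred from the table.

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    # Level relations (assumed grouping: first four rows of the table)
    IDENTICAL = "identical"
    TITLE_ADJACENT = "title_adjacent"
    SUBORDINATE = "subordinate"
    NON_TITLE_ADJACENT = "non_title_adjacent"
    # Cite relations (assumed grouping: last two rows of the table)
    EXPLICITLY_REFERRED = "explicitly_referred"
    IMPLICITLY_REFERRED = "implicitly_referred"


CITE_RELATIONS = {
    RelationType.EXPLICITLY_REFERRED,
    RelationType.IMPLICITLY_REFERRED,
}


@dataclass(frozen=True)
class Relation:
    """A directed relation between two component units (hypothetical schema)."""
    source: str
    target: str
    kind: RelationType

    @property
    def is_cite(self):
        return self.kind in CITE_RELATIONS


# The "Implicitly-referred" example from the table: a caption and its float.
caption_link = Relation("Table Caption 1", "Table 1", RelationType.IMPLICITLY_REFERRED)
# The "Non-title adjacent" example from the table.
paragraph_link = Relation("Paragraph 1", "Paragraph 2", RelationType.NON_TITLE_ADJACENT)
```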

Attribute of component units

DocGenome has 13 attributes of component units, which can be categorized into two classes:

  • 1) Fixed-form units, including Text, Title, Abstract, etc., which are characterized by sequential reading and hierarchical relationships that are readily discernible from the list obtained in stage 2 of DocParser.
  • 2) Floating-form units, including Table, Figure, etc., which establish directional references to fixed-form units through commands like \ref and \label.

| Index | Category | Notes |
|---|---|---|
| 0 | Algorithm | |
| 1 | Caption | Titles of Images, Tables, and Algorithms |
| 2 | Equation | |
| 3 | Figure | |
| 4 | Footnote | |
| 5 | List | |
| 7 | Table | |
| 8 | Text | |
| 9 | Text-EQ | Text block with inline equations |
| 10 | Title | Section titles |
| 12 | PaperTitle | |
| 13 | Code | |
| 14 | Abstract | |

Note that we do not use the “others” category and the “reference” category, whose indices are 6 and 11, respectively.
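Because indices 6 and 11 are skipped, the category indices are not contiguous, which is worth handling explicitly when decoding labels. A minimal sketch of such a lookup follows; the names `CATEGORIES`, `UNUSED`, and `category_name` are illustrative and not part of the dataset tooling.

```python
# Index-to-category mapping copied from the table above.
CATEGORIES = {
    0: "Algorithm",
    1: "Caption",
    2: "Equation",
    3: "Figure",
    4: "Footnote",
    5: "List",
    7: "Table",
    8: "Text",
    9: "Text-EQ",
    10: "Title",
    12: "PaperTitle",
    13: "Code",
    14: "Abstract",
}

# Indices that exist in the numbering but are not used in DocGenome.
UNUSED = {6: "others", 11: "reference"}


def category_name(index):
    """Return the category name for an index, rejecting the unused ones."""
    if index in UNUSED:
        raise ValueError(f"index {index} ({UNUSED[index]!r}) is not used in DocGenome")
    return CATEGORIES[index]
```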

Types of disciplines

Page distribution of DocGenome. 20% of documents are five pages or fewer, 50% are ten pages or fewer, and 80% are nineteen pages or fewer.

Page Distribution

 

Distribution of secondary disciplines in our DocGenome. The count on the x-axis represents the number of documents, and documents from the same primary discipline are marked with the same color.

Discipline Distribution

 

DocParser: A Cutting-edge Auto-labeling Pipeline

Schematic of the designed DocParser pipeline for automated document annotation. The process is divided into four distinct stages:

  • 1) Data Preprocessing,
  • 2) Unit Segmentation,
  • 3) Attribute Assignment and Relation Retrieval,
  • 4) Color Rendering.

DocParser converts the LaTeX source code of a complete document into annotations for its component units (source code, attributes, relationships, and bounding boxes), together with a rendered PNG of the entire document.
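As an illustration of what a per-unit annotation could carry, the sketch below bundles the four kinds of information named above into one record. It is a hypothetical schema: the class name `UnitAnnotation`, every field name, and the `(x0, y0, x1, y1)` bounding-box convention are assumptions and may differ from the released files.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class UnitAnnotation:
    """One component unit of a document (hypothetical schema)."""
    unit_id: int
    page: int
    category: str      # attribute, e.g. "Text", "Figure", "Caption"
    source_code: str   # LaTeX source of this unit
    bbox: tuple        # assumed (x0, y0, x1, y1) on the rendered page
    # Each relation is a (relation_name, target_unit_id) pair.
    relations: list = field(default_factory=list)


caption = UnitAnnotation(
    unit_id=7,
    page=3,
    category="Caption",
    source_code=r"\caption{Example caption.}",
    bbox=(72.0, 540.0, 300.0, 560.0),
    relations=[("implicitly_referred", 6)],
)

# Serialize to JSON, e.g. for writing one record per line.
record = json.dumps(asdict(caption))
```

Keeping relations as pairs that point at another unit's identifier lets the record stay flat and serializable while still describing a graph over units.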

Visualizations

Visual examples of document-oriented tasks in DocGenome

Citation

If you find our work useful in your research, please consider citing DocGenome:

@article{,

}