DocGenome: An Open Large-scale Scientific Document Benchmark for Training Next-generation Large Models
500K scientific documents including 6.5M pages with 13 attributes of component units, 6 logical relationships between units, and 7 document-oriented tasks, 2024
We construct DocGenome, a structured document dataset covering 500K annotated scientific documents from 153 disciplines. We show that the performance of our model, trained on DocGenome, surpasses that of closed-source commercial tools.