Highlighted Projects

DocGenome: An Open Large-scale Scientific Document Benchmark for Training Next-generation Large Models

500K scientific documents comprising 6.5M pages, with 13 attributes of component units, 6 logical relationships between units, and 7 document-oriented tasks, 2024

We construct DocGenome, a structured document dataset covering 500K annotated scientific documents from 153 disciplines. We show that our model, trained on DocGenome, outperforms closed-source commercial tools.

3DTrans Codebase Introduction

The first codebase for 3D pre-training and continuous learning, 2023

An open-source codebase for exploring continuous-learning-oriented and pre-training-oriented autonomous driving tasks.