Bo Zhang received the Ph.D. degree in electronic engineering from the School of Information Science and Technology, Fudan University. He is currently a Research Scientist in Shanghai AI Laboratory. His work has led to many awards, including Shanghai Rising Star under Grant No. 23QD1401000, awarded by the Shanghai Municipal Commission of Science and Technology, the National Scholarship 2021 China Award, the 2019 Excellent Doctoral Scholarship of Fudan University Award, and various awards from VALSE China and Shanghai Government. His research outcomes have some impacts on industrial applications like airport checkpoint security recognition, domain adaptive face recognition, and localization of concealed dangerous objects.
He has published 30+ papers in top-tier international conferences and journals such as CVPR, NeurIPS, ICML, ICLR, T-PAMI, TIP, T-MM, and IJCV. He also serves as a reviewer for several prestigious academic conferences and journals, including CVPR, ECCV, ICCV, ICLR, and ICML. He led the development of the 3DTrans general scene representation open-source project, which won the Waymo Challenge international competition and accumulated over 1.5k stars. Furthermore, he is committed to exploring the fundamental nature of long-chain reasoning in large models and aims to develop innovator-level agents through reinforcement learning methods and reflection mechanism.

🔥 News

2024:

  • 2024.12:  🎉🎉 Our research project, GeoX, has been officially open-sourced today. It is the first to explore formalized visual-language pre-training in enhancing geometric problem-solving abilities.

  • 2024.10:  🎉🎉 Grateful for the heartfelt recognition and thoughtful sharing of my research work Fudan_CYL and Fudan_SIST .

  • 2024.10:  🎉🎉 The technical report for MinerU, an open-source solution for high-precision document content extraction, has been published.

  • 2024.09:  🎉🎉 Three papers accepted to NeurIPS 2024: AdaptiveDiffusion, ZOPP, LeapAD

  • 2024.09:  🎉🎉 Previous evaluation metrics for Formula and Table Recognition tasks, such as BLEU and Edit Distrance, exhibit limitations. Our CDM has been released to ensure the evaluation objectivity by designing an image-level rather than LaTex-level metric score for Formula and Table Recognition evaluation.

  • 2024.08:  🎉🎉 Bo Zhang was invited to serve as a PC member of AAAI 2025.

  • 2024.08:  🎉🎉 We open-sourced StructTable: Table Structural Extraction Model Models and StructEqTable-Deploy. It is a open-source repository to support the structuring tasks of visual tables.

  • 2024.08:  🎉🎉 We collaborated with the OpenDataLab team to open-source the PDF-Extract-Kit. It can extract high-quality and structured content from PDFs and has gained 6K+ stars.

  • 2024.07:  🎉🎉 One paper (Reg-TTA3D) is accepted by ECCV 2024. We explore test-time adaptive 3d object detection for the first time.

  • 2024.03:  🎉🎉 One paper is accepted by ACL 2024. We propose All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models.

  • 2024.02:  🎉🎉 One paper (Once for Both) is accepted by CVPR 2024. Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression.

  • 2024.01:  🎉🎉 One paper (ReSimAD) is accepted by ICLR 2024. We propose a zero-shot generalization framework by reconstructing mesh and simulating target point clouds.

  • 2024.01:  🎉🎉 Two papers (IPNet and MVNet) are accepted by TCSVT.

2023:

  • 2023.12:  🎉🎉 We have released the ChartX benchmark, covering 18 chart types, 7 chart tasks, 22 disciplinary topics to evaluate the chart-related capabilities of the existing MLLMS.

  • 2023.09:  🎉🎉 StructChart: our research on visual chart, has been released arXiv paper, where we will release the SimChart9K dataset powered by LLM. By the proposed SimChart9K, we observe that StructChart continuously improves the chart perception performance as more simulated charts are used for pre-training.

  • 2023.09:  🎉🎉 SPOT, showing a promising and scalable 3D pre-training on autonomous driving, has been released (See our paper for more details, arXiv paper).

  • 2023.09:  🎉🎉 - One paper entitled “AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset” is accepted by NeurIPS-2023.

  • 2023.07:  🎉🎉 One paper about cross-domain background-fouced alignment "Rethinking Cross-Domain Pedestrian Detection: A Background-Focused Distribution Alignment Framework for Instance-Free One-Stage Detectors" is accepted by TIP.

  • 2023.07:  🎉🎉 One paper entitled "SUG: Single-dataset Unified Generalization for 3D Point Cloud Classification" is accepted by ACM MM-2023.

  • 2023.04:  🎉🎉 One paper entitled "Performance-aware Approximation of Global Channel Pruning for Multitask CNNs" is accepted for publication in T-PAMI.

  • 2023.03:  🎉🎉 Three papers are accepted by CVPR-2023: Uni3D, Bi3D, GDP.

  • 2023.02:  🎉🎉 Bo Zhang started to work on exploring how to improve the problem-solving and reasoning ability of LLMs or VLMs for complicated modalities, including Chart, Table, Geometry, Scientific Document, by investigating foundation LLM models from the perspective of structured knowledge-rich data.

📝 Selected Publications

SCIS
sym

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang

[Project][Paper]

  • Propose InternVL 1.5 and InternVL 2. (Rank 1st among open-source VLM models on MMMU, DocVQA, ChartQA, and MathVista.)
NeurIPS 2024
sym

Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy

Hancheng Ye, Jiakang Yuan, Renqiu Xia, Xiangchao Yan, Tao Chen, Junchi Yan, Botian Shi, Bo Zhang^(corr.)

[Project][Paper]

  • Propose AdaptiveDiffusion to adaptively reduce the noise prediction steps during the denoising proces guided by the third-order latent difference.
NeurIPS 2024
sym

ZOPP: A Framework of Zero-shot Offboard Panoptic Perception for Autonomous Driving

Tao Ma, Hongbin Zhou, Qiusheng Huang, Xuemeng Yang, Jianfei Guo, Bo Zhang, Min Dou, Yu Qiao, Botian Shi, Hongsheng Li

[Project][Paper]

  • ZOPP integrates the powerful zero-shot recognition capabilities of vision foundation models and 3D representations derived from point clouds.
NeurIPS 2024
sym

Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving

Jianbiao Mei, Yukai Ma, Xuemeng Yang, Licheng Wen, Xinyu Cai, Xin Li, Daocheng Fu, Bo Zhang, Pinlong Cai, Min Dou, Botian Shi, Liang He, Yong Liu, Yu Qiao

[Project][Paper]

  • LeapAD incorporates an innovative dual-process decision-making module, which consists of an Analytic Process (System-II) for thorough analysis and reasoning, along with a Heuristic Process (System-I) for swift and empirical processing.
ICML 2024
sym

On the Emergence of Cross-Task Linearity in the Pretraining-Finetuning Paradigm

Zhanpeng Zhou, Zijun Chen, Yilan Chen, Bo Zhang^(corr.), Junchi Yan

[Project][Paper]

  • We discover an intriguing linear phenomenon in models that are initialized from a common pretrained checkpoint and finetuned on different tasks, termed as Cross-Task Linearity (CTL).
CVPR 2024
sym

Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression

Hancheng Ye, Chong Yu, Peng Ye, Renqiu Xia, Yansong Tang, Jiwen Lu, Tao Chen, Bo Zhang^(corr.)

[Project][Paper]

  • We investigate how to integrate the evaluations of importance and sparsity scores into a single stage, searching the optimal subnets in an efficient manner.
  • We present OFB, a cost-efficient approach that simultaneously evaluates both importance and sparsity scores, termed Once for Both (OFB), for Vision Transformer Compression (VTC) task.
ICLR 2024
sym

ReSimAD: Zero-Shot 3D Domain Transfer for Autonomous Driving with Source Reconstruction and Target Simulation

Bo Zhang, Xinyu Cai, Jiakang Yuan, Donglin Yang, Jianfei Guo, Xiangchao Yan, Renqiu Xia, Botian Shi, Min Dou, Tao Chen, Si Liu, Junchi Yan, Yu Qiao

[Project][Paper]

  • Provide a new perspective and approach of alleviating the domain shifts, by proposing a Reconstruction-Simulation-Perception scheme.
CVPR 2023
sym

Uni3D: A Unified Baseline for Multi-dataset 3D Object Detection

Bo Zhang, Jiakang Yuan, Botian Shi, Tao Chen, Yikang Li, Yu Qiao

[Project][Paper]

  • Present a Uni3D which tackle multi-dataset 3D object detection from data-level and semantic-level.
NeurIPS 2023
sym

AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset

Jiakang Yuan, Bo Zhang^(corr.), Xiangchao Yan, Tao Chen, Botian Shi, Yikang Li, Yu Qiao

[Project][Paper]

  • Build a large-scale pre-training point-cloud dataset with diverse data distribution, and meanwhile learn generalizable representations.
CVPR 2023
sym

Bi3D: Bi-domain Active Learning for Cross-domain 3D Object Detection

Jiakang Yuan, Bo Zhang^(corr.), Xiangchao Yan, Tao Chen, Botian Shi, Yikang Li, Yu Qiao

[Project][Paper]

  • Propose a Bi-domain active learning approach which select samples from both source and target domain to solve the cross-domain 3D object detection task.
ECCV 2024
sym

Reg-TTA3D: Better Regression Makes Better Test-time Adaptive 3D Object Detection

Jiakang Yuan, Bo Zhang, Kaixiong Gong, Xiangyu Yue, Botian Shi, Yu Qiao, Tao Chen

[Project][Paper]

  • Explore a new task named test-time domain adaptive 3D object detection and propose a pseudo-label-based test-time adaptative 3D object detection method.
ACM'MM 2023
sym

SUG: Single-dataset Unified Generalization for 3D Point Cloud Classification

Siyuan Huang, Bo Zhang^(corr.), Botian Shi, Peng Gao, Yikang Li, Hongsheng Li

[Project][Paper]

  • Propose a Single-dataset Unified Generalization (SUG) framework that only leverages a single source dataset to alleviate the unforeseen domain differences faced by a well-trained source model. .
TPAMI
sym

Performance-aware Approximation of Global Channel Pruning for Multitask CNNs

Hancheng Ye, Bo Zhang, Tao Chen, Jiayuan Fan, Bin Wang

[Project][Paper]

  • We propose a Performance-Aware Global Channel Pruning (PAGCP) framework. We first theoretically present the objective for achieving superior GCP, by considering the joint saliency of filters from intra- and inter-layers.
TIP
sym

Sample-Centric Feature Generation for Semi-Supervised Few-Shot Learning

Bo Zhang, Hancheng Ye, Gang Yu, Bin Wang, Yike Wu, Jiayuan Fan, Tao Chen

[Project][Paper]

  • Propose a sample-centric feature generation (SFG) approach for semi-supervised few-shot image classification.
ACM'MM 2022
sym

Learning Cross-Image Object Semantic Relation in Transformer for Few-Shot Fine-Grained Image Classification

Bo Zhang, Jiakang Yuan, Baopu Li, Tao Chen, Jiayuan Fan, Botian Shi

[Project][Paper]

  • Propose a Transformer-based double-helix model to achieve the cross-image object semantic relation mining in a bidirectional and symmetrical manner.

💬 Invited Talks

  • 2024.07, Invited talk of Multimodal Large Model Summit. [Video]
  • 2023.09, Invited talk of Effcient Pre-training of Autonomous Driving. [Video]
  • 2023.07, Invited talk of Towards 3D General Representation at Techbeat. [Video]
  • 2023.03, Invited talk of Transferable Pwerception of Autonomous Driving. [Video]

💻 Internships

📝 Collaborators