A paper list of large multimodal models (large vision-language models).

Year | Pub. | Title | Links |
---|---|---|---|
2023 | MIR'23 | Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey <br> Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, Wen Gao | Paper / Code |
2023 | arxiv'23 | Vision-Language Models for Vision Tasks: A Survey <br> Jingyi Zhang, Jiaxing Huang, Sheng Jin, Shijian Lu | Paper / Code |
2023 | arxiv'23 | A Survey on Multimodal Large Language Models <br> Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen | Paper / Code |
2024 | arxiv'24 | Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions <br> Akash Ghosh, Arkadeep Acharya, Sriparna Saha, Vinija Jain, Aman Chadha | Paper |
2024 | arxiv'24 | Efficient Multimodal Large Language Models: A Survey <br> Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, Yabiao Wang, Chengjie Wang, Lizhuang Ma | Paper / Code |

Name | Pub. | Title | Links |
---|---|---|---|
BLIP-2 | ICML'23 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models <br> Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi | Paper / Code |
LLaVA | NeurIPS'23 | Visual Instruction Tuning <br> Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee | Paper / Code |
Emu1 | ICLR'24 | Emu: Generative Pretraining in Multimodality <br> Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang | Paper / Code |
Emu2 | CVPR'24 | Generative Multimodal Models are In-Context Learners <br> Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang | Paper / Code |
InternVL | CVPR'24 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks <br> Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai | Paper / Code |

Name | Pub. | Title | Links |
---|---|---|---|
ViTamin | CVPR'24 | ViTamin: Designing Scalable Vision Models in the Vision-Language Era <br> Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen | Paper / Code |
EVE1 | NeurIPS'24 | Unveiling Encoder-Free Vision-Language Models <br> Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang | Paper / Code |
Emu3 | arxiv'24 | Emu3: Next-Token Prediction is All You Need <br> Emu3 Team, BAAI | Paper / Code |
Mono-InternVL | CVPR'25 | Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training <br> Gen Luo, Xue Yang, Wenhan Dou, Zhaokai Wang, Jiawen Liu, Jifeng Dai, Yu Qiao, Xizhou Zhu | Paper / Code |

Name | Pub. | Title | Links |
---|---|---|---|
LLaVA-Mini | arxiv'25 | LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token <br> Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng | Paper / Code |
R1-V | - | R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3 <br> Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, Vinci, Zihao Yue | Blog / Code |
EVE2 | arxiv'25 | EVEv2: Improved Baselines for Encoder-Free Vision-Language Models <br> Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, Xinlong Wang | Paper / Code |
Qwen2.5-VL | arxiv'25 | Qwen2.5-VL Technical Report <br> Qwen Team, Alibaba Group | Paper / Code |
MedVLM-R1 | arxiv'25 | MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning <br> Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, Daniel Rueckert | Paper |
VLM-R1 | - | VLM-R1: A stable and generalizable R1-style Large Vision-Language Model <br> Om AI Lab Team | Code |
ViRFT | arxiv'25 | Visual-RFT: Visual Reinforcement Fine-Tuning <br> Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang | Paper / Code |