
# Awesome-Large-Multimodal-Model

A paper list of large multimodal models (large vision-language models).

## Survey

| Year | Pub. | Title | Authors | Links |
| --- | --- | --- | --- | --- |
| 2023 | MIR'23 | Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey | Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, Wen Gao | Paper / Code |
| 2023 | arXiv'23 | Vision-Language Models for Vision Tasks: A Survey | Jingyi Zhang, Jiaxing Huang, Sheng Jin, Shijian Lu | Paper / Code |
| 2023 | arXiv'23 | A Survey on Multimodal Large Language Models | Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen | Paper / Code |
| 2024 | arXiv'24 | Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions | Akash Ghosh, Arkadeep Acharya, Sriparna Saha, Vinija Jain, Aman Chadha | Paper |
| 2024 | arXiv'24 | Efficient Multimodal Large Language Models: A Survey | Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, Yabiao Wang, Chengjie Wang, Lizhuang Ma | Paper / Code |

## 2023

| Name | Pub. | Title | Authors | Links |
| --- | --- | --- | --- | --- |
| BLIP-2 | ICML'23 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi | Paper / Code |
| LLaVA | NeurIPS'23 | Visual Instruction Tuning | Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee | Paper / Code |
| Emu1 | ICLR'24 | Emu: Generative Pretraining in Multimodality | Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang | Paper / Code |
| Emu2 | CVPR'24 | Generative Multimodal Models are In-Context Learners | Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang | Paper / Code |
| InternVL | CVPR'24 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai | Paper / Code |

## 2024

| Name | Pub. | Title | Authors | Links |
| --- | --- | --- | --- | --- |
| ViTamin | CVPR'24 | ViTamin: Designing Scalable Vision Models in the Vision-Language Era | Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen | Paper / Code |
| EVE | NeurIPS'24 | Unveiling Encoder-Free Vision-Language Models | Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang | Paper / Code |
| Emu3 | arXiv'24 | Emu3: Next-Token Prediction is All You Need | Emu3 Team, BAAI | Paper / Code |
| Mono-InternVL | CVPR'25 | Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training | Gen Luo, Xue Yang, Wenhan Dou, Zhaokai Wang, Jiawen Liu, Jifeng Dai, Yu Qiao, Xizhou Zhu | Paper / Code |

## 2025

| Name | Pub. | Title | Authors | Links |
| --- | --- | --- | --- | --- |
| LLaVA-Mini | arXiv'25 | LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token | Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng | Paper / Code |
| R1-V | - | R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3 | Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, Vinci, Zihao Yue | Blog / Code |
| EVEv2 | arXiv'25 | EVEv2: Improved Baselines for Encoder-Free Vision-Language Models | Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, Xinlong Wang | Paper / Code |
| Qwen2.5-VL | arXiv'25 | Qwen2.5-VL Technical Report | Qwen Team, Alibaba Group | Paper / Code |
| MedVLM-R1 | arXiv'25 | MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning | Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, Daniel Rueckert | Paper |
| VLM-R1 | - | VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model | Om AI Lab Team | Code |
| Visual-RFT | arXiv'25 | Visual-RFT: Visual Reinforcement Fine-Tuning | Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang | Paper / Code |
