This is a paper list of self-supervised pretraining methods. Papers are listed in order of their appearance on arXiv.
Papers are also categorized by topic; click the links below to jump to the papers on a topic you are interested in. A minimal illustrative sketch of the two recurring training objectives (contrastive learning and masked image modeling) follows the topic list.
- Contrastive Learning/Joint Embedding
- Masked Image Modeling
- Light-weight Model Pretraining
- CNN Pretraining
- ViT Pretraining
- Hierarchical Model Pretraining
- Dense Prediction Model Pretraining
- Large Vision Model/Foundation Model Pretraining
- updating ···
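
As a quick orientation, below is a minimal, self-contained sketch (not drawn from any specific paper in this list) of the two loss families that dominate the entries: an InfoNCE-style contrastive objective (as in MoCo/SimCLR) and a masked-patch reconstruction objective (as in MAE/SimMIM). All tensor shapes, the temperature, and the masking ratio are illustrative placeholders, not values from any particular paper.

```python
# Illustrative sketch only: simplified versions of the two common self-supervised
# objectives referenced throughout this list. Shapes/hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.2):
    """Contrastive loss over two augmented views; row i of z1 matches row i of z2."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature           # (N, N) cosine-similarity matrix
    targets = torch.arange(z1.size(0))           # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

def masked_reconstruction_loss(pred_patches, target_patches, mask):
    """MIM-style loss: mean squared error computed only on masked patches."""
    loss = (pred_patches - target_patches) ** 2  # (N, L, patch_dim)
    loss = loss.mean(dim=-1)                     # per-patch error, (N, L)
    return (loss * mask).sum() / mask.sum()      # average over masked patches only

# Toy usage with random tensors standing in for encoder/decoder outputs.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
pred, target = torch.randn(8, 196, 768), torch.randn(8, 196, 768)
mask = (torch.rand(8, 196) < 0.75).float()       # ~75% masking ratio, MAE-like
print(info_nce_loss(z1, z2).item())
print(masked_reconstruction_loss(pred, target, mask).item())
```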
-
[MoCov1] 🌟 Momentum Contrast for Unsupervised Visual Representation Learning | [CVPR'20] |
[paper]
[code]
-
[SimCLRv1] 🌟 A Simple Framework for Contrastive Learning of Visual Representations | [ICML'20] |
[paper]
[code]
-
[MoCov2] Improved Baselines with Momentum Contrastive Learning | [arxiv'20] |
[paper]
[code]
-
[BYOL] Bootstrap your own latent: A new approach to self-supervised Learning | [NIPS'20] |
[paper]
[code]
-
[SimCLRv2] Big Self-Supervised Models are Strong Semi-Supervised Learners | [NIPS'20] |
[paper]
[code]
-
[SwAV] Unsupervised Learning of Visual Features by Contrasting Cluster Assignments | [NIPS'20] |
[paper]
[code]
-
[RELICv1] Representation Learning via Invariant Causal Mechanisms | [ICLR'21] |
[paper]
-
[CompRess] CompRess: Self-Supervised Learning by Compressing Representations | [NIPS'20] |
[paper]
[code]
-
[DenseCL] Dense Contrastive Learning for Self-Supervised Visual Pre-Training | [CVPR'21] |
[paper]
[code]
-
[SimSiam] 🌟 Exploring Simple Siamese Representation Learning | [CVPR'21] |
[paper]
[code]
-
[SEED] SEED: Self-supervised Distillation For Visual Representation | [ICLR'21] |
[paper]
[code]
-
[ALIGN] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | [ICML'21] |
[paper]
-
[CLIP] 🌟 Learning Transferable Visual Models From Natural Language Supervision | [ICML'21] |
[paper]
[code]
-
[Barlow Twins] Barlow Twins: Self-Supervised Learning via Redundancy Reduction | [ICML'21] |
[paper]
[code]
-
[S3L] Rethinking Self-Supervised Learning: Small is Beautiful | [arxiv'21] |
[paper]
[code]
-
[MoCov3] 🌟 An Empirical Study of Training Self-Supervised Vision Transformers | [ICCV'21] |
[paper]
[code]
-
[DisCo] DisCo: Remedy Self-supervised Learning on Lightweight Models with Distilled Contrastive Learning | [ECCV'22] |
[paper]
[code]
-
[DoGo] Distill on the Go: Online knowledge distillation in self-supervised learning | [CVPRW'21] |
[paper]
[code]
-
[DINOv1] 🌟 Emerging Properties in Self-Supervised Vision Transformers | [ICCV'21] |
[paper]
[code]
-
[VICReg] VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning | [ICLR'22] |
[paper]
[code]
-
[MST] MST: Masked Self-Supervised Transformer for Visual Representation | [NIPS'21] |
[paper]
-
[BEiTv1] 🌟 BEiT: BERT Pre-Training of Image Transformers | [ICLR'22] |
[paper]
[code]
-
[SimDis] Simple Distillation Baselines for Improving Small Self-supervised Models | [ICCVW'21] |
[paper]
[code]
-
[OSS] Unsupervised Representation Transfer for Small Networks: I Believe I Can Distill On-the-Fly | [NIPS'21] |
[paper]
-
[BINGO] Bag of Instances Aggregation Boosts Self-supervised Distillation | [ICLR'22] |
[paper]
[code]
-
[SSL-Small] On the Efficacy of Small Self-Supervised Contrastive Models without Distillation Signals | [AAAI'22] |
[paper]
[code]
-
[C-BYOL/C-SimLCR] Compressive Visual Representations | [NIPS'21] |
[paper]
[code]
-
[MAE] 🌟 Masked Autoencoders Are Scalable Vision Learners | [CVPR'22] |
[paper]
[code]
-
[iBOT] iBOT: Image BERT Pre-Training with Online Tokenizer | [ICLR'22] |
[paper]
[code]
-
[SimMIM] 🌟 SimMIM: A Simple Framework for Masked Image Modeling | [CVPR'22] |
[paper]
[code]
-
[PeCo] PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers | [AAAI'23] |
[paper]
-
[MaskFeat] Masked Feature Prediction for Self-Supervised Visual Pre-Training | [CVPR'22] |
[paper]
[code]
-
[RELICv2] Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? | [arxiv'22] |
[paper]
-
[SimReg] SimReg: Regression as a Simple Yet Effective Tool for Self-supervised Knowledge Distillation | [BMVC'21] |
[paper]
[code]
-
[RePre] RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training | [arxiv'22] |
[paper]
-
[CAEv1] Context Autoencoder for Self-Supervised Representation Learning | [arxiv'22] |
[paper]
[code]
-
[CIM] Corrupted Image Modeling for Self-Supervised Visual Pre-Training | [ICLR'23] |
[paper]
-
[MVP] MVP: Multimodality-guided Visual Pre-training | [ECCV'22] |
[paper]
-
[ConvMAE] ConvMAE: Masked Convolution Meets Masked Autoencoders | [NIPS'22] |
[paper]
[code]
-
[ConMIM] Masked Image Modeling with Denoising Contrast | [ICLR'23] |
[paper]
[code]
-
[MixMAE] MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers | [CVPR'23] |
[paper]
[code]
-
[A2MIM] Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | [ICML'23] |
[paper]
[code]
-
[FD] Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation | [arxiv'22] |
[paper]
[code]
-
[ObjMAE] Object-wise Masked Autoencoders for Fast Pre-training | [arxiv'22] |
[paper]
-
[MAE-Lite] A Closer Look at Self-Supervised Lightweight Vision Transformers | [ICML'23] |
[paper]
[code]
-
[SupMAE] SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners | [arxiv'22] |
[paper]
[code]
-
[HiViT] HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling | [ICLR'23] |
[paper]
[mmpretrain code]
-
[LoMaR] Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction | [arxiv'22] |
[paper]
[code]
-
[SIM] Siamese Image Modeling for Self-Supervised Vision Representation Learning | [CVPR'23] |
[paper]
[code]
-
[MFM] Masked Frequency Modeling for Self-Supervised Visual Pre-Training | [ICLR'23] |
[paper]
[code]
-
[BootMAE] Bootstrapped Masked Autoencoders for Vision BERT Pretraining | [ECCV'22] |
[paper]
[code]
-
[SatMAE] SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery | [NIPS'22] |
[paper]
[code]
-
[TinyViT] TinyViT: Fast Pretraining Distillation for Small Vision Transformers | [ECCV'22] |
[paper]
[code]
-
[CMAE] Contrastive Masked Autoencoders are Stronger Vision Learners | [arxiv'22] |
[paper]
[code]
-
[SMD] Improving Self-supervised Lightweight Model Learning via Hard-aware Metric Distillation | [ECCV'22] |
[paper]
[code]
-
[SdAE] SdAE: Self-distillated Masked Autoencoder | [ECCV'22] |
[paper]
[code]
-
[MILAN] MILAN: Masked Image Pretraining on Language Assisted Representation | [arxiv'22] |
[paper]
[code]
-
[BEiTv2] BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers | [arxiv'22] |
[paper]
[code]
-
[BEiTv3] Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | [CVPR'23] |
[paper]
[code]
-
[MaskCLIP] MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining | [CVPR'23] |
[paper]
[code]
-
[MimCo] MimCo: Masked Image Modeling Pre-training with Contrastive Teacher | [arxiv'22] |
[paper]
-
[VICRegL] VICRegL: Self-Supervised Learning of Local Visual Features | [NIPS'22] |
[paper]
[code]
-
[SSLight] Effective Self-supervised Pre-training on Low-compute Networks without Distillation | [ICLR'23] |
[paper]
[code]
-
[U-MAE] How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders | [NIPS'22] |
[paper]
[code]
-
[i-MAE] i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable? | [arxiv'22] |
[paper]
[code]
-
[CAN] A simple, efficient and scalable contrastive masked autoencoder for learning visual representations | [arxiv'22] |
[paper]
[code]
-
[EVA] EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | [CVPR'23] |
[paper]
[code]
-
[CAEv2] CAE v2: Context Autoencoder with CLIP Target | [arxiv'22] |
[paper]
-
[iTPN] Integrally Pre-Trained Transformer Pyramid Networks | [CVPR'23] |
[paper]
[code]
-
[SCFS] Semantics-Consistent Feature Search for Self-Supervised Visual Representation Learning | [ICCV'23] |
[paper]
[code]
-
[FastMIM] FastMIM: Expediting Masked Image Modeling Pre-training for Vision | [arxiv'22] |
[paper]
[code]
-
[Light-MoCo] Establishing a stronger baseline for lightweight contrastive models | [ICME'23] |
[paper]
[code]
[ICLR'23 under-review version]
-
[Scale-MAE] Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning | [ICCV'23] |
[paper]
-
[ConvNeXtv2] ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | [CVPR'23] |
[paper]
[code]
-
[Spark] Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling | [ICLR'23] |
[paper]
[code]
-
[I-JEPA] Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture | [CVPR'23] |
[paper]
[code]
-
[RoB] A Simple Recipe for Competitive Low-compute Self supervised Vision Models | [arxiv'23] |
[paper]
-
[Layer Grafted] Layer Grafted Pre-training: Bridging Contrastive Learning And Masked Image Modeling For Label-Efficient Representations | [ICLR'23] |
[paper]
[code]
-
[G2SD] Generic-to-Specific Distillation of Masked Autoencoders | [CVPR'23] |
[paper]
[code]
-
[PixMIM] PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling | [arxiv'23] |
[paper]
[code]
-
[LocalMIM] Masked Image Modeling with Local Multi-Scale Reconstruction | [CVPR'23] |
[paper]
[code]
-
[MR-MAE] Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking | [arxiv'23] |
[paper]
-
[Overcoming-Pretraining-Bias] Overwriting Pretrained Bias with Finetuning Data | [ICCV'23] |
[paper]
[code]
-
[EVA-02] EVA-02: A Visual Representation for Neon Genesis | [arxiv'23] |
[paper]
[code]
-
[EVA-CLIP] EVA-CLIP: Improved Training Techniques for CLIP at Scale | [arxiv'23] |
[paper]
[code]
-
[MixedAE] Mixed Autoencoder for Self-supervised Visual Representation Learning | [CVPR'23] |
[paper]
-
[EMP] EMP-SSL: Towards Self-Supervised Learning in One Training Epoch | [arxiv'23] |
[paper]
[code]
-
[DINOv2] DINOv2: Learning Robust Visual Features without Supervision | [arxiv'23] |
[paper]
[code]
-
[CL-vs-MIM] What Do Self-Supervised Vision Transformers Learn? | [ICLR'23] |
[paper]
[code]
-
[SiamMAE] Siamese Masked Autoencoders | [NIPS'23] |
[paper]
-
[ccMIM] Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining | [ICLR'23] |
[paper]
[code]
-
[DreamTeacher] DreamTeacher: Pretraining Image Backbones with Deep Generative Models | [ICCV'23] |
[paper]
-
[MFF] Improving Pixel-based MIM by Reducing Wasted Modeling Capability | [ICCV'23] |
[paper]
[code]
-
[DropPos] DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions | [NIPS'23] |
[paper]
[code]
-
[Registers] Vision Transformers Need Registers | [arxiv'23] |
[paper]
[code]
-
[MetaCLIP] Demystifying CLIP Data | [ICLR'24] |
[paper]
[code]
-
[AMD] Asymmetric Masked Distillation for Pre-Training Small Foundation Models | [CVPR'24] |
[paper]
[code]
-
[D-iGPT] Rejuvenating image-GPT as Strong Visual Representation Learners | [arxiv'23] |
[paper]
[code]
-
[SynCLR] Learning Vision from Models Rivals Learning Vision from Data | [arxiv'23] |
[paper]
[code]
-
[AIM] Scalable Pre-training of Large Autoregressive Image Models | [arxiv'24] |
[paper]
[code]
-
[CrossMAE] Rethinking Patch Dependence for Masked Autoencoders | [arxiv'24] |
[paper]
[code]
-
[Cross-Scale MAE] Cross-Scale MAE: A Tale of Multi-Scale Exploitation in Remote Sensing | [NIPS'23] |
[paper]
[code]
-
[EVA-CLIP-18B] EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters | [arxiv'24] |
[paper]
[code]
-
[MIM-Refiner] MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations | [arxiv'24] |
[paper]
[code]
-
[SatMAE++] Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery | [CVPR'24] |
[paper]
[code]
-
[Augmentations vs Algorithms] Augmentations vs Algorithms: What Works in Self-Supervised Learning | [arxiv'24] |
[paper]
-
[CropMAE] Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders | [ECCV'24] |
[paper]
[code]
-
[Retro] Retro: Reusing teacher projection head for efficient embedding distillation on Lightweight Models via Self-supervised Learning | [arxiv'24] |
[paper]
[ICLR'24 under-review version]
-
[ssl-data-curation] Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach | [arxiv'24] |
[paper]
[code]
-
[MaSSL] Learning from Memory: Non-Parametric Memory Augmented Self-Supervised Learning of Visual Features | [ICML'24] |
[paper]
[code]
-
[SINDER] SINDER: Repairing the Singular Defects of DINOv2 | [ECCV'24] |
[paper]
[code]
-
[MICM] MICM: Rethinking Unsupervised Pretraining for Enhanced Few-shot Learning | [ACM MM'24] |
[paper]
[code]
-
[dino.txt] DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment | [arxiv'24] |
[paper]