Vision Transformer-based Dual-Stream Self-Supervised Pretraining Networks for Retinal OCT Classification
Mohammadreza Saraei¹, Dr. Igor Kozak, Dr. Eung-Joo Lee
Code [GitHub] | Data [MedMNISTv2] | Preprint [ArXiv] | Publication [Under Review in MIDL 2025]
Optical Coherence Tomography (OCT) is a non-invasive imaging modality essential for diagnosing various eye diseases. Despite its clinical significance, developing OCT-based diagnostic tools faces challenges, such as limited public datasets, sparse annotations, and privacy concerns. Although deep learning has made progress in automating OCT analysis, these challenges remain unresolved. To address these limitations, we introduce the Vision Transformer-based Dual-Stream Self-Supervised Pretraining Network (ViT-2SPN), a novel framework designed to enhance feature extraction and improve diagnostic accuracy. ViT-2SPN employs a three-stage workflow: Supervised Pretraining, Self-Supervised Pretraining (SSP), and Supervised Fine-Tuning. The pretraining phase leverages the OCTMNIST dataset (97,477 unlabeled images across four disease classes) with data augmentation to create dual-augmented views. A Vision Transformer (ViT-Base) backbone extracts features, while a negative cosine similarity loss aligns feature representations. Pretraining is conducted over 50 epochs with a learning rate of 0.0001 and momentum of 0.999. Fine-tuning is performed on a stratified 5.129% subset of OCTMNIST using 10-fold cross-validation. ViT-2SPN achieves a mean AUC of 0.93, accuracy of 0.77, precision of 0.81, recall of 0.75, and an F1 score of 0.76, outperforming existing SSP-based methods. These results underscore the robustness and clinical potential of ViT-2SPN in retinal OCT classification.
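As a concrete illustration of the objective described above, the sketch below shows one plausible form of the dual-stream loss: a symmetric negative cosine similarity between the features of the two augmented views. This is a minimal PyTorch sketch under our own assumptions; the function names, the stop-gradient placement, and the symmetric weighting are illustrative and are not taken from the released scripts.

```python
import torch
import torch.nn.functional as F

def negative_cosine_similarity(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity between predictions p and targets z.

    z is detached (stop-gradient), as is common in dual-stream SSP setups;
    whether ViT-2SPN applies a stop-gradient here is an assumption of this sketch.
    """
    p = F.normalize(p, dim=1)
    z = F.normalize(z.detach(), dim=1)
    return -(p * z).sum(dim=1).mean()

def dual_stream_loss(p1, z1, p2, z2):
    # Symmetric loss over the two augmented views of the same OCT image.
    return 0.5 * negative_cosine_similarity(p1, z2) + 0.5 * negative_cosine_similarity(p2, z1)
```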
During the SSP phase, the model is trained on the unlabeled OCTMNIST dataset, which comprises 97,477 training samples, using a mini-batch size of 128, a learning rate of 0.0001, and a momentum of 0.999 for a total of 50 epochs. The ViT-Base architecture, pretrained on ImageNet, serves as the backbone. In the fine-tuning phase, the model uses 5.129% of the labeled OCTMNIST dataset under a 10-fold cross-validation strategy: each fold consists of 4,500 training samples and 500 validation samples, with an additional 500 samples reserved for testing. Fine-tuning uses a batch size of 16, the same learning rate as in pretraining, a dropout rate of 0.5, and 50 epochs.
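For concreteness, the snippet below sketches how the fine-tuning stage described above might be wired up using the medmnist and timm packages. The optimizer choice (Adam), the exact preprocessing, and the omission of the stratified subset selection and 10-fold cross-validation are all simplifications of this sketch, not necessarily what finetune_vit2spn.py does.

```python
import timm
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from medmnist import OCTMNIST  # pip install medmnist

# Preprocessing: OCTMNIST images are grayscale, so replicate to 3 channels
# and resize to the 224x224 input expected by ViT-Base (assumed preprocessing).
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_set = OCTMNIST(split="train", transform=transform, download=True)
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)  # batch size 16, as above

# ViT-Base backbone with a 4-class head and dropout 0.5, fine-tuned at lr 1e-4 for 50 epochs.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=4, drop_rate=0.5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer choice is an assumption
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(50):
    for images, labels in train_loader:
        optimizer.zero_grad()
        logits = model(images)
        loss = criterion(logits, labels.squeeze(1).long())  # MedMNIST labels have shape (B, 1)
        loss.backward()
        optimizer.step()
```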
- ssp_vit2spn.py: Trains the self-supervised model using unlabeled images to extract meaningful features.
- finetune_vit2spn.py: Fine-tunes the pretrained model for classification tasks using labeled data.
Update the paths in the scripts to match your environment, then run either script as follows:
```bash
python ssp_vit2spn.py
python finetune_vit2spn.py
```
```bibtex
@article{saraei2025vit,
  title={ViT-2SPN: Vision Transformer-based Dual-Stream Self-Supervised Pretraining Networks for Retinal OCT Classification},
  author={Saraei, Mohammadreza and Kozak, Igor and Lee, Eung-Joo},
  journal={arXiv preprint arXiv:2501.17260},
  year={2025}
}
```
Footnotes

1. Please feel free to reach out if you have any questions: mrsaraei@arizona.edu