Vision Transformer-based Dual-Stream Self-Supervised Pretraining Networks for Retinal OCT Classification
Mohammadreza Saraei¹, Dr. Igor Kozak, Dr. Eung-Joo Lee
Code [GitHub] | Data [MedMNISTv2] | Preprint [ArXiv] | Publication [Under Review in MIDL 2025]
Optical Coherence Tomography (OCT) is a non-invasive imaging modality essential for diagnosing various eye diseases. Despite its clinical significance, developing OCT-based diagnostic tools faces challenges, such as limited public datasets, sparse annotations, and privacy concerns. Although deep learning has made progress in automating OCT analysis, these challenges remain unresolved. To address these limitations, we introduce the Vision Transformer-based Dual-Stream Self-Supervised Pretraining Network (ViT-2SPN), a novel framework designed to enhance feature extraction and improve diagnostic accuracy. ViT-2SPN employs a three-stage workflow: Supervised Pretraining, Self-Supervised Pretraining (SSP), and Supervised Fine-Tuning. The pretraining phase leverages the OCTMNIST dataset (97,477 unlabeled images across four disease classes) with data augmentation to create dual-augmented views. A Vision Transformer (ViT-Base) backbone extracts features, while a negative cosine similarity loss aligns feature representations. Pretraining is conducted over 50 epochs with a learning rate of 0.0001 and momentum of 0.999. Fine-tuning is performed on a stratified 5.129% subset of OCTMNIST using 10-fold cross-validation. ViT-2SPN achieves a mean AUC of 0.93, accuracy of 0.77, precision of 0.81, recall of 0.75, and an F1 score of 0.76, outperforming existing SSP-based methods. These results underscore the robustness and clinical potential of ViT-2SPN in retinal OCT classification.
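As a concrete illustration of the objective described above, the sketch below shows one plausible form of the dual-stream loss: a symmetric negative cosine similarity between the features of the two augmented views. This is a minimal PyTorch sketch under our own assumptions; the function names, the stop-gradient placement, and the symmetric weighting are illustrative and are not taken from the released scripts.

```python
import torch
import torch.nn.functional as F

def negative_cosine_similarity(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity between predictions p and targets z.

    z is detached (stop-gradient), as is common in dual-stream SSP setups;
    whether ViT-2SPN applies a stop-gradient here is an assumption of this sketch.
    """
    p = F.normalize(p, dim=1)
    z = F.normalize(z.detach(), dim=1)
    return -(p * z).sum(dim=1).mean()

def dual_stream_loss(p1, z1, p2, z2):
    # Symmetric loss over the two augmented views of the same OCT image.
    return 0.5 * negative_cosine_similarity(p1, z2) + 0.5 * negative_cosine_similarity(p2, z1)
```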
During the SSP phase, the model is trained on the unlabeled OCTMNIST dataset, which comprises 97,477 training samples, using a mini-batch size of 128, a learning rate of 0.0001, and a momentum of 0.999 for a total of 50 epochs. The ViT-Base architecture, pretrained on ImageNet, serves as the backbone. In the fine-tuning phase, the model uses 5.129% of the labeled OCTMNIST dataset under a 10-fold cross-validation strategy: each fold consists of 4,500 training samples and 500 validation samples, with an additional 500 samples reserved for testing. Fine-tuning uses a batch size of 16, the same learning rate as in pretraining, a dropout rate of 0.5, and 50 epochs.
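For concreteness, the snippet below sketches how the fine-tuning stage described above might be wired up using the medmnist and timm packages. The optimizer choice (Adam), the exact preprocessing, and the omission of the stratified subset selection and 10-fold cross-validation are all simplifications of this sketch, not necessarily what finetune_vit2spn.py does.

```python
import timm
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from medmnist import OCTMNIST  # pip install medmnist

# Preprocessing: OCTMNIST images are grayscale, so replicate to 3 channels
# and resize to the 224x224 input expected by ViT-Base (assumed preprocessing).
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_set = OCTMNIST(split="train", transform=transform, download=True)
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)  # batch size 16, as above

# ViT-Base backbone with a 4-class head and dropout 0.5, fine-tuned at lr 1e-4 for 50 epochs.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=4, drop_rate=0.5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer choice is an assumption
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(50):
    for images, labels in train_loader:
        optimizer.zero_grad()
        logits = model(images)
        loss = criterion(logits, labels.squeeze(1).long())  # MedMNIST labels have shape (B, 1)
        loss.backward()
        optimizer.step()
```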
- ssp_vit2spn.py: Trains the self-supervised model using unlabeled images to extract meaningful features.
- finetune_vit2spn.py: Fine-tunes the pretrained model for classification tasks using labeled data.
Update the paths in the scripts to match your environment, then run either script as follows:
```bash
python ssp_vit2spn.py
python finetune_vit2spn.py
```
```bibtex
@article{saraei2025vit,
  title={ViT-2SPN: Vision Transformer-based Dual-Stream Self-Supervised Pretraining Networks for Retinal OCT Classification},
  author={Saraei, Mohammadreza and Kozak, Igor and Lee, Eung-Joo},
  journal={arXiv preprint arXiv:2501.17260},
  year={2025}
}
```
Footnotes

1. Please feel free to reach out if you have any questions: mrsaraei@arizona.edu