ViT-2SPN

Vision Transformer-based Dual-Stream Self-Supervised Pretraining Networks for Retinal OCT Classification

Mohammadreza Saraei 1, Dr. Igor Kozak (Website), Dr. Eung-Joo Lee (Website)

Code [GitHub] | Data [MedMNISTv2] | Preprint [ArXiv] | Publication [Under Review in MIDL 2025]

SSP Approach

Optical Coherence Tomography (OCT) is a non-invasive imaging modality essential for diagnosing various eye diseases. Despite its clinical significance, developing OCT-based diagnostic tools faces challenges, such as limited public datasets, sparse annotations, and privacy concerns. Although deep learning has made progress in automating OCT analysis, these challenges remain unresolved. To address these limitations, we introduce the Vision Transformer-based Dual-Stream Self-Supervised Pretraining Network (ViT-2SPN), a novel framework designed to enhance feature extraction and improve diagnostic accuracy. ViT-2SPN employs a three-stage workflow: Supervised Pretraining, Self-Supervised Pretraining (SSP), and Supervised Fine-Tuning. The pretraining phase leverages the OCTMNIST dataset (97,477 unlabeled images across four disease classes) with data augmentation to create dual-augmented views. A Vision Transformer (ViT-Base) backbone extracts features, while a negative cosine similarity loss aligns feature representations. Pretraining is conducted over 50 epochs with a learning rate of 0.0001 and momentum of 0.999. Fine-tuning is performed on a stratified 5.129% subset of OCTMNIST using 10-fold cross-validation. ViT-2SPN achieves a mean AUC of 0.93, accuracy of 0.77, precision of 0.81, recall of 0.75, and an F1 score of 0.76, outperforming existing SSP-based methods. These results underscore the robustness and clinical potential of ViT-2SPN in retinal OCT classification.
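
The core of the SSP objective is the negative cosine similarity between representations of the two augmented views. The snippet below is a minimal sketch of that loss in a BYOL/SimSiam-style symmetric form; it is illustrative only and not the repository's implementation, and the variable names (p1, z2, etc.) are assumptions.

import torch
import torch.nn.functional as F

def negative_cosine_similarity(p, z):
    # z is detached so gradients flow only through the online (prediction) stream
    p = F.normalize(p, dim=-1)
    z = F.normalize(z.detach(), dim=-1)
    return -(p * z).sum(dim=-1).mean()

# Example with random features for two augmented views of a batch of 128 images:
p1, p2 = torch.randn(128, 768), torch.randn(128, 768)  # online-stream predictions
z1, z2 = torch.randn(128, 768), torch.randn(128, 768)  # target-stream projections
loss = 0.5 * (negative_cosine_similarity(p1, z2) + negative_cosine_similarity(p2, z1))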

Data Samples (Class: Normal, Drusen, DME, CNV)
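
For reference, the OCTMNIST split described above can be loaded through the medmnist package (MedMNIST v2). The sketch below is a minimal example; the grayscale/resize transform is an illustrative assumption, not the exact augmentation pipeline used in the paper.

from medmnist import OCTMNIST
from torchvision import transforms

transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # ViT backbones expect 3-channel input
    transforms.Resize(224),
    transforms.ToTensor(),
])

train_set = OCTMNIST(split="train", transform=transform, download=True)
print(len(train_set))  # 97,477 training images across 4 classes (Normal, Drusen, DME, CNV)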

ViT-2SPN Architecture

Experimental Setup

During the SSP phase, the model utilizes the unlabeled OCTMNIST dataset, which comprises 97,477 training samples. Training is conducted with a mini-batch size of 128, a learning rate of 0.0001, and a momentum rate of 0.999, spanning a total of 50 epochs. The ViT-Base architecture, pretrained on the ImageNet dataset, is employed as the backbone. In the fine-tuning phase, the model leverages 5.129% of the labeled OCTMNIST dataset, following a 10-fold cross-validation strategy. Each fold consists of 4,500 training samples and 500 validation samples, with an additional 500 samples reserved for testing. Fine-tuning is carried out with a batch size of 16, the same learning rate as the pretraining phase, a dropout rate of 0.5, and 50 epochs.
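
As a rough guide, the sketch below puts the reported pretraining hyperparameters into code, assuming a timm ViT-Base backbone and an Adam optimizer (the optimizer choice is an assumption) and interpreting the momentum of 0.999 as the EMA coefficient of the target stream; it is illustrative, not the repository's training script.

import torch
import timm  # assumed source of the ImageNet-pretrained ViT-Base backbone

online = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
target = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
for p in target.parameters():
    p.requires_grad = False  # target stream is updated only by EMA, not by gradients

optimizer = torch.optim.Adam(online.parameters(), lr=1e-4)  # learning rate 0.0001
momentum = 0.999  # assumed EMA coefficient for the target stream

@torch.no_grad()
def update_target():
    # target <- momentum * target + (1 - momentum) * online (parameters only, for brevity)
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(momentum).add_(p_o, alpha=1 - momentum)

# SSP runs for 50 epochs with mini-batches of 128 dual-augmented views; fine-tuning
# then uses batch size 16, dropout 0.5, the same learning rate, and 10-fold
# cross-validation on the labeled 5.129% subset.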

Results

Performance Comparison | AUC Curve | Confusion Matrix

Performance Improvement

Commands

  • ssp_vit2spn.py: Trains the self-supervised model using unlabeled images to extract meaningful features.
  • finetune_vit2spn.py: Fine-tunes the pretrained model for classification tasks using labeled data.

Usage

Update the paths in the scripts to match your own setup, then run either script as follows:

python ssp_vit2spn.py
python finetune_vit2spn.py

Citation (BibTeX)

@article{saraei2025vit,
  title={ViT-2SPN: Vision Transformer-based Dual-Stream Self-Supervised Pretraining Networks for Retinal OCT Classification},
  author={Saraei, Mohammadreza and Kozak, Igor and Lee, Eung-Joo},
  journal={arXiv preprint arXiv:2501.17260},
  year={2025}
}

Footnotes

  1. Please feel free to reach out if you have any questions: mrsaraei@arizona.edu