Why are my settings not applied? Each time I start afresh, the learning rate from the previous run is used #280
-
Hi,

I ran:

```
(base) hossein@hossein-pc:~/torchdistill$ python -m torch.distributed.launch --nproc_per_node=1 --use_env examples/image_classification.py --world_size 1 --config configs/sample/ilsvrc2012/single_stage/kd/simv1_05m_from_simv1_5m.yaml --log log/ilsvrc2012/kd/simvnet_from_simvnet_5m_part4.txt
```

It ran for some epochs, then I found out I had made a mistake somewhere and stopped the process. On the next try, however, I noticed the learning rate was not picked up from the config file; it seems to somehow fetch the learning rate from the previous run!

```
/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
2023/01/17 22:18:13 INFO torchdistill.common.main_util | distributed init (rank 0): env://
2023/01/17 22:18:13 INFO torch.distributed.distributed_c10d Added key: store_based_barrier_key:1 to store for rank: 0
2023/01/17 22:18:13 INFO torch.distributed.distributed_c10d Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
2023/01/17 22:18:15 INFO __main__ Namespace(config='configs/sample/ilsvrc2012/single_stage/kd/simv1_05m_from_simv1_5m.yaml', device='cuda', log='log/ilsvrc2012/kd/simvnet_from_simvnet_5m_part6.txt', start_epoch=0, seed=None, test_only=False, student_only=False, log_config=False, world_size=1, dist_url='env://', adjust_lr=False)
2023/01/17 22:18:15 INFO torchdistill.datasets.util Loading train data
2023/01/17 22:18:16 INFO torchdistill.datasets.util dataset_id `ImageNet_DataSet//train`: 1.3005778789520264 sec
2023/01/17 22:18:16 INFO torchdistill.datasets.util Loading val data
2023/01/17 22:18:16 INFO torchdistill.datasets.util dataset_id `ImageNet_DataSet//val`: 0.05885505676269531 sec
========smpnetv1_5m_m2=========
pretrained is True
=========================
drop_block_rate; 0.0
!!!!!! NO MODELS FOUND !!!!!
==========================
pretrained is True
pretrained_cfg: smpnetv1_5m_m2
model "smpnetv1_5m_m2" is loaded
2023/01/17 22:18:16 INFO torchdistill.common.main_util ckpt file is not found at `./resource/ckpt/ilsvrc2012/teacher/ImageNet_DataSet/-smpnetv1_5m_m2.pt`
drop_block_rate; 0.0
==========================
pretrained is False
pretrained_cfg: smpnetv1_05m_m1
2023/01/17 22:18:16 INFO torchdistill.common.main_util Loading model parameters
2023/01/17 22:18:16 INFO __main__ Start training
2023/01/17 22:18:16 INFO torchdistill.models.util [teacher model]
2023/01/17 22:18:16 INFO torchdistill.models.util Using the original teacher model
2023/01/17 22:18:16 INFO torchdistill.models.util [student model]
2023/01/17 22:18:16 INFO torchdistill.models.util Using the original student model
2023/01/17 22:18:16 INFO torchdistill.core.distillation Loss = 1.0 * OrgLoss
2023/01/17 22:18:16 INFO torchdistill.core.distillation Freezing the whole teacher model
2023/01/17 22:18:16 INFO torchdistill.common.main_util Loading optimizer parameters
2023/01/17 22:18:16 INFO torchdistill.common.main_util Loading scheduler parameters
2023/01/17 22:18:21 INFO torchdistill.misc.log Epoch: [0] [ 0/5005] eta: 6:39:29 lr: 0.0010000000000000002 img/s: 127.84080492874554 loss: 2.3775 (2.3775) time: 4.7891 data: 2.7862 max mem: 4494
```

Notice the learning rate here: it is 0.001, not the 0.1 set in the config. Here is my config:

```yaml
datasets:
  ilsvrc2012:
    name: &dataset_name 'ImageNet_DataSet/'
    type: 'ImageFolder'
    root: &root_dir !join ['/media/hossein/SSD_IMG/', *dataset_name]
    splits:
      train:
        dataset_id: &imagenet_train !join [*dataset_name, '/train']
        params:
          root: !join [*root_dir, '/train']
          transform_params:
            - type: 'RandomResizedCrop'
              params:
                size: &input_size [224, 224]
            - type: 'RandomHorizontalFlip'
              params:
                p: 0.5
            - &totensor
              type: 'ToTensor'
              params:
            - &normalize
              type: 'Normalize'
              params:
                mean: [0.485, 0.456, 0.406]
                std: [0.229, 0.224, 0.225]
      val:
        dataset_id: &imagenet_val !join [*dataset_name, '/val']
        params:
          root: !join [*root_dir, '/val']
          transform_params:
            - type: 'Resize'
              params:
                size: 256
            - type: 'CenterCrop'
              params:
                size: *input_size
            - *totensor
            - *normalize
models:
  teacher_model:
    name: &teacher_model_name 'smpnetv1_5m_m2'
    params:
      num_classes: 1000
      pretrained: True
    experiment: &teacher_experiment !join [*dataset_name, '-', *teacher_model_name]
    ckpt: !join ['./resource/ckpt/ilsvrc2012/teacher/', *teacher_experiment, '.pt']
  student_model:
    name: &student_model_name 'smpnetv1_05m_m1'
    params:
      num_classes: 1000
      pretrained: False
    experiment: &student_experiment !join [*dataset_name, '-', *student_model_name, '_from_', *teacher_model_name]
    ckpt: !join ['./imagenet/kd/', *student_experiment, '.pt']
train:
  log_freq: 1000
  num_epochs: 100
  train_data_loader:
    dataset_id: *imagenet_train
    random_sample: True
    batch_size: 256
    num_workers: 16
    cache_output:
  val_data_loader:
    dataset_id: *imagenet_val
    random_sample: False
    batch_size: 128
    num_workers: 16
  teacher:
    sequential: []
    wrapper: ''
    requires_grad: False
  student:
    adaptations:
    sequential: []
    wrapper: ''
    requires_grad: True
    frozen_modules: []
  optimizer:
    type: 'SGD'
    params:
      lr: 0.1
      momentum: 0.9
      weight_decay: 0.00001
  scheduler:
    type: 'MultiStepLR'
    params:
      milestones: [30, 60, 90]
      gamma: 0.1
  criterion:
    type: 'GeneralizedCustomLoss'
    org_term:
      criterion:
        type: 'KDLoss'
        params:
          temperature: 1.0
          alpha: 0.5
          reduction: 'batchmean'
      factor: 1.0
    sub_terms:
test:
  test_data_loader:
    dataset_id: *imagenet_val
    random_sample: False
    batch_size: 1
    num_workers: 16
```

This is a copy of the resnet18 config shipped with torchdistill, so nothing fancy here either!
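For reference, the student checkpoint itself can be inspected to see what state it carries. A minimal check, assuming the `.pt` file is a plain dict written by `torch.save` (the "Loading optimizer parameters" line in the log above suggests it holds more than model weights; the key names below are guesses, so adapt them to whatever `ckpt.keys()` prints):

```python
import torch

# Student checkpoint path as assembled by the `ckpt` entry in the config above.
ckpt = torch.load('./imagenet/kd/ImageNet_DataSet/-smpnetv1_05m_m1_from_smpnetv1_5m_m2.pt',
                  map_location='cpu')
print(ckpt.keys())  # shows which states (model/optimizer/scheduler/...) were saved

# If an optimizer state is present, its param_groups carry the saved lr.
# 'optimizer' is an assumed key name, not necessarily torchdistill's.
if 'optimizer' in ckpt:
    print(ckpt['optimizer']['param_groups'][0]['lr'])
```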
-
Hi @Coderx7

See the 2nd paragraph of #242 (comment).

You should delete the previous checkpoint or change the `ckpt` file path in the yaml config before you run the script. I'm also working on separate checkpoint configurations (one for initialization and the other for saving) as part of the next release.

Please check closed issues and discussions next time, before you create a new issue/discussion.
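To illustrate why deleting (or re-pointing) the checkpoint fixes this, here is a minimal sketch in plain PyTorch (not torchdistill's actual resume code): once the optimizer and scheduler state dicts from the old run are restored, they bring back the old learning rate, regardless of the `lr` in the freshly parsed config.

```python
import torch

model = torch.nn.Linear(10, 10)

# Fresh optimizer/scheduler built from the config (lr: 0.1, MultiStepLR).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

# Simulate a previous run that stepped past the 30- and 60-epoch milestones
# before its checkpoint was written.
for _ in range(61):
    optimizer.step()
    scheduler.step()
ckpt = {'optimizer': optimizer.state_dict(), 'lr_scheduler': scheduler.state_dict()}

# A new run builds everything from the config again, so lr starts at 0.1...
optimizer2 = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler2 = torch.optim.lr_scheduler.MultiStepLR(optimizer2, milestones=[30, 60, 90], gamma=0.1)

# ...but restoring the saved states overwrites that lr with the old one.
optimizer2.load_state_dict(ckpt['optimizer'])
scheduler2.load_state_dict(ckpt['lr_scheduler'])
print(optimizer2.param_groups[0]['lr'])  # ~0.001, not the 0.1 from the config
```

Incidentally, the 0.0010000000000000002 in your log is exactly 0.1 decayed twice by gamma=0.1 in floating point, which is consistent with a restored MultiStepLR/optimizer state rather than a freshly built one.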