Why are my settings not applied? Each time I start afresh, the learning rate from the previous run is used #280
-
Hi,

I ran:

```
(base) hossein@hossein-pc:~/torchdistill$ python -m torch.distributed.launch --nproc_per_node=1 --use_env examples/image_classification.py --world_size 1 --config configs/sample/ilsvrc2012/single_stage/kd/simv1_05m_from_simv1_5m.yaml --log log/ilsvrc2012/kd/simvnet_from_simvnet_5m_part4.txt
```

It ran for some epochs, then I found out I had made a mistake somewhere and stopped the process. On the next try, however, I noticed the learning rate was not picked up from the config file; it seems to somehow fetch the learning rate from the previous run!

```
/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
2023/01/17 22:18:13 INFO torchdistill.common.main_util | distributed init (rank 0): env://
2023/01/17 22:18:13 INFO torch.distributed.distributed_c10d Added key: store_based_barrier_key:1 to store for rank: 0
2023/01/17 22:18:13 INFO torch.distributed.distributed_c10d Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
2023/01/17 22:18:15 INFO __main__ Namespace(config='configs/sample/ilsvrc2012/single_stage/kd/simv1_05m_from_simv1_5m.yaml', device='cuda', log='log/ilsvrc2012/kd/simvnet_from_simvnet_5m_part6.txt', start_epoch=0, seed=None, test_only=False, student_only=False, log_config=False, world_size=1, dist_url='env://', adjust_lr=False)
2023/01/17 22:18:15 INFO torchdistill.datasets.util Loading train data
2023/01/17 22:18:16 INFO torchdistill.datasets.util dataset_id `ImageNet_DataSet//train`: 1.3005778789520264 sec
2023/01/17 22:18:16 INFO torchdistill.datasets.util Loading val data
2023/01/17 22:18:16 INFO torchdistill.datasets.util dataset_id `ImageNet_DataSet//val`: 0.05885505676269531 sec
========smpnetv1_5m_m2=========
pretrained is True
=========================
drop_block_rate; 0.0
!!!!!! NO MODELS FOUND !!!!!
==========================
pretrained is True
pretrained_cfg: smpnetv1_5m_m2
model "smpnetv1_5m_m2" is loaded
2023/01/17 22:18:16 INFO torchdistill.common.main_util ckpt file is not found at `./resource/ckpt/ilsvrc2012/teacher/ImageNet_DataSet/-smpnetv1_5m_m2.pt`
drop_block_rate; 0.0
==========================
pretrained is False
pretrained_cfg: smpnetv1_05m_m1
2023/01/17 22:18:16 INFO torchdistill.common.main_util Loading model parameters
2023/01/17 22:18:16 INFO __main__ Start training
2023/01/17 22:18:16 INFO torchdistill.models.util [teacher model]
2023/01/17 22:18:16 INFO torchdistill.models.util Using the original teacher model
2023/01/17 22:18:16 INFO torchdistill.models.util [student model]
2023/01/17 22:18:16 INFO torchdistill.models.util Using the original student model
2023/01/17 22:18:16 INFO torchdistill.core.distillation Loss = 1.0 * OrgLoss
2023/01/17 22:18:16 INFO torchdistill.core.distillation Freezing the whole teacher model
2023/01/17 22:18:16 INFO torchdistill.common.main_util Loading optimizer parameters
2023/01/17 22:18:16 INFO torchdistill.common.main_util Loading scheduler parameters
2023/01/17 22:18:21 INFO torchdistill.misc.log Epoch: [0] [ 0/5005] eta: 6:39:29 lr: 0.0010000000000000002 img/s: 127.84080492874554 loss: 2.3775 (2.3775) time: 4.7891 data: 2.7862 max mem: 4494
```

Notice the learning rate here: it is 0.001, not the 0.1 set in the config. Here is my config:

```yaml
datasets:
  ilsvrc2012:
    name: &dataset_name 'ImageNet_DataSet/'
    type: 'ImageFolder'
    root: &root_dir !join ['/media/hossein/SSD_IMG/', *dataset_name]
    splits:
      train:
        dataset_id: &imagenet_train !join [*dataset_name, '/train']
        params:
          root: !join [*root_dir, '/train']
          transform_params:
            - type: 'RandomResizedCrop'
              params:
                size: &input_size [224, 224]
            - type: 'RandomHorizontalFlip'
              params:
                p: 0.5
            - &totensor
              type: 'ToTensor'
              params:
            - &normalize
              type: 'Normalize'
              params:
                mean: [0.485, 0.456, 0.406]
                std: [0.229, 0.224, 0.225]
      val:
        dataset_id: &imagenet_val !join [*dataset_name, '/val']
        params:
          root: !join [*root_dir, '/val']
          transform_params:
            - type: 'Resize'
              params:
                size: 256
            - type: 'CenterCrop'
              params:
                size: *input_size
            - *totensor
            - *normalize
models:
  teacher_model:
    name: &teacher_model_name 'smpnetv1_5m_m2'
    params:
      num_classes: 1000
      pretrained: True
    experiment: &teacher_experiment !join [*dataset_name, '-', *teacher_model_name]
    ckpt: !join ['./resource/ckpt/ilsvrc2012/teacher/', *teacher_experiment, '.pt']
  student_model:
    name: &student_model_name 'smpnetv1_05m_m1'
    params:
      num_classes: 1000
      pretrained: False
    experiment: &student_experiment !join [*dataset_name, '-', *student_model_name, '_from_', *teacher_model_name]
    ckpt: !join ['./imagenet/kd/', *student_experiment, '.pt']
train:
  log_freq: 1000
  num_epochs: 100
  train_data_loader:
    dataset_id: *imagenet_train
    random_sample: True
    batch_size: 256
    num_workers: 16
    cache_output:
  val_data_loader:
    dataset_id: *imagenet_val
    random_sample: False
    batch_size: 128
    num_workers: 16
  teacher:
    sequential: []
    wrapper: ''
    requires_grad: False
  student:
    adaptations:
    sequential: []
    wrapper: ''
    requires_grad: True
    frozen_modules: []
  optimizer:
    type: 'SGD'
    params:
      lr: 0.1
      momentum: 0.9
      weight_decay: 0.00001
  scheduler:
    type: 'MultiStepLR'
    params:
      milestones: [30, 60, 90]
      gamma: 0.1
  criterion:
    type: 'GeneralizedCustomLoss'
    org_term:
      criterion:
        type: 'KDLoss'
        params:
          temperature: 1.0
          alpha: 0.5
          reduction: 'batchmean'
      factor: 1.0
    sub_terms:
test:
  test_data_loader:
    dataset_id: *imagenet_val
    random_sample: False
    batch_size: 1
    num_workers: 16
```

This is a copy of the resnet18 config shipped with torchdistill, so nothing fancy here either!
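For reference, the student checkpoint itself can be inspected to see what state it carries. A minimal check, assuming the `.pt` file is a plain dict written by `torch.save` (the "Loading optimizer parameters" line in the log above suggests it holds more than model weights; the key names below are guesses, so adapt them to whatever `ckpt.keys()` prints):

```python
import torch

# Student checkpoint path as assembled by the `ckpt` entry in the config above.
ckpt = torch.load('./imagenet/kd/ImageNet_DataSet/-smpnetv1_05m_m1_from_smpnetv1_5m_m2.pt',
                  map_location='cpu')
print(ckpt.keys())  # shows which states (model/optimizer/scheduler/...) were saved

# If an optimizer state is present, its param_groups carry the saved lr.
# 'optimizer' is an assumed key name, not necessarily torchdistill's.
if 'optimizer' in ckpt:
    print(ckpt['optimizer']['param_groups'][0]['lr'])
```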
-
Hi @Coderx7

See the 2nd paragraph of #242 (comment).

You should delete the previous checkpoint or change the `ckpt` file path in the yaml config before you run the script. I'm also working on separate checkpoint configurations (one for initialization and the other for saving) as part of the next release.

Please check closed issues and discussions next time, before you create a new issue/discussion.
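To illustrate why deleting (or re-pointing) the checkpoint fixes this, here is a minimal sketch in plain PyTorch (not torchdistill's actual resume code): once the optimizer and scheduler state dicts from the old run are restored, they bring back the old learning rate, regardless of the `lr` in the freshly parsed config.

```python
import torch

model = torch.nn.Linear(10, 10)

# Fresh optimizer/scheduler built from the config (lr: 0.1, MultiStepLR).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

# Simulate a previous run that stepped past the 30- and 60-epoch milestones
# before its checkpoint was written.
for _ in range(61):
    optimizer.step()
    scheduler.step()
ckpt = {'optimizer': optimizer.state_dict(), 'lr_scheduler': scheduler.state_dict()}

# A new run builds everything from the config again, so lr starts at 0.1...
optimizer2 = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler2 = torch.optim.lr_scheduler.MultiStepLR(optimizer2, milestones=[30, 60, 90], gamma=0.1)

# ...but restoring the saved states overwrites that lr with the old one.
optimizer2.load_state_dict(ckpt['optimizer'])
scheduler2.load_state_dict(ckpt['lr_scheduler'])
print(optimizer2.param_groups[0]['lr'])  # ~0.001, not the 0.1 from the config
```

Incidentally, the 0.0010000000000000002 in your log is exactly 0.1 decayed twice by gamma=0.1 in floating point, which is consistent with a restored MultiStepLR/optimizer state rather than a freshly built one.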