Commit

code release
pritamqu committed Dec 24, 2023
1 parent 2cdd8af commit f540e4a
Showing 50 changed files with 10,819 additions and 6 deletions.
3 changes: 1 addition & 2 deletions .gitignore
@@ -3,5 +3,4 @@ __pycache__/
 src/
 checkpointing/
 weights/
-temp*
-codes/
+temp*
10 changes: 6 additions & 4 deletions README.MD
@@ -3,7 +3,7 @@ XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Represen
 </h1>
 
 <h3 align="center">
-Under Review.
+AAAI 2024
 </h3>
 <h3 align="center">
 <a href="https://www.pritamsarkar.com">Pritam Sarkar</a>
@@ -20,9 +20,10 @@ Ali Etemad
 
 ### Updates
 - [x] Paper
-- [ ] Pretrained model weights
+- [x] Pretrained model weights
 - [ ] Evaluation code
-- [ ] Training code
+- [x] Training code
+- [ ] More documentation
 
 #### ** Please check the project website for more details. The code will be released soon. You may follow this repo to receive future updates. **
 
@@ -31,7 +32,8 @@ Ali Etemad
 
 
 ### Abstract
-We present **XKD**, a novel self-supervised framework to learn meaningful representations from unlabelled video clips. XKD is trained with two pseudo tasks. First, masked data reconstruction is performed to learn modality-specific representations. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through teacher-student setups to learn complementary information. To identify the most effective information to transfer and also to tackle the domain gap between audio and visual modalities which could hinder knowledge transfer, we introduce a domain alignment strategy for effective cross-modal distillation. Lastly, to develop a general-purpose solution capable of handling both audio and visual streams, a modality-agnostic variant of our proposed framework is introduced, which uses the same backbone for both audio and visual modalities. Our proposed cross-modal knowledge distillation improves linear evaluation top-1 accuracy of video action classification by 8.4% on UCF101, 8.1% on HMDB51, 13.8% on Kinetics-Sound, and 14.2% on Kinetics400. Additionally, our modality-agnostic variant shows promising results in developing a general-purpose network capable of handling different data streams.
+We present XKD, a novel self-supervised framework to learn meaningful representations from unlabelled videos. XKD is trained with two pseudo objectives. First, masked data reconstruction is performed to learn modality-specific representations from audio and visual streams. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through a teacher-student setup to learn complementary information. We introduce a novel domain alignment strategy to tackle the domain discrepancy between audio and visual modalities, enabling effective cross-modal knowledge distillation.
+Additionally, to develop a general-purpose network capable of handling both audio and visual streams, modality-agnostic variants of XKD are introduced, which use the same pretrained backbone for different audio and visual tasks. Our proposed cross-modal knowledge distillation improves video action classification by 8% to 14% on UCF101, HMDB51, and Kinetics400. XKD also improves multimodal action classification by 5.5% on Kinetics-Sound, and shows state-of-the-art performance in sound classification on ESC50, achieving top-1 accuracy of 96.5%.
 
 
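To ground the abstract in code: below is a minimal, hedged sketch of the cross-modal distillation step with EMA teachers. The module and function names (`TinyEncoder`, `distill`) are toy placeholders, not the repository's actual API, and the masked-reconstruction and MMD domain-alignment terms are only referenced in comments.

```python
# Hedged sketch of XKD-style cross-modal distillation (placeholder code,
# not the repository's implementation).
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Toy stand-in for the ViT-style audio/video encoders."""
    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def distill(student_logits, teacher_logits, s_temp=0.1, t_temp=0.07):
    """Soft cross-entropy between teacher and student distributions
    (cf. student_temp / teacher_temp in the configs below)."""
    targets = F.softmax(teacher_logits / t_temp, dim=-1)
    log_probs = F.log_softmax(student_logits / s_temp, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()


# Students learn by gradient; teachers are frozen EMA copies.
vid_student, aud_student = TinyEncoder(512), TinyEncoder(128)
vid_teacher = copy.deepcopy(vid_student).requires_grad_(False)
aud_teacher = copy.deepcopy(aud_student).requires_grad_(False)

video, audio = torch.randn(4, 512), torch.randn(4, 128)  # toy features

# Cross-modal distillation: each modality's student is guided by the other
# modality's teacher. The full objective also adds masked reconstruction and
# an MMD domain-alignment term, weighted by recon_coeff / align_coeff /
# cmkd_coeff in the configs below.
with torch.no_grad():
    vid_targets = vid_teacher(video)
    aud_targets = aud_teacher(audio)
loss = distill(aud_student(audio), vid_targets) + distill(vid_student(video), aud_targets)
loss.backward()

# EMA teacher update (cf. vid_ema/aud_ema: base momentum 0.997 rising to 1).
with torch.no_grad():
    for t, s in zip(vid_teacher.parameters(), vid_student.parameters()):
        t.mul_(0.997).add_(s, alpha=0.003)
```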
150 changes: 150 additions & 0 deletions codes/train/configs/pretext/kinetics400/xkd.yaml
@@ -0,0 +1,150 @@
name: "XKD"
num_workers: 64 # divided by the number of GPUs in each node
num_node: 2 # effective batch size scales with num_node
apex: true # actually uses PyTorch AMP, not NVIDIA Apex
apex_opt_level: "O1" # O0 for FP32 training, O1 for mixed-precision training
sync_bn: true

progress:
print_freq: 100
log2tb: true
wandb: false
wandb_all: false
dataset:
name: "kinetics400"
fold: 1
batch_size: 128 # effective batch size = cfg['num_node'] * cfg['batch_size']
clip_duration: 4.0 # duration of global view
video_fps: 8.
crop_size: 112 # 112 or 224
return_video: true
return_audio: true
audio_clip_duration: 4.0 # duration of global view
audio_fps: 16000.
hop_length: 143 # 143 samples at 16 kHz ≈ 9 ms, i.e. a ~10 ms hop
audio_fps_out: 112 # 16000 / 143 ≈ 112 spectrogram frames/sec; 4 s -> 448 frames, matching the ?x16 patches (see the sanity-check sketch after this file)
n_mels: 80 # ignored when using a log-spectrogram instead of a mel-spectrogram
n_fft: 1024
vid_transform: "global_local" # options: global_local | strong_tc | strong_tr; combines [RandomResizedCrop, RandomHorizontalFlip, ColorJitter, RandomGray, RandomGaussianBlur, Cutout]
aud_transform: "global_local"
train:
split: "train"
mode: "clip" # clip | video | global_local
clips_per_video: 1
aug_mode: 'train'
use_shuffle: true
drop_last: true
vid_aug_kwargs:
temporal_ratio: 4 # 32->8
spatial_ratio: 1.16 # 112-> 96
num_local_views: 1
global:
color: [0.4, 0.4, 0.4, 0.2]
crop_scale: [0.2, 1.]
p_flip: 0.5 # set to 0 to turn off
p_gray: 0.0 # set to 0 to turn off
p_blur: 0.0 # set to 0 to turn off
pad_missing: false # set pad_missing to false
normalize: true
totensor: true
local:
color: [0.4, 0.4, 0.4, 0.2]
crop_scale: [0.08, 0.4]
p_flip: 0.5 # set to 0 to turn off
p_gray: 0.2 # set to 0 to turn off
p_blur: 0.5 # set to 0 to turn off
pad_missing: false # set pad_missing to false
normalize: true
totensor: true
aud_aug_kwargs:
temporal_ratio: 4 # local clip duration = clip_duration / temporal_ratio
num_local_views: 1
global:
vol: 0.1 # jitter range between -vol and +vol
wrap_window: 0 # actual size 100 == 1 sec of audio
voljitter: true # change it to false to turn off
timewarp: false # change it to false to turn off
randcrop: false # change it to false to turn off
normalize: true
trim_pad: false # set trim_pad to false
local:
vol: 0.2 # jitter range between -vol and +vol
wrap_window: 0 # actual size 100 == 1 sec of audio
virtual_crop_scale: [1.0, 1.5]
freq_scale: [0.6, 1.5]
time_scale: [0.6, 1.5]
voljitter: true # change it to false to turn off
timewarp: false # change it to false to turn off
randcrop: true # change it to false to turn off
normalize: true
trim_pad: false # set trim_pad to false

hyperparams:
num_epochs: 800 # longer training
optimizer:
name: "adamw"
betas: [0.9, 0.95]
lr:
name: "cosine"
warmup_epochs: 30
warmup_lr: 0
base_lr: 0.0001 # for batch of 256
final_lr: 0.0
weight_decay:
name: "cosine"
warmup_epochs: 0
warmup: 0
base: 0.3
final: 0.3
vid_ema:
name: "cosine"
warmup_epochs: 0
warmup: 0
base: 0.997
final: 1
aud_ema:
name: "cosine"
warmup_epochs: 0
warmup: 0
base: 0.997
final: 1
model:
name: "XKD"
kwargs: # keep these consistent with the dataset settings above
frame_size: [3, 112, 112]
num_frames: 32
vid_patch_spatial: [16, 16]
vid_patch_temporal: 4
spec_size: [80, 448]
spec_patch_spatial: [4, 16]
apply_cls_token: true
teacher_cfg: 'base_encoder'
student_cfg: 'base_encoder'
decoder_cfg: 'base_decoder'
projector_cfg: '2048-gelu-3-256-8192-norm3'
center_momentum: [0.9, 0.9] # [vid, aud]
norm_pix_loss: true
masking_fn: 'random_masking'
align_loss: 'mmd'
video_temp_kwargs:
warmup_teacher_temp: 0.1 # 0.09
warmup_teacher_temp_epochs: 30
teacher_temp: 0.1
student_temp: 0.1
audio_temp_kwargs:
warmup_teacher_temp: 0.07 # 0.09
warmup_teacher_temp_epochs: 30
teacher_temp: 0.07
student_temp: 0.1
fwd_kwargs:
global_views_number: 1
vid_mask_ratio: 0.85
aud_mask_ratio: 0.80
cm_attn_mode: 'mean' # mean or softmax
align_mode: 1 # options: 't2s', 't2t', 'both'
align_coeff: 1
cmkd_coeff: 1
recon_coeff: 5
clip_grad: 0.3 # maximal parameter gradient norm for clipping; norms of 0.3-1.0 can help optimization for larger ViT architectures; 0 disables clipping
freeze_last_layer: 0 # number of epochs the output layer is kept fixed; doing so during the first epoch typically helps training; try increasing this value if the loss does not decrease
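Several comments in this config assert derived quantities (effective batch size, frame count, spectrogram geometry). A standalone sanity check of that arithmetic, written for this page rather than taken from the repository:

```python
# Sanity-check the derived values implied by the config above (standalone sketch).
num_node, batch_size = 2, 128
assert num_node * batch_size == 256  # effective batch; cf. "base_lr ... for batch of 256"

clip_duration, video_fps, num_frames = 4.0, 8.0, 32
assert clip_duration * video_fps == num_frames  # 4 s at 8 fps -> 32 frames

audio_fps, hop_length = 16000, 143
frames_per_sec = audio_fps / hop_length              # ~111.9, i.e. audio_fps_out = 112
assert round(frames_per_sec) == 112
assert round(frames_per_sec * clip_duration) == 448  # time axis of spec_size [80, 448]

# Patch grids: video (112/16)^2 spatial x (32/4) temporal = 49 x 8 = 392 tokens;
# audio (80/4) x (448/16) = 20 x 28 = 560 tokens.
crop_size = 112
vid_tokens = (crop_size // 16) ** 2 * (num_frames // 4)
aud_tokens = (80 // 4) * (448 // 16)
print(vid_tokens, aud_tokens)  # 392 560

# With vid_mask_ratio 0.85 and aud_mask_ratio 0.80, masked reconstruction
# leaves roughly 59 video tokens and 112 audio tokens visible per clip.
```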

150 changes: 150 additions & 0 deletions codes/train/configs/pretext/kinetics400/xkd_mas.yaml
@@ -0,0 +1,150 @@
name: "XKD_MAS"
num_workers: 64 # divided by the number of GPUs in each node
num_node: 2 # effective batch size scales with num_node
apex: true # actually uses PyTorch AMP, not NVIDIA Apex
apex_opt_level: "O1" # O0 for FP32 training, O1 for mixed-precision training
sync_bn: true

progress:
print_freq: 100
log2tb: true
wandb: false
wandb_all: false
dataset:
name: "kinetics400"
fold: 1
batch_size: 128 # effective batch size = cfg['num_node'] * cfg['batch_size']
clip_duration: 4.0 # duration of global view
video_fps: 8.
crop_size: 112 # 112 or 224
return_video: true
return_audio: true
audio_clip_duration: 4.0 # duration of global view
audio_fps: 16000.
hop_length: 143 # 143 samples at 16 kHz ≈ 9 ms, i.e. a ~10 ms hop
audio_fps_out: 112 # 16000 / 143 ≈ 112 spectrogram frames/sec; 4 s -> 448 frames, matching the ?x16 patches (cf. the sanity-check sketch after the first config)
n_mels: 80 # ignored when using a log-spectrogram instead of a mel-spectrogram
n_fft: 1024
vid_transform: "global_local" # options: global_local | strong_tc | strong_tr; combines [RandomResizedCrop, RandomHorizontalFlip, ColorJitter, RandomGray, RandomGaussianBlur, Cutout]
aud_transform: "global_local"
train:
split: "train"
mode: "clip" # clip | video | global_local
clips_per_video: 1
aug_mode: 'train'
use_shuffle: true
drop_last: true
vid_aug_kwargs:
temporal_ratio: 4 # 32->8
spatial_ratio: 1.16 # 112-> 96
num_local_views: 3
global:
color: [0.4, 0.4, 0.4, 0.2]
crop_scale: [0.2, 1.]
p_flip: 0.5 # set to 0 to turn off
p_gray: 0.0 # set to 0 to turn off
p_blur: 0.0 # set to 0 to turn off
pad_missing: false # set pad_missing to false
normalize: true
totensor: true
local:
color: [0.4, 0.4, 0.4, 0.2]
crop_scale: [0.08, 0.4]
p_flip: 0.5 # set to 0 to turn off
p_gray: 0.2 # set to 0 to turn off
p_blur: 0.5 # set to 0 to turn off
pad_missing: false # set pad_missing to false
normalize: true
totensor: true
aud_aug_kwargs:
temporal_ratio: 4 # local clip duration = clip_duration / temporal_ratio
num_local_views: 1
global:
vol: 0.1 # jitter range between -vol and +vol
wrap_window: 0 # actual size 100 == 1 sec of audio
voljitter: true # change it to false to turn off
timewarp: false # change it to false to turn off
randcrop: false # change it to false to turn off
normalize: true
trim_pad: false # set trim_pad to false
local:
vol: 0.2 # jitter range between -vol and +vol
wrap_window: 0 # actual size 100 == 1 sec of audio
virtual_crop_scale: [1.0, 1.5]
freq_scale: [0.6, 1.5]
time_scale: [0.6, 1.5]
voljitter: true # change it to false to turn off
timewarp: false # change it to false to turn off
randcrop: true # change it to false to turn off
normalize: true
trim_pad: false # set trim_pad to false

hyperparams:
num_epochs: 800 # longer training
optimizer:
name: "adamw"
betas: [0.9, 0.95]
lr:
name: "cosine"
warmup_epochs: 30
warmup_lr: 0
base_lr: 0.0001 # for batch of 256
final_lr: 0.0
weight_decay:
name: "cosine"
warmup_epochs: 0
warmup: 0
base: 0.3
final: 0.3
vid_ema:
name: "cosine"
warmup_epochs: 0
warmup: 0
base: 0.997
final: 1
aud_ema:
name: "cosine"
warmup_epochs: 0
warmup: 0
base: 0.997
final: 1
model:
name: "XKD_MAS"
kwargs: # keep these consistent with the dataset settings above
frame_size: [3, 112, 112]
num_frames: 32
vid_patch_spatial: [16, 16]
vid_patch_temporal: 4
spec_size: [80, 448]
spec_patch_spatial: [4, 16]
apply_cls_token: true
teacher_cfg: 'base_encoder'
student_cfg: 'base_encoder'
decoder_cfg: 'base_decoder'
projector_cfg: '2048-gelu-3-256-8192-norm3'
center_momentum: [0.9, 0.9] # [vid, aud]
norm_pix_loss: true
masking_fn: 'random_masking'
align_loss: 'mmd'
video_temp_kwargs:
warmup_teacher_temp: 0.09 # 0.09
warmup_teacher_temp_epochs: 30
teacher_temp: 0.11
student_temp: 0.1
audio_temp_kwargs:
warmup_teacher_temp: 0.04 # 0.09
warmup_teacher_temp_epochs: 30
teacher_temp: 0.06
student_temp: 0.1
fwd_kwargs:
global_views_number: 1
vid_mask_ratio: 0.85
aud_mask_ratio: 0.80
cm_attn_mode: 'mean' # mean or softmax
align_mode: 1 # options: 't2s', 't2t', 'both'
align_coeff: 1
cmkd_coeff: 1
recon_coeff: 5
clip_grad: 0.3 # maximal parameter gradient norm for clipping; norms of 0.3-1.0 can help optimization for larger ViT architectures; 0 disables clipping
freeze_last_layer: 0 # number of epochs the output layer is kept fixed; doing so during the first epoch typically helps training; try increasing this value if the loss does not decrease
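Both configs drive the learning rate, weight decay, and teacher EMA momentum through the same warmup-plus-cosine pattern named in their `hyperparams` blocks. Below is an illustrative implementation of that schedule, not the repository's scheduler; the teacher-temperature warmup is assumed to be the usual linear ramp, given the DINO-style parameter names.

```python
# Illustrative warmup + cosine schedule (assumed semantics, not repo code).
import math


def cosine_schedule(epoch, total_epochs, base, final, warmup_epochs=0, warmup_start=0.0):
    """Linear warmup from warmup_start to base, then cosine from base to final."""
    if epoch < warmup_epochs:
        return warmup_start + (base - warmup_start) * epoch / warmup_epochs
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return final + 0.5 * (base - final) * (1.0 + math.cos(math.pi * t))


EPOCHS = 800
lr = [cosine_schedule(e, EPOCHS, base=1e-4, final=0.0, warmup_epochs=30) for e in range(EPOCHS)]
wd = [cosine_schedule(e, EPOCHS, base=0.3, final=0.3) for e in range(EPOCHS)]     # constant 0.3
ema = [cosine_schedule(e, EPOCHS, base=0.997, final=1.0) for e in range(EPOCHS)]  # 0.997 -> 1.0

print(lr[0], lr[30], round(lr[-1], 8))  # 0.0 (warmup_lr), 1e-4 (base_lr), ~0.0 (final_lr)

# Teacher temperatures appear to follow a warmup-only variant: a linear ramp
# from warmup_teacher_temp to teacher_temp over warmup_teacher_temp_epochs,
# then held fixed (e.g. 0.09 -> 0.11 for video in xkd_mas.yaml).
```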
