pip install video2dataset
or
git clone https://github.com/iejMac/video2dataset
cd video2dataset
pip install -e .
wget -O hdvila100m.zip "https://hdvila.blob.core.windows.net/dataset/hdvila100m.zip?sp=r&st=2022-06-28T03:33:11Z&se=2026-01-01T11:33:11Z&spr=https&sv=2021-06-08&sr=b&sig=VaqQkLFDqKinfkaPNs1jJ1EQIYCB%2FUPYiqFqmjWye6Y%3D"
Then unzip the metadata zip file.
unzip hdvila100m.zip
With the metadata unpacked, convert it into a parquet file by running:
python makeparquet.py
Once you run this, you should have a file hd_vila.parquet
with all the relevant metadata. The files are organized as:
data
├── caption_config
├── model
├── scripts
├── utils
├── makeparquet.py
├── config.yaml
├── download_hdvila.sh
├── hdvila
│ ├── hdvila_part0.jsonl
│ ├── hdvila_part1.jsonl
│ ├── hdvila_part2.jsonl
│ ├── hdvila_part3.jsonl
│ ├── hdvila_part4.jsonl
│ ├── hdvila_part5.jsonl
│ ├── hdvila_part6.jsonl
│ ├── hdvila_part7.jsonl
│ ├── hdvila_part8.jsonl
│ ├── hdvila_part9.jsonl
│ ├── hdvila_part10.jsonl
│ ├── hd_vila.parquet
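Before starting the download, it can help to confirm that the metadata layout shown above is complete. The following is a minimal sketch, not part of the repository: `check_hdvila_layout` is a hypothetical helper, and it assumes the eleven `hdvila_part*.jsonl` files and `hd_vila.parquet` sit in an `hdvila` directory under the path you pass in, as in the tree above.

```shell
# Hypothetical helper (not shipped with the repository).
# Usage: check_hdvila_layout <dir containing hdvila/>
# Prints every missing file to stderr and returns non-zero if any is missing.
check_hdvila_layout() (
    root="${1:-.}"
    missing=0
    i=0
    # The tree above lists hdvila_part0.jsonl through hdvila_part10.jsonl.
    while [ "$i" -le 10 ]; do
        f="$root/hdvila/hdvila_part${i}.jsonl"
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            missing=$((missing + 1))
        fi
        i=$((i + 1))
    done
    # The parquet file is expected next to the jsonl parts.
    f="$root/hdvila/hd_vila.parquet"
    if [ ! -f "$f" ]; then
        echo "missing: $f" >&2
        missing=$((missing + 1))
    fi
    [ "$missing" -eq 0 ]
)
```

For example, `check_hdvila_layout . && bash download_hdvila.sh` would only start the download when every expected file is present. The function body runs in a subshell so its loop variables do not leak into the calling shell.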
Please check your path in download_hdvila.sh
before running the script for downloading the dataset:
bash download_hdvila.sh
- Download pretrained captioners for videos (images) and audio.
pip install gdown
gdown --fuzzy "https://drive.google.com/file/d/1vYqb0Lb_3sQ5bo6XV-FQ4n7k_0M9UMU3/view?usp=sharing"
tar -xvf audio_captioner.tar.gz
gdown --fuzzy "https://drive.google.com/file/d/1ZFCWZ8csMWLYsg9CWt71PJmKYpSn-FMt/view?usp=sharing"
tar -xvf vision_captioner.tar.gz
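Downloads from Google Drive sometimes produce a small HTML page instead of the real file, which makes the later `tar` step fail with a confusing error. As a rough sketch, the check below tests whether each downloaded file can be listed as a tar archive before you extract it. `is_valid_archive` is a hypothetical helper, and the two file names are taken from the `tar` commands above.

```shell
# Hypothetical helper: succeed only if $1 is a non-empty file that tar can list.
# tar -tf reads the table of contents without extracting anything.
is_valid_archive() {
    [ -s "$1" ] && tar -tf "$1" >/dev/null 2>&1
}

# Report the state of both captioner archives in the current directory.
for archive in audio_captioner.tar.gz vision_captioner.tar.gz; do
    if [ ! -e "$archive" ]; then
        echo "$archive: not downloaded yet"
    elif is_valid_archive "$archive"; then
        echo "$archive: looks like a valid archive"
    else
        echo "$archive: not a valid archive (possibly an HTML page from Google Drive)"
    fi
done
```

If a file is reported as invalid, re-running the corresponding `gdown` command is usually the first thing to try.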
- Deploy captioners for data annotation.
Set up the Python environment for the captioners:
bash setup_env.sh
Video Annotation with Captions
bash scripts/run_vision_captioner.sh
Audio Annotation with Captions
bash scripts/run_audio_captioner.sh
- (Optional) Deploy a depth estimator to annotate 3D content.
We highly recommend GeoWizard for generating high-quality 3D content. Its drawback is the slow inference speed typical of generative models, so in practice we use DPT to annotate most of the data.