This repository contains the StreamRoboLM model code, pre-trained checkpoints and evaluation code from "ClevrSkills: Compositional Language and Visual Reasoning in Robotics".
First, please follow the instructions in the ClevrSkills repository to install the ClevrSkills task suite, start a Docker container, and download the required resources for ClevrSkills. Please make sure to mount the dataset and checkpoint directories into the container.
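For example, a container launch might mount those directories as follows (a minimal sketch: the image name and host paths are placeholders, so substitute the ones from the ClevrSkills setup instructions):

```bash
# Hypothetical invocation; replace <clevrskills-image> and the host paths
# with the image and directories from your ClevrSkills setup.
docker run -it --gpus all \
  -v /path/to/data:/path/to/data \
  -v /path/to/models:/path/to/models \
  <clevrskills-image> /bin/bash
```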
To reproduce the results from the paper, check out the `v0` branch of the ClevrSkills repo:

```bash
cd /path/to/clevrskills/repo/
git checkout v0
```
Then clone this repository inside the ClevrSkills Docker container:

```bash
export CLEVRSKILLS_PATH=/path/to/clevrskills/repo
git clone https://github.com/Qualcomm-AI-research/clevrskills-bench.git
cd clevrskills-bench
```
Download the ClevrSkills dataset and extract it. The open-sourced data consists of three zip files, each containing the data for one level of tasks. The zip files are split into ~10GB parts, so you may need to merge the parts before unzipping:

```bash
zip -s 0 simple_tasks.zip --out simple_tasks_merge.zip
```
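After merging, the archive can be extracted as usual (a sketch; the target directory is a placeholder for your dataset root):

```bash
# Extract the merged archive into the dataset root mounted in the container.
unzip simple_tasks_merge.zip -d /path/to/data
```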
Once you have downloaded and extracted the data, the directory will look like this:

```
/path/to/data
|
└───simple_tasks/
|   └───train/
|   |   ...
|   └───val/
|   |   ...
|   └───test/
|   |   └───match_pose/
|   |   |   └───traj_{seed}/
|   |   |   ...
|   |   └───move_without_hitting/
...
```
Alternatively, you can generate trajectories of your own to evaluate your policy on.
We provide StreamRoboLM (Llama-3) checkpoints for the L0, L1, and L2 task levels. The architecture uses `Llama-3.2-3B-Instruct` as the base language model and is trained using LoRA. Use `scripts/download_ckpts.sh` to download the checkpoints:

```bash
bash scripts/download_ckpts.sh ./models/
```

Once completed, the directory `./models/` should contain three directories, `l0_ckpt/`, `l1_ckpt/`, and `l2_ckpt/`, holding the L0, L1, and L2 checkpoints respectively.
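A quick sanity check of the download (assuming the layout above):

```bash
ls ./models/
# Expected: l0_ckpt  l1_ckpt  l2_ckpt
```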
The BSD 3-Clause Clear License of this repository also applies to the model weights.
Once you have downloaded/generated the data and downloaded a checkpoint, please update the corresponding paths for the checkpoint (`checkpoint_path`), the test trajectories (`data_root`), and the output directory (`output_dir`) in `eval_config.yaml`. To reproduce the numbers in the paper, please use the `test` split of the dataset to run the evaluation.
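For reference, the relevant entries in `eval_config.yaml` would look something like the sketch below (the values are placeholders; leave the remaining keys in the shipped config unchanged):

```yaml
# Placeholder values; point these at your own paths.
checkpoint_path: ./models/l0_ckpt              # checkpoint for the task level being evaluated
data_root: /path/to/data/simple_tasks/test     # test split of the extracted dataset
output_dir: ./eval_outputs                     # where evaluation results are written
```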
Then use the following command to evaluate the respective checkpoint and reproduce the finetuning results:

```bash
PYTHONPATH=./ python scripts/main.py --config-path clevrskills_bench/eval_config.yaml --task_level {task_level}
```

where `task_level` can take the values `L0`, `L1`, or `L2`.
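For example, to evaluate the L0 checkpoint:

```bash
PYTHONPATH=./ python scripts/main.py --config-path clevrskills_bench/eval_config.yaml --task_level L0
```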
We recommend a GPU with at least 48 GB of VRAM along with at least 32 GB of system RAM.
The repository is organized as follows:

```
ClevrSkills-Bench
|
└───clevrskills_bench: core package
|   |
|   └───streamrobolm: code for the StreamRoboLM architecture
|   |   └───model_wrappers: wrappers that encapsulate the submodules and implement forward
|   |   |   |   lm_wrappers.py: abstract language model wrapper
|   |   |   |   vlm_wrappers.py: abstract VLM wrapper
|   |   |   |   robo_wrappers.py: encapsulates StreamRoboLM and implements the forward pass
|   |   └───models:
|   |   |   └───huggingface: Llama-3 architecture code from Hugging Face, modified for StreamRoboLM
|   |   |   └───vision: code for the vision head (ViT)
|   |   └───model.py: initializes the model given config parameters
|   |   └───robo_heads.py: implements the LSTM action head
|   |   └───utils.py: miscellaneous utilities for the architecture code
|   |
|   └───evaluator.py: main code for evaluation
|   |
|   └───utils.py: miscellaneous utilities for evaluation
└───scripts:
|   |
|   └───main.py: evaluation entry script
```
If you find our code useful, please cite:
```bibtex
@article{haresh2024clevrskills,
  title={ClevrSkills: Compositional Language and Visual Reasoning in Robotics},
  author={Haresh, Sanjay and Dijkman, Daniel and Bhattacharyya, Apratim and Memisevic, Roland},
  journal={Advances in Neural Information Processing Systems},
  year={2024}
}
```