This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-Bench consists of 13 diverse, open-ended AI research tasks spanning domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills: generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task.
Warning
Meta MLGym is currently an experimental framework intended for benchmarking AI Research Agents. It is under heavy development. Please expect major changes to the design.
The primary goal of MLGym is to expand the selection of AI research tasks for benchmarking LLM agents and to implement RL algorithms for training LLMs in a research environment.
The `main` branch will always contain the latest stable release, and all breaking changes will be announced in the release notes.
- Clone and install dependencies:

  ```shell
  git clone git@github.com:facebookresearch/MLGym.git
  cd MLGym
  conda create -y -n mlgym python=3.11
  conda activate mlgym
  pip install -e .
  ```
- Create a `.env` file in the MLGym directory (`MLGym/.env`) to save all the environment variables, including API keys:

  ```shell
  # Env variables
  MLGYM_CONFIG_ROOT="<path_to_MLGYM_root>/configs"
  MLGYM_TASK_CONFIG_DIR="<path_to_MLGYM_root>/configs/tasks"
  MLGYM_WORKSPACE_PATH="<path_to_MLGYM_root>/workspace"
  MLGYM_ENV_TIMEOUT=10000
  MLGYM_ACTION_SHORT_TIMEOUT=60
  MLGYM_ACTION_LONG_TIMEOUT=10000
  MLGYM_MODEL_MAX_RETRIES=3

  # API keys
  OPENAI_API_KEY=""
  ANTHROPIC_API_KEY=""
  ```
- You can use either Docker or Podman to run tasks inside a container. Podman is the recommended way to run containers on macOS.
- Follow the instructions here to install Docker. Select the appropriate installation command for your OS.
- If you are working on a Linux machine, please install the `nvidia-container-toolkit`. This is required to start Docker containers with GPU support:

  ```shell
  sudo dnf install -y nvidia-container-toolkit
  ```
- If you don't want to use Podman, skip ahead to pulling the container image.
- For Linux:

  a. Follow the instructions here to install Podman.

  b. Start the Podman socket. The last command should show a running Podman socket:

  ```shell
  systemctl --user enable podman.socket
  systemctl --user start podman.socket
  systemctl --user status podman.socket
  ```

  c. Redirect the Docker host to Podman by exporting the `DOCKER_HOST` environment variable in your `.bashrc` or current session:

  ```shell
  export DOCKER_HOST=unix:///run/user/$UID/podman/podman.sock
  ```
- For macOS:

  a. If you use the Homebrew package manager, install Podman with `brew install podman`. Otherwise, follow the instructions here.

  b. Start the Podman machine and set the `DOCKER_HOST` environment variable:

  ```shell
  podman machine init
  podman machine start
  export DOCKER_HOST=unix://$(podman machine inspect --format '{{.ConnectionInfo.PodmanSocket.Path}}')
  ```
- Pull the container image:

  ```shell
  docker pull aigym/mlgym-agent:latest
  # or
  podman pull aigym/mlgym-agent:latest
  ```
- Test launching a Docker/Podman container with GPU support:

  ```shell
  docker run -it --gpus all --name test aigym/mlgym-agent /bin/bash
  ls -la
  exit
  ```
- Check that GPUs are available in the container using `nvidia-smi`. If you get Nvidia CDI spec errors on Linux (e.g. `Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=all`), run these additional commands:

  ```shell
  sudo mkdir /etc/cdi
  sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
  sudo touch /etc/containers/nodocker
  ```
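The `.env` file created earlier can also be loaded into the process environment from Python, which is handy when driving experiments from scripts or notebooks. Below is a minimal, dependency-free sketch; the helper `load_env_file` is ours for illustration and is not MLGym's own loader (the `python-dotenv` package provides the same via `load_dotenv`):

```python
import os

def load_env_file(path: str) -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file and export them.

    Illustrative sketch only: blank lines and '#' comments are
    skipped, and surrounding quotes are stripped from values.
    """
    loaded: dict[str, str] = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip().strip('"').strip("'")
    # Export so child processes (e.g. run.py) inherit the variables
    os.environ.update(loaded)
    return loaded
```

Values are exported with `os.environ.update`, so any subprocess launched afterwards sees the same configuration as a shell session that sourced the file.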
To run a task with Docker:

```shell
python run.py \
    --container_type docker \
    --task_config_path tasks/battleOfSexes.yaml \
    --model litellm:claude-3-5-sonnet-20240620 \
    --per_instance_cost_limit 4.00 \
    --agent_config_path configs/agents/default.yaml \
    --temp 1 \
    --gpus 0 \
    --max_steps 50 \
    --aliases_file ./docker/aliases.sh
```
To run the same task with Podman, set `--container_type` to `podman`:

```shell
python run.py \
    --container_type podman \
    --task_config_path tasks/battleOfSexes.yaml \
    --model litellm:claude-3-5-sonnet-20240620 \
    --per_instance_cost_limit 4.00 \
    --agent_config_path configs/agents/default.yaml \
    --temp 1 \
    --gpus 0 \
    --max_steps 50 \
    --aliases_file ./docker/aliases.sh
```
To see a full list of flags, please run `python run.py --help`.
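When sweeping over several tasks or models, it can be convenient to assemble the `run.py` invocation programmatically instead of editing a shell command. A minimal sketch using only the flags shown above (the `build_run_command` helper is ours, not part of MLGym):

```python
import subprocess

def build_run_command(task_config: str, model: str,
                      container: str = "docker",
                      cost_limit: float = 4.00,
                      max_steps: int = 50) -> list[str]:
    """Assemble a run.py command line from the flags shown above."""
    return [
        "python", "run.py",
        "--container_type", container,
        "--task_config_path", task_config,
        "--model", model,
        "--per_instance_cost_limit", f"{cost_limit:.2f}",
        "--agent_config_path", "configs/agents/default.yaml",
        "--temp", "1",
        "--gpus", "0",
        "--max_steps", str(max_steps),
        "--aliases_file", "./docker/aliases.sh",
    ]

cmd = build_run_command("tasks/battleOfSexes.yaml",
                        "litellm:claude-3-5-sonnet-20240620")
# subprocess.run(cmd, check=True)  # uncomment to actually launch
```

Keeping the command as a list (rather than a single string) avoids shell-quoting issues when paths or model names contain special characters.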
Note
Detailed documentation for all parts of the MLGym framework is under construction. Please stay tuned!
MLGym provides a Web UI to inspect the agent trajectories.
```shell
streamlit run demo/trajectory_visualizer.py -- --trajectory_dir <absolute_path_to_trajectories>

# An example
streamlit run demo/trajectory_visualizer.py -- --trajectory_dir $HOME/Projects/MLGym/trajectories/mlgym_bench_v0
```
To run the demo for MLGym, use the following command:
```shell
streamlit run demo/demo.py
```
MLGym was built and is maintained by GenAI at Meta and UCSB NLP. We welcome contributions to MLGym. If you are interested in contributing, please see this document. Our maintenance plan can be found here.
If you find this work helpful, please consider citing us using the following:
```bibtex
@misc{nathani2025mlgymnewframeworkbenchmark,
  title={MLGym: A New Framework and Benchmark for Advancing AI Research Agents},
  author={Deepak Nathani and Lovish Madaan and Nicholas Roberts and Nikolay Bashlykov and Ajay Menon and Vincent Moens and Amar Budhiraja and Despoina Magka and Vladislav Vorotilov and Gaurav Chaurasia and Dieuwke Hupkes and Ricardo Silveira Cabral and Tatiana Shavrina and Jakob Foerster and Yoram Bachrach and William Yang Wang and Roberta Raileanu},
  year={2025},
  eprint={2502.14499},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.14499},
}
```
The majority of this code is licensed under the CC-BY-NC 4.0 (Attribution-NonCommercial 4.0 International) license. However, portions of the project are available under separate license terms: SWE-Agent and Modded-NanoGPT are released under the MIT license; Gymnax and Gymnax-blines are released under the Apache 2.0 license.