MLGym: A New Framework and Benchmark for Advancing AI Research Agents

Introduction

MLGym is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training AI research agents. MLGym-Bench consists of 13 diverse and open-ended AI research tasks spanning domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task.

Warning

Meta MLGym is currently an experimental framework intended for benchmarking AI research agents. It is under heavy development. Please expect major changes to the design.

The primary goal of MLGym is to expand the selection of AI research tasks for benchmarking LLM agents and to support implementing RL algorithms for training LLMs in a research environment. The main branch will always contain the latest stable release, and all breaking changes will be announced in the release notes.

Installation

  1. Clone and install dependencies

    git clone git@github.com:facebookresearch/MLGym.git
    cd MLGym
    conda create -y -n mlgym python=3.11
    conda activate mlgym
    pip install -e .
  2. Create a .env file in the MLGym directory (MLGym/.env) to store all the environment variables, including API keys. A quick sanity check for this file is shown after this list.

    # Env variables
    MLGYM_CONFIG_ROOT="<path_to_MLGYM_root>/configs"
    MLGYM_TASK_CONFIG_DIR="<path_to_MLGYM_root>/configs/tasks"
    MLGYM_WORKSPACE_PATH="<path_to_MLGYM_root>/workspace"
    MLGYM_ENV_TIMEOUT=10000
    MLGYM_ACTION_SHORT_TIMEOUT=60
    MLGYM_ACTION_LONG_TIMEOUT=10000
    MLGYM_MODEL_MAX_RETRIES=3
    
    # API keys
    OPENAI_API_KEY=""
    ANTHROPIC_API_KEY=""
  3. You can use either Docker or Podman to run tasks inside a container. Podman is the recommended way to run containers on macOS.

  4. Follow the instructions here to install Docker. Select the appropriate installation command based on your OS.

  5. If you are working on a Linux machine, please install the NVIDIA Container Toolkit (nvidia-container-toolkit). This is required to start Docker containers with GPU support. The command below is for dnf-based distributions; use your distribution's package manager otherwise.

    sudo dnf install -y nvidia-container-toolkit
  6. Please skip to step 9 if you don't want to use Podman.

  7. For Linux:
    a. Follow the instructions here to install Podman.
    b. Start the Podman socket. The last of the following commands should report a running Podman socket:

    systemctl --user enable podman.socket
    systemctl --user start podman.socket
    systemctl --user status podman.socket 

    c. Redirect the Docker host to Podman by exporting the DOCKER_HOST environment variable in your .bashrc or in the current session:

    export DOCKER_HOST=unix:///run/user/$UID/podman/podman.sock
  8. For macOS:
    a. If you use the Homebrew package manager, install Podman with brew install podman. Otherwise, follow the instructions here.
    b. Start the Podman machine and set the DOCKER_HOST environment variable:

    podman machine init
    podman machine start
    export DOCKER_HOST=unix://$(podman machine inspect --format '{{.ConnectionInfo.PodmanSocket.Path}}')
  9. Pull the container image:

    docker pull aigym/mlgym-agent:latest

    or

    podman pull aigym/mlgym-agent:latest
  10. Test launching a Docker/Podman container with GPU support:

    docker run -it --gpus all --name test aigym/mlgym-agent /bin/bash
    ls -la
    exit
  11. Check that GPUs are available inside the container using nvidia-smi (a one-shot check is shown below).
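
If you only want to confirm GPU visibility (step 11) without keeping a test container around, a one-shot check along these lines should work. This is a minimal sketch that reuses the image and --gpus flag from step 10; substitute podman for docker if you use Podman.

# Run nvidia-smi in a throwaway container; --rm removes the container on exit
docker run --rm --gpus all aigym/mlgym-agent:latest nvidia-smi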
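
Similarly, for step 2, you can sanity-check the .env file by sourcing it and confirming that the configured directories resolve. This is a minimal sketch assuming <path_to_MLGYM_root> has been replaced with the actual path to your checkout.

# Export everything defined in .env into the current shell
set -a; source .env; set +a
# Both directories should exist and contain the bundled configs
ls "$MLGYM_CONFIG_ROOT" "$MLGYM_TASK_CONFIG_DIR"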

Troubleshooting

If you get NVIDIA CDI spec errors on Linux (e.g., Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=all), run these additional commands:

sudo mkdir /etc/cdi
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
sudo touch /etc/containers/nodocker
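
To verify that the regenerated spec is visible to the container tooling, you can list the CDI devices known to the NVIDIA Container Toolkit (this assumes nvidia-ctk is on your PATH, as installed in installation step 5):

nvidia-ctk cdi list

The listed devices should include nvidia.com/gpu entries, including the nvidia.com/gpu=all device referenced in the error above.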

Quick Start

Docker

python run.py \
  --container_type docker \
  --task_config_path tasks/battleOfSexes.yaml \
  --model litellm:claude-3-5-sonnet-20240620 \
  --per_instance_cost_limit 4.00 \
  --agent_config_path configs/agents/default.yaml \
  --temp 1 \
  --gpus 0 \
  --max_steps 50 \
  --aliases_file ./docker/aliases.sh

Podman

python run.py \
  --container_type podman \
  --task_config_path tasks/battleOfSexes.yaml \
  --model litellm:claude-3-5-sonnet-20240620 \
  --per_instance_cost_limit 4.00 \
  --agent_config_path configs/agents/default.yaml \
  --temp 1 \
  --gpus 0 \
  --max_steps 50 \
  --aliases_file ./docker/aliases.sh

To see a full list of flags, please run python run.py --help.

Note

Detailed documentation for all parts of the MLGym framework is under construction. Please stay tuned!

Trajectory Visualizer

MLGym provides a Web UI to inspect the agent trajectories.

streamlit run demo/trajectory_visualizer.py -- --trajectory_dir <absolute_path_to_trajectories>

# An example
streamlit run demo/trajectory_visualizer.py -- --trajectory_dir $HOME/Projects/MLGym/trajectories/mlgym_bench_v0

To run the demo for MLGym, use the following command:

streamlit run demo/demo.py

Contributions and Maintenance

MLGym was built and is maintained by GenAI at Meta and UCSB NLP. We welcome contributions to MLGym. If you are interested in contributing, please see this document. Our maintenance plan can be found here.

Citation

If you find this work helpful, please consider citing us using the following:

@misc{nathani2025mlgymnewframeworkbenchmark,
      title={MLGym: A New Framework and Benchmark for Advancing AI Research Agents}, 
      author={Deepak Nathani and Lovish Madaan and Nicholas Roberts and Nikolay Bashlykov and Ajay Menon and Vincent Moens and Amar Budhiraja and Despoina Magka and Vladislav Vorotilov and Gaurav Chaurasia and Dieuwke Hupkes and Ricardo Silveira Cabral and Tatiana Shavrina and Jakob Foerster and Yoram Bachrach and William Yang Wang and Roberta Raileanu},
      year={2025},
      eprint={2502.14499},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.14499}, 
}

License

The majority of this code is licensed under the CC-BY-NC 4.0 (Attribution-NonCommercial 4.0 International) license. However, portions of the project are available under separate license terms: SWE-Agent and Modded-NanoGPT are released under the MIT license; Gymnax and Gymnax-blines are released under the Apache 2.0 license.