HonestCyberEval focuses on models' ability to detect vulnerabilities in real-world software by generating structured inputs that trigger known sanitizers.
The vulnerability exploitation task is based on the challenge projects released for the DARPA AIxCC competition.
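In practice, an attempt succeeds when a model-generated input makes a sanitizer-instrumented build of the target crash with a sanitizer report. A minimal sketch of that check, with hypothetical binary and input names (not part of the benchmark):

```
# Illustration only: run a sanitizer-instrumented target on a model-generated
# input and look for a sanitizer report (both file names are hypothetical).
./target_asan_build model_generated_input.bin 2>&1 \
  | grep -q "ERROR: AddressSanitizer" && echo "sanitizer triggered"
```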
- Install dependencies:
  sudo apt install make
- Install yq, e.g.:
  sudo snap install yq
- To avoid issues with address randomisation (more info), run:
  sudo sysctl vm.mmap_rnd_bits=28
  echo "vm.mmap_rnd_bits=28" | sudo tee -a /etc/sysctl.conf
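To confirm the setting took effect (a quick sanity check, not part of the original setup steps):

```
# Should print: vm.mmap_rnd_bits = 28
sysctl vm.mmap_rnd_bits
```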
Set up the environment variables and API keys:
- Rename the .env.example file:
  cp .env.example .env
- Generate a new personal access token (PAT) (https://github.com/settings/tokens) with read:packages permissions. Fill in the GITHUB_USER and GITHUB_TOKEN values.
- Fill in API keys for the LLM(s) that are to be evaluated (ANTHROPIC_API_KEY, AZURE_API_KEY, OPENAI_API_KEY).
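A filled-in .env might look like the following; the variable names come from the steps above, the values are placeholders:

```
GITHUB_USER=<your GitHub username>
GITHUB_TOKEN=<PAT with read:packages>
ANTHROPIC_API_KEY=<key, if evaluating Anthropic models>
AZURE_API_KEY=<key, if evaluating Azure-hosted models>
OPENAI_API_KEY=<key, if evaluating OpenAI models>
```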
The evaluation challenge projects are run inside Docker containers. If Docker is unavailable, install it by following the documentation. Then, enable managing Docker as a non-root user.
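The standard way to do the latter, per Docker's post-installation steps (log out and back in afterwards for the group change to apply):

```
sudo groupadd docker          # may report that the group already exists
sudo usermod -aG docker $USER
```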
To pull the Docker images for the challenge projects, log in to ghcr.io using your PAT:
echo "<token>" | docker login ghcr.io -u <user> --password-stdin
replacing <user> with your GitHub username and <token> with your generated PAT.
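Since these credentials are already stored in .env, one convenience (assuming .env holds plain KEY=value lines, as sketched above) is to source it instead of retyping the token:

```
# Log in to ghcr.io reusing the credentials from .env
. ./.env
echo "$GITHUB_TOKEN" | docker login ghcr.io -u "$GITHUB_USER" --password-stdin
```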
First, configure which challenge project should be downloaded by (un)commenting the appropriate entries in config/cp_config.yaml.
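The actual schema of config/cp_config.yaml is defined by the repository; the snippet below is only a hypothetical illustration of the (un)commenting pattern, with assumed keys:

```
# Hypothetical shape -- keys are assumptions; only the commenting pattern matters.
challenge_projects:
  - nginx-cp       # will be downloaded
  # - other-cp     # commented out: skipped
```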
Run the make cps command to download the code and Docker images associated with the challenge projects defined in cp_config.yaml. The code will be downloaded to cp_root.
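A quick check that the download worked (assuming cp_root is created in the repository root):

```
# Each configured challenge project should appear as a subdirectory
ls cp_root
```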
Finally, run the evaluation using:
inspect eval exploit.py --model=<model> -T cp=<challenge project> -S max_iterations=<num>
For example:
inspect eval exploit.py --model=openai/o1 -T cp=nginx-cp -S max_iterations=8
will run the nginx-cp project with 8 reflexion loops.
The first run will be slower as it patches and builds multiple copies of the project. We recommend starting with a mock run to create the test projects before running the eval, but this is not required:
inspect eval exploit.py --model=mockllm/model -T cp=nginx-cp -S max_iterations=1
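After a run completes, the results can be browsed with Inspect's log viewer (assuming a default Inspect AI setup, which writes logs to ./logs):

```
# Opens a local web viewer for evaluation transcripts and scores
inspect view
```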
Future work:
- Use the Inspect Docker sandbox instead of the AIxCC Docker scripts for better integration
- Support challenge projects that expect input as bytes
- More tasks