
HonestCyberEval

HonestCyberEval focuses on models' ability to detect vulnerabilities in real-world software by generating structured inputs that trigger known sanitizers.

The vulnerability exploitation task is based on the challenge projects released for the DARPA AIxCC competition.

Setup

  • Install dependencies:

    sudo apt install make
  • Install yq

    • E.g.:

      sudo snap install yq
  • To avoid issues with address space layout randomisation, run:

    sudo sysctl vm.mmap_rnd_bits=28
    echo "vm.mmap_rnd_bits=28" | sudo tee -a /etc/sysctl.conf
  • Set up the environment variables and API keys:

    • Copy the .env.example file:
    cp .env.example .env
    • Generate a new personal access token (PAT) (https://github.com/settings/tokens) with the read:packages permission, then fill in the GITHUB_USER and GITHUB_TOKEN values.
    • Fill in API keys for the LLM(s) to be evaluated (ANTHROPIC_API_KEY, AZURE_API_KEY, OPENAI_API_KEY); a sketch of a filled-in .env follows this list.
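
A filled-in .env might look like the sketch below. The variable names come from the steps above; the values are placeholders to be replaced with your own credentials:

GITHUB_USER=<your GitHub username>
GITHUB_TOKEN=<your PAT with read:packages>
ANTHROPIC_API_KEY=<key, if evaluating Anthropic models>
AZURE_API_KEY=<key, if evaluating Azure-hosted models>
OPENAI_API_KEY=<key, if evaluating OpenAI models>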

Docker

The evaluation challenge projects run inside Docker containers. If Docker is not installed, install it by following the official documentation. Then, enable managing Docker as a non-root user, for example as shown below.
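
On most Linux systems, the non-root setup follows the standard post-install steps from the Docker documentation:

sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker

(newgrp docker applies the new group membership to the current shell; alternatively, log out and back in.)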

To be able to pull the Docker images for the challenge projects, log in to ghcr.io with your PAT by running:

echo "<token>" | docker login ghcr.io -u <user> --password-stdin

replacing <user> with your GitHub username and <token> with the PAT generated earlier.

Running the evaluation

First, configure which challenge project should be downloaded by (un)commenting the appropriate entries in config/cp_config.yaml.
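
As a purely illustrative sketch (the actual keys, project names, and URLs are those already present in config/cp_config.yaml and may differ), the pattern is to leave the projects you want uncommented and comment out the rest:

cp_targets:
  nginx-cp:             # uncommented: downloaded by make cps
    url: <repository URL>
#  another-cp:          # commented out: skipped
#    url: <repository URL>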

Run the make cps command to download the code and Docker images associated with the challenge projects defined in cp_config.yaml. The code will be downloaded to cp_root.
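
For example, from the repository root (assumed to be where the Makefile lives):

make cps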

Finally, run the evaluation using inspect eval exploit.py --model=<model> -T cp=<challenge project> -S max_iterations=<num>.

For example:

inspect eval exploit.py --model=openai/o1 -T cp=nginx-cp

will run the nginx-cp project with the default of 8 reflexion loops.
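
To change the number of reflexion loops, pass max_iterations explicitly; for instance, assuming 4 loops are wanted:

inspect eval exploit.py --model=openai/o1 -T cp=nginx-cp -S max_iterations=4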

The first run will be slower because it patches and builds multiple copies of the project. We recommend starting with a mock run to create the test projects before running the eval, but this is not required:

inspect eval exploit.py --model=mockllm/model -T cp=nginx-cp -S max_iterations=1

Future work

  • Use Inspect Docker sandbox instead of AIxCC Docker scripts for better integration
  • Support challenge projects that expect input as bytes
  • More tasks
