Aguvis

Overview

Aguvis is an open-source, pure vision-based agent framework for GUI interaction, developed jointly by Salesforce and The University of Hong Kong. It operates on screenshots of the interface rather than on an accessibility tree or HTML. The project page, research paper, and code are listed under References.

Key Features

Pure vision input (screenshots only, no a11y tree required)
Optional integration with GPT-4o
Multiple model variants (7B and 72B)
Joint development by Salesforce and HKU

Performance

OSWorld Results

Aguvis-72B w/ GPT-4o: 17.04%
Aguvis-72B: 10.26%
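
To make the gap between the two OSWorld entries concrete, the short sketch below computes the absolute and relative gain from the figures listed above. The dictionary and the `gain` helper are illustrative names introduced here, not part of the Aguvis codebase.

```python
# OSWorld results (percent) copied from the list above.
osworld = {
    "Aguvis-72B w/ GPT-4o": 17.04,
    "Aguvis-72B": 10.26,
}


def gain(with_gpt4o: float, standalone: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = with_gpt4o - standalone
    relative = absolute / standalone * 100
    return round(absolute, 2), round(relative, 1)


abs_gain, rel_gain = gain(osworld["Aguvis-72B w/ GPT-4o"], osworld["Aguvis-72B"])
print(f"+{abs_gain} points ({rel_gain}% relative)")
```

On these numbers, pairing the 72B model with GPT-4o adds 6.78 percentage points, roughly a 66% relative improvement over the standalone model.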

ScreenSpot Results

Aguvis-72B: 89.2% accuracy
Aguvis-7B: 84.4% accuracy

Technical Details

Model Size: Available in 7B and 72B parameters
Input: Screenshots only (no accessibility tree or HTML)
Optional GPT-4o integration
Focus on multimodal understanding
Pure vision-based framework for GUI interaction

References

Project: https://aguvis-project.github.io/
Paper: https://arxiv.org/abs/2412.04454
Code: https://github.com/xlang-ai/aguvis
