Aguvis

Overview

Aguvis is an open-source multimodal agent system developed jointly by Salesforce and the University of Hong Kong (HKU). It takes both screenshots and accessibility tree (a11y tree) information as input. The project is officially hosted online and described in the accompanying research paper.

Key Features

  • Multimodal approach (screenshot + a11y tree)
  • Integration with GPT-4o
  • Multiple model variants (7B and 72B)
  • Joint development by Salesforce and HKU

Performance

OSWorld Results

  • Aguvis-72B w/ GPT-4o: 17.04%
  • Aguvis-72B: 10.26%

WebArena Results

  • Aguvis-72B: 89.2% accuracy
  • Aguvis-7B: 84.4% accuracy
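
To make the gaps between the reported scores explicit, the short sketch below recomputes them from the figures listed above. The dictionary and function names are illustrative and not part of the Aguvis project; only the four percentages come from this page.

```python
# Scores copied from the lists above (percent). Names are illustrative only.
osworld = {"Aguvis-72B w/ GPT-4o": 17.04, "Aguvis-72B": 10.26}
webarena = {"Aguvis-72B": 89.2, "Aguvis-7B": 84.4}


def point_gap(higher: float, lower: float) -> float:
    """Absolute difference in percentage points."""
    return round(higher - lower, 2)


def relative_gain(higher: float, lower: float) -> float:
    """Improvement of `higher` over `lower`, as a percentage of `lower`."""
    return round(100.0 * (higher - lower) / lower, 1)


# Effect of adding GPT-4o to the 72B model on OSWorld.
gpt4o_gap = point_gap(osworld["Aguvis-72B w/ GPT-4o"], osworld["Aguvis-72B"])
gpt4o_rel = relative_gain(osworld["Aguvis-72B w/ GPT-4o"], osworld["Aguvis-72B"])

# Effect of model size (72B vs 7B) on WebArena.
size_gap = point_gap(webarena["Aguvis-72B"], webarena["Aguvis-7B"])

print(f"OSWorld, GPT-4o vs. none: +{gpt4o_gap} points ({gpt4o_rel}% relative)")
print(f"WebArena, 72B vs. 7B: +{size_gap} points")
```

On these numbers, adding GPT-4o contributes 6.78 percentage points on OSWorld, while the step from 7B to 72B contributes 4.8 points on WebArena.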

Technical Details

  • Model Size: Available in 7B- and 72B-parameter variants
  • Input: Combined screenshot and accessibility tree
  • Optional GPT-4o integration
  • Focus on multimodal understanding
  • Pure vision-based framework for GUI interaction
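
The variants listed on this page (two model sizes, with or without GPT-4o) can be described with a small configuration object. The following is a minimal sketch under that assumption. `AgentConfig`, `Observation`, and their fields are hypothetical names chosen for illustration; they are not taken from the Aguvis codebase and do not reflect its actual API.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_SIZES = ("7B", "72B")  # the two sizes listed on this page


@dataclass(frozen=True)
class Observation:
    """Hypothetical container for one step of combined input."""
    screenshot_png: bytes  # raw screenshot bytes
    a11y_tree: str         # serialized accessibility tree


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical description of a variant; not the project's real config."""
    model_size: str
    planner: Optional[str] = None  # e.g. "gpt-4o" when the optional integration is used

    def __post_init__(self) -> None:
        if self.model_size not in ALLOWED_SIZES:
            raise ValueError(f"unknown model size: {self.model_size!r}")

    @property
    def name(self) -> str:
        """Label in the same style as the results lists on this page."""
        base = f"Aguvis-{self.model_size}"
        return f"{base} w/ GPT-4o" if self.planner == "gpt-4o" else base


config = AgentConfig(model_size="72B", planner="gpt-4o")
obs = Observation(screenshot_png=b"", a11y_tree="<root/>")
print(config.name)
```

Keeping the planner as an optional field mirrors the way the results are reported, with and without GPT-4o for the same base model.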

References