update website
HenryHe0123 committed Dec 19, 2024
1 parent a8e5ed7 commit 7f207ec
Showing 1 changed file with 25 additions and 31 deletions.
56 changes: 25 additions & 31 deletions index.html
@@ -65,7 +65,7 @@
More Research
</a>
<div class="navbar-dropdown">
<a class="navbar-item" href="https://github.com/GAIR-NLP/abel">
<a class="navbar-item" href="https://gair-nlp.github.io/abel">
Abel
</a>
<a class="navbar-item" href="https://gair-nlp.github.io/MathPile/">
@@ -98,15 +98,22 @@
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">PC Agent: While You Sleep, AI Works - A Cognitive Journey into Digital World</h1>
<div class="is-size-5 publication-authors">
<span class="author-block"><a href="https://github.com/HenryHe0123">Yanheng He*</a><sup>1,3</sup>,</span>
<span class="author-block"><a href="https://github.com/zizi0123">Jiahe Jin*</a><sup>1,3</sup>,</span>
<span class="author-block"><a href="http://pfliu.com/">Pengfei Liu</a><sup>1, 2, 3+</sup></span>
<span class="author-block"><a href="https://github.com/HenryHe0123">Yanheng He</a><sup>1,2*</sup>,</span>
<span class="author-block"><a href="https://github.com/zizi0123">Jiahe Jin</a><sup>1,2*</sup>,</span>
<span class="author-block"><a href="https://shijie-xia.github.io/">Shijie Xia</a><sup>1,2</sup>,</span>
<span class="author-block"><a href="https://github.com/JoyBoy-Su">Jiadi Su</a><sup>2,4</sup>,</span>
<span class="author-block"><a href="https://rzfan525.github.io/">Runze Fan</a><sup>1,2</sup>,</span>
<span class="author-block"><a href="https://github.com/haoy-zzz">Haoyang Zou</a><sup>2,4</sup>,</span>
<span class="author-block"><a href="https://www.linkedin.com/in/xiangkun-hu-157b0122b/">Xiangkun Hu</a><sup>5</sup>,</span>
<span class="author-block"><a href="http://pfliu.com/">Pengfei Liu</a><sup>1,2,3+</sup></span>
</div>

<div class="is-size-5 publication-authors">
<span class="author-block"><sup>1</sup>Shanghai Jiao Tong University,</span>
<span class="author-block"><sup>2</sup>Shanghai Artificial Intelligence Laboratory,</span>
<span class="author-block"><sup>3</sup>Generative AI Research Lab (GAIR)</span>
<span class="author-block"><sup>2</sup>Generative AI Research Lab (GAIR)</span>
<span class="author-block"><sup>3</sup>Shanghai Artificial Intelligence Laboratory,</span>
<span class="author-block"><sup>4</sup>Fudan University</span>
<span class="author-block"><sup>5</sup>Amazon AWS AI</span>
</div>

<div class="is-size-5 publication-authors">
@@ -119,7 +126,7 @@ <h1 class="title is-1 publication-title">PC Agent: While You Sleep, AI Works - A
<div class="publication-links">
<!-- PDF Link. -->
<span class="link-block">
<a href="http://arxiv.org/abs/2407.06135" target="_blank"
<a href="http://arxiv.org/" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
@@ -170,6 +177,7 @@ <h1 class="title is-1 publication-title">PC Agent: While You Sleep, AI Works - A
<section class="section">
<div class="container is-max-desktop">
<h2 class="title is-3 has-text-centered">Demo Videos</h2>
+<br> <!-- add this line to create a blank line -->

<!-- First row -->
<div class="video-row">
@@ -178,14 +186,14 @@ <h2 class="title is-3 has-text-centered">Demo Videos</h2>
<video autoplay controls muted loop>
<source src="static/videos/Attention.mp4" type="video/mp4">
</video>
<p class="has-text-centered"><b>Demo1: Make a presentation for <i>Attention is All You Need</i></b></p>
<p class="has-text-centered"><b>Demo 1: Make a presentation for <i>Attention is All You Need</i></b></p>
</div>
<!-- Demo 2 -->
<div class="video-item">
<video autoplay controls muted loop>
<source src="static/videos/Nobel Prize.mp4" type="video/mp4">
</video>
<p class="has-text-centered"><b>Demo2: Make a presentation for Nobel Prize 2024</b></p>
<p class="has-text-centered"><b>Demo 2: Make a presentation for Nobel Prize in Physics 2024</b></p>
</div>
</div>

@@ -196,15 +204,15 @@ <h2 class="title is-3 has-text-centered">Demo Videos</h2>
<video autoplay controls muted loop>
<source src="static/videos/Turing Award.mp4" type="video/mp4">
</video>
<p class="has-text-centered"><b>Demo3: Make 11 posters for Turing Award Winners</b></p>
<p class="has-text-centered"><b>Demo 3: Make 11 posters for Turing Award Winners</b></p>
</div>
<!-- Demo 4 -->
<div class="video-item">
<video autoplay controls muted loop>
<source src="static/videos/Claude.mp4" type="video/mp4">
</video>

<p class="has-text-centered"><b>Demo4: Use Claude to build a website</b></p>
<p class="has-text-centered"><b>Demo 4: Use Claude to build a website for PC Agent</b></p>
</div>
</div>

@@ -238,22 +246,8 @@ <h2 id="leaderboard" class="title is-3">Overview</h2>
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
-While artificial intelligence has made remarkable progress in understanding and generating content like text and images, current AI systems still struggle to effectively operate real-world computers as humans do. Two critical challenges remain unsolved: foundational visual grounding and cognitive understanding for complex computer jobs. We present <b>PC Agent</b>, a digital agent that shows significant promise in autonomously navigating and operating in real-world computer environments. Our key insight is that the path to digital world lies in <b>human cognition transfer</b> - enabling AI systems to learn from human cognitive processes in computer use. This transfer is implemented through three key components: (1) PC Tracker, the first lightweight infrastructure for efficiently collecting large-scale human-computer interaction trajectories; (2) a two-stage cognition completion pipeline that transforms raw interaction data into cognitive trajectories by completing action semantics and thought processes; and (3) a multi-agent system combining a planning agent for decision-making with a grounding agent for robust visual grounding.
-Our preliminary experiments with PowerPoint presentation creation tasks indicate that PC Agent, trained on just 133 cognitive trajectories, demonstrates capabilities in handling dozens of steps and cross-application operations. By open-sourcing our framework, we aim to accelerate the advancement of digital agents, making this critical step toward more capable AI systems accessible to the entire research community.
+Imagine a world where AI can handle your work while you sleep - organizing your research materials, drafting a report, or creating a presentation you need for tomorrow. However, while current digital agents can perform simple tasks, they are far from capable of handling the complex real-world work that humans routinely perform. We present <strong>PC Agent</strong>, an AI system that demonstrates a crucial step toward this vision through <strong>human cognition transfer</strong>. Our key insight is that the path from executing simple "tasks" to handling complex "work" lies in efficiently capturing and learning from human cognitive processes during computer use. To validate this hypothesis, we introduce three key innovations: (1) PC Tracker, a lightweight infrastructure that efficiently collects high-quality human-computer interaction trajectories with complete cognitive context; (2) a two-stage cognition completion pipeline that transforms raw interaction data into rich cognitive trajectories by completing action semantics and thought processes; and (3) a multi-agent system combining a planning agent for decision-making with a grounding agent for robust visual grounding. Our preliminary experiments in PowerPoint presentation creation reveal that complex digital work capabilities can be achieved with a small amount of high-quality cognitive data - PC Agent, trained on just 133 cognitive trajectories, can handle sophisticated work scenarios involving up to 50 steps across multiple applications. This demonstrates the data efficiency of our approach, highlighting that the key to training capable digital agents lies in collecting human cognitive data. By open-sourcing our complete framework, including the data collection infrastructure and cognition completion methods, we aim to lower the barriers for the research community to develop truly capable digital agents.
</p>

-<!-- The major functionalities of Anole are listed below:
-<ul>
-<li><b>Text-to-Image Generation</b></li>
-<li><b>Interleaved Text-Image Generation</b></li>
-<li>Text Generation</li>
-<li>Fine-grained Evaluation</li>
-</ul>
-where <b>Bold</b> represents newly added capabilities on the basis of Chameleon. -->
-<p>
-
-</p>

</div>
</div>
</div>
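The new abstract above describes a multi-agent split between a planning agent (decision-making) and a grounding agent (visual grounding). The loop below is a minimal illustrative sketch of that division of labor, not the repository's actual code; every name in it (PlanningAgent, GroundingAgent, capture, click, type_text) is a hypothetical stand-in.

# Minimal sketch of the planning/grounding split described in the abstract.
# All names are hypothetical; PC Agent's real implementation may differ entirely.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "finish"
    target: str = ""   # natural-language description of a UI element
    text: str = ""     # text to type, if any

class PlanningAgent:
    """Decision-making: picks the next action from task, history, and screen."""
    def next_action(self, task: str, history: list, screenshot: bytes) -> Action:
        raise NotImplementedError  # a fine-tuned LLM in the real system

class GroundingAgent:
    """Robust visual grounding: maps an element description to coordinates."""
    def locate(self, target: str, screenshot: bytes) -> tuple[int, int]:
        raise NotImplementedError  # a visual grounding model in the real system

def run(task, planner, grounder, capture, click, type_text, max_steps=50):
    history = []
    for _ in range(max_steps):  # the abstract reports work of up to 50 steps
        shot = capture()
        action = planner.next_action(task, history, shot)
        if action.kind == "finish":
            break
        if action.kind == "click":
            x, y = grounder.locate(action.target, shot)
            click(x, y)
        elif action.kind == "type":
            type_text(action.text)
        history.append(action)
    return history

The abstract also names a two-stage cognition completion pipeline: stage one recovers what each raw event did (action semantics), stage two reconstructs why it was taken (the thought process). A sketch under the same caveat, with llm standing in for whatever model performs the completion:

# Hypothetical two-pass cognition completion over a recorded trajectory.
from dataclasses import dataclass

@dataclass
class Event:
    raw: str             # e.g. "click (812, 344)" as captured by PC Tracker
    screenshot: bytes    # screen state when the event fired
    semantics: str = ""  # stage 1 output: what the action did
    thought: str = ""    # stage 2 output: why it was taken

def complete_cognition(events: list[Event], llm) -> list[Event]:
    # Stage 1: complete action semantics from the raw event and the screen.
    for ev in events:
        ev.semantics = llm(f"What does this action do on this screen? {ev.raw}",
                           image=ev.screenshot)
    # Stage 2: complete the thought process from the steps taken so far.
    for i, ev in enumerate(events):
        prior = [e.semantics for e in events[:i]]
        ev.thought = llm(f"Given prior steps {prior}, why take: {ev.semantics}?",
                         image=ev.screenshot)
    return events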
@@ -273,7 +267,7 @@ <h2 id="leaderboard" class="title is-3">🔍 Methodology</h2>
</div>
<div style="text-align: left;">
<p>Based on available information and our testing, the latest release of Chameleon has demonstrated strong performance in text understanding, text generation, and multimodal understanding. Anole, built on top of Chameleon, aims to facilitate the image generation and multimodal generation capabilities of Chameleon.</p>
-<p>Chameleons pre-training data natively includes both text and image modalities, theoretically equipping it with image generation capabilities. Our goal is to facilitate this ability without compromising its text understanding, generation, and multimodal comprehension. To achieve this, we froze most of Chameleons parameters and fine-tuned only the logits corresponding to image token ids in transformers output head layer.</p>
+<p>Chameleon's pre-training data natively includes both text and image modalities, theoretically equipping it with image generation capabilities. Our goal is to facilitate this ability without compromising its text understanding, generation, and multimodal comprehension. To achieve this, we froze most of Chameleon's parameters and fine-tuned only the logits corresponding to image token ids in transformer's output head layer.</p>
<p>Specifically, Anole-7b-v0.1 was developed using a small amount of image data (5,859 images, approximately 6 million image tokens) and by fine-tuning only a small number of parameters (fewer than 40M) in a short time (around 30 minutes on 8 A100 GPUs). Despite this, Anole-7b-v0.1 demonstrates impressive image generation capabilities.</p>
</div>
</div>
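The methodology paragraph above describes freezing most of Chameleon's parameters and fine-tuning only the logits for image token ids in the output head. A minimal PyTorch sketch of that idea follows; it approximates the selective update with a gradient mask (autograd cannot freeze a slice of a single parameter), and both model and image_token_ids are assumptions rather than Anole's actual code.

# Sketch: train only the lm_head rows that produce image-token logits.
# `model` and `image_token_ids` are placeholders, not Anole's actual code.
import torch

def freeze_all_but_image_logits(model, image_token_ids):
    for p in model.parameters():
        p.requires_grad = False

    head = model.lm_head.weight            # shape: (vocab_size, hidden_dim)
    head.requires_grad = True

    keep = torch.zeros(head.shape[0], dtype=torch.bool)
    keep[image_token_ids] = True

    def mask_grad(grad):
        # Zero the gradient of every non-image row after each backward pass,
        # so only image-token logits are actually updated.
        return grad * keep.to(grad.device).unsqueeze(1)

    head.register_hook(mask_grad)

Masked this way, the trainable surface stays small, consistent with the fewer-than-40M-parameter and roughly-30-minutes-on-8-A100s figures reported above.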
@@ -306,10 +300,10 @@ <h2 class="title is-3">📬 Contact</h2>
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title">BibTeX</h2>
-<pre><code>@article{chern2024anole,
-title={ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation},
-author={Chern, Ethan and Su, Jiadi and Ma, Yan and Liu, Pengfei},
-journal={arXiv preprint arXiv:2407.06135},
+<pre><code>@article{pcagent,
+title={PC Agent: While You Sleep, AI Works - A Cognitive Journey into Digital World},
+author={},
+journal={},
year={2024}
} </code></pre>
</div>
