---
title: Research
layout: default
---
<div class="research mx-6 sm:mx-10 md:mx-14 lg:mx-20 xl:mx-auto xl:w-11/12 py-5">
<div class="xl:mx-10">
<h1 class="text-5xl md:text-6xl tracking-normal font-medium text-indigo-500 my-6 md:my-8">About</h1>
<p class="text-gray-700 dark:text-gray-300 text-xl leading-relaxed">I am currently pursuing an M.S. in Computer Science at Case Western Reserve University, where I am advised by Professor Soumya Ray. Currently, I'm working on investigating failure modes of reinforcement learning (RL), and looking at how to design more sample-efficient algorithms. More broadly, my interests include all things RL and adversarial ML.</p>
<div class="my-12 flex justify-center">
<div class="button">
<a href="{% link /img/Resume.pdf %}" target="_blank">
Resume </a>
</div>
</div>
<h1 class="text-5xl md:text-6xl tracking-normal font-medium text-indigo-500 mt-12 md:mt-16">Recent Work</h1>
</div>
<div class="grid grid-cols-1 xl:grid-cols-2 mb-10">
<div class="card">
<div class="px-6 pt-4 pb-6">
<div class="card-header">OPAC<sup>2</sup></div>
<p class="card-text">
Most current off-policy RL algorithms for continuous action spaces (e.g., SAC, TD3) have a policy update that explicitly maximizes the Q-function using the reparameterization trick. When this policy update is combined with a value update that uses bootstrapping, the learned values are often <em>overestimated</em>. The common fix is to learn two Q-functions and take the minimum, which conversely results in an <em>underestimation</em> bias. Typically, this balances out the overestimation and yields decent performance. In this work, we demonstrate a class of scenarios in which this underestimation leads to a complete failure to learn. We also propose a novel off-policy actor-critic algorithm (OPAC<sup>2</sup>) that uses a stochastic policy gradient. We find that this approach does not suffer from the same estimation errors and significantly outperforms baseline approaches, all at no additional training cost.
</p>
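<p class="card-text">
For context, here is a minimal sketch (in PyTorch) of the standard clipped double-Q target that SAC- and TD3-style critics use. It is not the OPAC<sup>2</sup> update, and the names (<code>q1_target</code>, <code>q2_target</code>, <code>policy</code>) are illustrative placeholders rather than code from the paper.
</p>
<pre><code>import torch

def clipped_double_q_target(q1_target, q2_target, policy,
                            reward, next_obs, done, gamma=0.99):
    # Sketch of the common "learn two Q-functions and take the minimum" target.
    # All arguments are placeholder callables/tensors for illustration only.
    with torch.no_grad():
        next_action = policy(next_obs)         # a' sampled from the policy at s'
        q1 = q1_target(next_obs, next_action)
        q2 = q2_target(next_obs, next_action)
        # Taking the minimum curbs overestimation but introduces the
        # underestimation bias discussed above.
        min_q = torch.min(q1, q2)
        return reward + gamma * (1.0 - done) * min_q
</code></pre>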
<div class="inline-flex flex-row justify-center">
<div class="flex flex-col sm:flex-row sm:flex-wrap pb-6">
<div class="card-button">
<a target="_blank" href="https://openreview.net/pdf?id=kVq667T8CG">RLC 2024 ICBINB Workshop Paper</a>
</div>
</div>
</div>
</div>
</div>
<div class="card">
<div class="px-6 pt-4 pb-6">
<div class="card-header">DARPA ANSR</div>
<p class="card-text">
The DARPA <a href='https://www.darpa.mil/news-events/2023-09-25' target="_blank">Assured Neuro Symbolic Learning and Reasoning</a> (ANSR) program focuses on developing neuro-symbolic AI systems: hybrid systems composed of symbolic reasoning components and learned neural components. One major goal is to produce assurance guarantees about the robustness of these systems. My role at <a href="https://www.jhuapl.edu/work/labs-and-facilities/intelligent-systems-center" target="_blank">JHU/APL</a> was to help design new evaluation protocols, adversarial attacks, and red-teaming strategies for these neuro-symbolic systems, including a framework for evaluating the assurance guarantees. Because the systems under test are composites of multiple components, the evaluations were designed to test individual components (e.g., vision, controllers) as well as the entire end-to-end system.
</p>
</div>
</div>
<div class="card">
<div class="px-6 py-4">
<div class="card-header">Active Logic</div>
<p class="card-text">
This project focused on designing a robot agent that was explicitly aware of its own behaviors. Equipped with this knowledge, the agent could reason about its own past and current behaviors, enabling it to learn from prior mistakes and improve its performance over time. Our implementation and testing used two <a href="https://robots.ros.org/assets/img/robots/mobility-base/image.jpg" target="_blank">Baxter robots</a>.
This work was done at the Artificial Intelligence and Metacognition lab at the <a href="https://www.umiacs.umd.edu" target="_blank">University of Maryland Institute for Advanced Computer Studies</a>, under principal investigator Donald Perlis.
</p>
</div>
<div class="inline-flex flex-row justify-center">
<div class="flex flex-col sm:flex-row sm:flex-wrap px-6 pb-6">
<div class="card-button">
<a target="_blank" href="https://cs.umd.edu/active">
Lab Website </a>
</div>
<div class="card-button">
<a target="_blank" href="https://github.com/mclumd">
Lab Github </a>
</div>
<div class="card-button">
<a target="_blank" href="https://mclumd.github.io/ALMECOM%20Papers/2019/Perlis%20et%20al.%20-%202019%20-%20Live%20and%20Learn%2C%20Ask%20and%20Tell%20Agents%20Over%20Tasks.pdf">
Position Paper </a>
</div>
</div>
</div>
</div>
</div>
</div>