<h1 id="appendix">Appendix</h1>
<h2 id="reflexion">Reflexion</h2>
<h3 id="architecture-description-using-falaa">Architecture Description
using <code>FALAA</code></h3>
<p>The following sections present the information retrieval and prompt
creation fragments of the <code>Planner</code> and
<code>Reflector</code> components of the <code>Reflexion</code> agent in
Figures 1 and 2, respectively.</p>
<h4 id="figure-1">Figure 1</h4>
<p>UML sequence diagram of the information retrieval process for
<code>Planner</code> to generate a prompt. In this flow,
<code>Planner</code> uses internal actions of type
<code>retrieval</code>, represented by the <em>get</em> methods of the
<code>episodic</code>, <code>procedural</code>, and
<code>semantic</code> memories, from which examples are obtained along
with the allowed actions and instructions. These are condensed into the
<em>PlannerPrompt</em> object, which is intended to guide the behavior of
the <code>Planner</code> LLM when generating actions.</p>
<figure>
<img
src="img/abstraccion_arquitecturas/reflexion/retrieval_planner_info_sequence_diagram.png"
alt="UML sequence diagram of the information retrieval process for Planner." />
<figcaption aria-hidden="true">UML sequence diagram of the information
retrieval process for <code>Planner</code>.</figcaption>
</figure>
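<p>A minimal sketch of this prompt composition, assuming a Python
implementation; the mapping of each memory to a specific <em>get</em> accessor
is an assumption for illustration (only the <em>PlannerPrompt</em> object
itself is named in the description above):</p>
<pre><code>from collections import namedtuple

# Hypothetical container for the composed prompt; in implementations this is
# often a plain text string or a token list.
PlannerPrompt = namedtuple("PlannerPrompt", ["examples", "allowed_actions", "instructions"])


def build_planner_prompt(agent):
    """Compose the Planner prompt from the episodic, procedural, and semantic memories."""
    examples = agent.episodic_memory.get_examples()                  # retrieval action
    allowed_actions = agent.procedural_memory.get_allowed_actions()  # retrieval action
    instructions = agent.semantic_memory.get_instructions()          # retrieval action
    return PlannerPrompt(examples, allowed_actions, instructions)
</code></pre>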
<h4 id="figure-2">Figure 2</h4>
<p>UML sequence diagram of the information retrieval process for
<code>Reflector</code> to generate a prompt. In this flow,
<code>Reflector</code> uses internal actions of type
<code>retrieval</code>, represented by the <em>get</em> methods of the
<code>episodic</code> and <code>procedural</code> memories, from which
examples for the LLM are obtained, along with the allowed actions and
the instructions that guide the behavior of the LLM. These, together with the task
and current trajectory retrieved from the
<code>short-term memory</code>, are used to generate the prompt.</p>
<figure>
<img
src="img/abstraccion_arquitecturas/reflexion/retrieval_reflector_info_sequence_diagram.png"
alt="UML sequence diagram of the information retrieval process for Reflector." />
<figcaption aria-hidden="true">UML sequence diagram of the information
retrieval process for <code>Reflector</code>.</figcaption>
</figure>
<hr />
<h2 id="retroformer">Retroformer</h2>
<h3 id="pseudo-code">Pseudo Code</h3>
<p>Figure 1 shows the RLHF-based training process, which ultimately
served as a guide for deciding whether <code>Retroformer</code>
would be trained on one or on multiple environments.</p>
<h4 id="figure-1-1">Figure 1</h4>
<p>Diagram outlining the RLHF-based training process, consisting of
three steps: obtaining training data, training a reward model with
supervised learning, and fine-tuning the <em>retrospective model</em>
through response ranking using PPO.</p>
<figure>
<img
src="img/background/descripcionArquitecturas/retroformer/learning_algoritm.png"
alt="RLHF-based training process" />
<figcaption aria-hidden="true">RLHF-based training process</figcaption>
</figure>
<hr />
<h3 id="architecture-description-using-falaa-1">Architecture Description
using <code>FALAA</code></h3>
<h4 id="conceptual-description-level">(1) Conceptual Description
Level</h4>
<p>Figure 2 shows the UML class diagram of a <code>Retroformer</code>
agent.</p>
<h4 id="figure-2-1">Figure 2</h4>
<p>UML class diagram of a <code>Retroformer</code> agent.</p>
<figure>
<img src="img/abstraccion_arquitecturas/retroformer/class_diagram.png"
alt="UML class diagram of a Retroformer agent" />
<figcaption aria-hidden="true">UML class diagram of a
<code>Retroformer</code> agent</figcaption>
</figure>
<hr />
<h3 id="components-description">Components Description</h3>
<p><code>Retroformer</code> originally proposes three fundamental
components: an <em>Actor</em>, a <em>Retrospective model</em>, and a
memory module (short-term and long-term memory, as well as a <em>replay
buffer</em>). Along with these components, a reward model is trained and
used. Each component is presented below, providing the original
definition and its respective standardized adaptation:</p>
<ol type="1">
<li><strong>Actor</strong>
<ul>
<li><strong>Original Definition</strong>: As in <code>Reflexion</code>, the
<strong>Actor</strong> is described as the component responsible
for solving the assigned task, using an LLM to generate and execute
actions in an environment.</li>
<li><strong>Adaptation in <code>FALAA</code></strong>:
<ul>
<li>Defined as an additional component, consisting of two main
subcomponents: the <code>Planner</code>, which is responsible for
planning actions, and the <code>Executor</code>, which executes
them.</li>
<li>The original concept of “solving the task” is maintained but adapted
to fit within the <code>FALAA</code> structure for better integration
and compatibility with other agents.</li>
</ul></li>
</ul></li>
<li><strong>Retrospective Model</strong>
<ul>
<li><strong>Original Definition</strong>: It consists of an LLM smaller
than that of the <em>Actor</em> and is responsible for producing
reflections (<em>feedback</em>) when the <em>Actor</em> fails at the
task.</li>
<li><strong>Adaptation in <code>FALAA</code></strong>:
<ul>
<li>Associated with the <code>Reflector</code> component, whose function
is to generate reflections and insights on failures or possible
improvements in the <code>Actor</code>’s strategy.</li>
</ul></li>
</ul></li>
<li><strong>Reward Model</strong>
<ul>
<li><strong>Original Definition</strong>: Specifically trained by
<code>Retroformer</code> to evaluate the quality of reflections
generated by the <em>Retrospective model</em>.</li>
<li><strong>Adaptation in <code>FALAA</code></strong>:
<ul>
<li>Designated as an independent object within the architecture, similar
to LLMs, dedicated to evaluation and reward assignment.</li>
</ul></li>
</ul></li>
<li><strong>Memory Module</strong>
<ul>
<li><strong>Original Definition</strong>:
<ul>
<li><strong>Short-term memory</strong>: Stores the trajectory (actions,
observations, rewards) generated by the <em>Actor</em>.</li>
<li><strong>Long-term memory</strong>: Saves reflections produced by the
<em>Retrospective model</em>.</li>
<li><strong>Replay buffer</strong>: Specialized memory for storing
triplets of prompts, reflections, and the accumulated reward of the
episode.</li>
</ul></li>
<li><strong>Adaptation in <code>FALAA</code></strong>:
<ul>
<li><code>Short-term memory</code>: Under the standard structure, it
stores a series of attributes, including the trajectory generated by the
<code>Actor</code> and the accumulated rewards of that trajectory, as
well as attributes aimed at the training process of
<code>Retroformer</code>.</li>
<li><code>Long-term memory for reflections</code>: Reflections are
stored in the <code>semantic memory</code> due to its nature of storing
knowledge.</li>
<li><code>Replay buffer</code>: Established as a dictionary that stores
triplets of prompts, reflections, and their classification as correct
or incorrect for various environments (see the sketch after this list).
Stored in the <code>episodic memory</code> due to its nature of storing
past experiences.</li>
</ul></li>
</ul></li>
</ol>
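<p>As a minimal sketch of the memory adaptations above (assuming a Python
implementation; the names <code>ReplayEntry</code> and
<code>ReplayBuffer</code> are illustrative, not part of the original
architecture), the replay buffer can be pictured as a per-environment
dictionary of labeled prompt-reflection triplets:</p>
<pre><code>from dataclasses import dataclass, field


@dataclass
class ReplayEntry:
    """One stored triplet: the reflection prompt, the reflection, and its label."""
    prompt: str
    reflection: str
    correct: bool  # whether the reflection was classified as correct


@dataclass
class ReplayBuffer:
    """Replay buffer kept in the episodic memory, grouped by environment."""
    entries: dict[str, list[ReplayEntry]] = field(default_factory=dict)

    def add(self, env_name: str, entry: ReplayEntry) -&gt; None:
        self.entries.setdefault(env_name, []).append(entry)
</code></pre>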
<hr />
<p><code>Retroformer</code> utilizes all structures proposed by
<code>FALAA</code>, with the <code>procedural memory</code> playing a
key role due to the architecture’s training processes. This memory in
the <code>Retroformer</code> architecture can execute <em>learning
actions</em> through methods that implicitly update stored information,
modifying the policies of the LLM and of the neural network-based reward model.</p>
<p>The training process also influences the
<code>short-term memory</code>, which stores additional information such as
<em>train_data</em>. The trajectory stored in the
<code>short-term memory</code> also undergoes modifications, adopting a
sequence that now includes the elements <code>Action</code>,
<code>Observation</code>, and <code>Reward</code>, reflecting the new
architectural needs for a more structured and efficient process.</p>
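<p>A minimal sketch of this <code>short-term memory</code> layout, assuming a
Python implementation (the class and attribute names below, such as
<code>Step</code>, are illustrative rather than prescribed by the
architecture; only <em>train_data</em> and the Action, Observation, and
Reward elements are named above):</p>
<pre><code>from dataclasses import dataclass, field


@dataclass
class Step:
    """One trajectory element: Action, Observation, and the step Reward."""
    action: str
    observation: str
    reward: float


@dataclass
class ShortTermMemory:
    task: str = ""
    trajectory: list[Step] = field(default_factory=list)
    train_data: list = field(default_factory=list)  # attributes used by the training process

    def add_step(self, step: Step) -&gt; None:
        self.trajectory.append(step)

    def accumulated_reward(self) -&gt; float:
        return sum(step.reward for step in self.trajectory)

    def reset(self) -&gt; None:
        self.task = ""
        self.trajectory.clear()
        self.train_data.clear()
</code></pre>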
<hr />
<p>We define several objects representing the prompts and examples used by
<code>Planner</code> and <code>Reflector</code> to create actions and
reflections, respectively. This makes their composition explicit, since in
implementations they are often plain text strings or token lists.</p>
<p>The <code>Evaluator</code> component plays a crucial role within the
architecture. Since it is responsible for generating evaluations and
thus rewards, it integrates the reward model dedicated to evaluating the
quality of reflections. <code>Evaluator</code> offers several functional
methods, including:</p>
<ul>
<li>Evaluating an action-observation pair through a reward function.</li>
<li>Calculating the sum of rewards in a given trajectory.</li>
<li>Determining whether an action corresponds to a final response.</li>
<li>Assessing the quality of a reflection.</li>
</ul>
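<p>A minimal sketch of this interface, assuming a Python implementation; the
method names and the <code>reward_function</code>/<code>reward_model</code>
callables are assumptions for illustration, not the original API:</p>
<pre><code>class Evaluator:
    """Groups the duties listed above: step rewards, trajectory rewards,
    final-response detection, and reflection scoring."""

    def __init__(self, reward_function, reward_model):
        self.reward_function = reward_function  # scores an (action, observation) pair
        self.reward_model = reward_model        # pre-trained neural model that rates reflections

    def evaluate_step(self, action, observation) -&gt; float:
        return self.reward_function(action, observation)

    def trajectory_reward(self, trajectory) -&gt; float:
        return sum(step.reward for step in trajectory)

    def is_final_answer(self, action) -&gt; bool:
        # Hypothetical convention: the Planner marks final responses explicitly.
        return getattr(action, "is_final", False)

    def rate_reflection(self, prompt, reflection) -&gt; float:
        return self.reward_model(prompt, reflection)
</code></pre>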
<figure>
<img
src="img/abstraccion_arquitecturas/retroformer/sequence_diagram.png"
alt="UML main sequence diagram of the Retroformer LLM-based agent." />
<figcaption aria-hidden="true">UML main sequence diagram of the
Retroformer LLM-based agent.</figcaption>
</figure>
<p>In general, the architecture of <code>Retroformer</code> partially
retains the structure of <code>Reflexion</code> but introduces
modifications and new functionality, such as the use of rewards for
each step of the trajectory and new attributes in the
<code>short-term memory</code> used in the training processes of the
reward model and the retrospective model.</p>
<p><strong>Behavior Description:</strong><br />
Figure 1 shows the main UML sequence diagram of a
<code>Retroformer</code> agent. Before interacting with the user,
<code>Retroformer</code> exists in a given environment, so its
memories must first be initialized (this fragment is identical to
<code>Reflexion</code> in Figure 2). Once initialized, an infinite loop
begins in which the agent receives a new task from the user, which is added
to the <code>short-term memory</code> before proceeding with the work
cycle (condensed under the <em>ref</em> fragment referencing Figure 3).
After this process completes, the generated response is retrieved from
the <code>short-term memory</code> and returned to the user. As with
<code>Reflexion</code>, the original work does not specify what should
happen if no response is found, so we assume that the agent performs no
additional actions. Finally, the
<code>short-term memory</code> is cleared using its <code>Reset()</code>
method, leaving the agent ready to receive a new task.</p>
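<p>A compact sketch of this outer loop, assuming Python; every method name
below (<code>initialize_memories</code>, <code>next_task</code>,
<code>work_cycle</code>, <code>get_response</code>) is a hypothetical
stand-in for the corresponding diagram message:</p>
<pre><code>def run_agent(agent, user):
    """Outer interaction loop: receive a task, run the work cycle,
    return the answer (if any), then clear the short-term memory."""
    agent.initialize_memories()               # same initialization fragment as in Reflexion
    while True:                               # infinite interaction loop
        task = user.next_task()               # new task from the user
        agent.short_term_memory.set_task(task)
        agent.work_cycle()                    # condensed work cycle (ref fragment)
        response = agent.short_term_memory.get_response()
        if response is not None:
            user.send(response)               # if no response is found, nothing extra is done
        agent.short_term_memory.reset()       # Reset(): ready for a new task
</code></pre>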
<hr />
<h3 id="work-cycle-of-the-retroformer-agent">Work Cycle of the
Retroformer Agent</h3>
<figure>
<img
src="img/abstraccion_arquitecturas/retroformer/work_loop_sequence_diagram.png"
alt="UML sequence diagram of the work cycle of a Retroformer agent." />
<figcaption aria-hidden="true">UML sequence diagram of the work cycle of
a <code>Retroformer</code> agent.</figcaption>
</figure>
<p>Regarding the referenced work cycle (see Figure 2), the
<code>Retroformer</code> agent maintains a work cycle similar
to that of <code>Reflexion</code>: a loop running from 1
to the maximum value <em>episode_limit</em>, which indicates the number
of failed responses allowed before ending the cycle. The loop continues
as long as the response generated in the previous iteration is not
correct. In each iteration, the <code>Actor</code> attempts to solve the
task (see Figure 3), returning the sum of the rewards accumulated in its
trajectory along with an indication of whether the task was successfully
resolved. If the task is not resolved, the process of
generating reflections is executed, introducing a new flow (see Figure
4). Finally, a new iteration begins.</p>
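<p>A minimal sketch of this work cycle, assuming Python;
<code>solve_task</code> and <code>generate_reflections</code> are
hypothetical names for the <code>Actor</code> and <code>Reflector</code>
flows referenced above:</p>
<pre><code>def work_cycle(agent, episode_limit: int) -&gt; None:
    """Up to episode_limit attempts; reflect after every failed attempt."""
    solved = False
    for episode in range(1, episode_limit + 1):
        if solved:
            break                                        # previous response was correct
        total_reward, solved = agent.actor.solve_task()  # sum of rewards + success flag
        if not solved:
            agent.reflector.generate_reflections()       # reflection flow
</code></pre>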
<hr />
<h3 id="task-resolution-process-of-the-actor-component">Task Resolution
Process of the Actor Component</h3>
<p>The task resolution process of the <code>Actor</code>, as shown in
Figure 3, remains as general as possible, since <code>Retroformer</code>
allows replacing the <code>Actor</code> with components from other
agents. The task resolution flow is based on two parts:</p>
<ol type="1">
<li><p><strong>Part One</strong>: A loop with a maximum of
<em>step_limit</em> steps, where in each step, <code>Planner</code> is
asked to generate an action (the planning and corresponding information
retrieval process for prompt creation is the same as in
<code>Reflexion</code>, see Figures 5 and 6, respectively). The
generated action is then evaluated to determine if it is the final
response, which would end the loop. If not, <code>Executor</code>
executes the action, receiving an observation that is added to the
trajectory in the <code>short-term memory</code> (the process is the
same as in <code>Reflexion</code>, see Figure 7). Then, the
action-observation pair is evaluated using the reward function of
<code>Evaluator</code>, returning a score for this step, which is
converted into a <em>Reward</em> object and added to the
<code>short-term memory</code>.</p></li>
<li><p><strong>Part Two</strong>: Occurs when the loop ends, either by
reaching the step limit or by finding the correct response. The current
episode is then evaluated, where the accumulated rewards from the
current trajectory are obtained, and it is determined whether the
response is correct or not, returning these two values.</p></li>
</ol>
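<p>A minimal sketch of this two-part flow, assuming Python and reusing the
hypothetical <code>Step</code> and <code>Evaluator</code> sketches above;
none of these names come from the original implementation:</p>
<pre><code>def solve_task(actor, step_limit: int):
    """Part one: plan, check for a final response, execute, and score each step.
    Part two: evaluate the episode and return (total reward, solved?)."""
    memory = actor.short_term_memory
    for _ in range(step_limit):
        action = actor.planner.plan()                      # prompt built as in Reflexion
        if actor.evaluator.is_final_answer(action):
            break                                          # a final response ends the loop
        observation = actor.executor.execute(action)       # observation added to the trajectory
        score = actor.evaluator.evaluate_step(action, observation)
        memory.add_step(Step(action, observation, score))  # Reward object stored per step
    total_reward = actor.evaluator.trajectory_reward(memory.trajectory)
    solved = actor.evaluator.is_solved(memory.trajectory)  # hypothetical correctness check
    return total_reward, solved
</code></pre>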
<figure>
<img
src="img/abstraccion_arquitecturas/retroformer/actor_sequence_diagram.png"
alt="UML sequence diagram of the task resolution process by the Actor component of a Retroformer agent." />
<figcaption aria-hidden="true">UML sequence diagram of the task
resolution process by the <code>Actor</code> component of a
<code>Retroformer</code> agent.</figcaption>
</figure>
<hr />
<h3 id="reflection-process-of-the-reflector-component">Reflection
Process of the Reflector Component</h3>
<figure>
<img
src="img/abstraccion_arquitecturas/retroformer/reflexion_sequence_diagram.png"
alt="UML sequence diagram of the reflection process of the Reflector component of a Retroformer agent." />
<figcaption aria-hidden="true">UML sequence diagram of the reflection
process of the <code>Reflector</code> component of a
<code>Retroformer</code> agent.</figcaption>
</figure>
<p>The process of selecting the best reflection by the
<code>Reflector</code> component of a <code>Retroformer</code> agent,
shown in Figure 1, is the main innovation of this architecture. It
illustrates how <code>Reflector</code> generates
<code>reflection_tries</code> reflections in a loop. In each step of
this loop:</p>
<ol type="1">
<li>A reflection is generated using the <code>reflect()</code> method (see
Figure 2).</li>
<li>A prompt composed of information from the short-term and long-term
memories (see Figure 3) is used to generate the reflection, which is then
returned.</li>
</ol>
<p>Once the reflection is generated, it is evaluated using the
<code>Evaluator</code> component, which acts as an intermediary. Using
the prompt and the generated reflection, <code>Evaluator</code> applies
its pre-trained neural network model to assess the quality of the
reflection. If the reflection is better than the best one stored so far,
it is saved as the best reflection in the
<code>short-term memory</code>. At the end of the cycle, the best
reflection is stored in the <code>semantic memory</code>, ensuring that
the storage limit is not exceeded.</p>
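<p>A minimal sketch of this selection loop, assuming Python; the memory
accessors and <code>rate_reflection</code> are hypothetical names for the
interactions shown in the diagram:</p>
<pre><code>def select_best_reflection(reflector, evaluator, memory, reflection_tries: int) -&gt; None:
    """Generate several candidate reflections and keep only the best one."""
    best_score = float("-inf")
    best_reflection = None
    for _ in range(reflection_tries):
        prompt, reflection = reflector.reflect()               # prompt from short- and long-term memory
        score = evaluator.rate_reflection(prompt, reflection)  # pre-trained reward model
        if score &gt; best_score:                                 # better than the best stored so far
            best_score = score
            best_reflection = reflection
            memory.short_term.set_best_reflection(reflection)
    memory.semantic.add(best_reflection)                       # respecting the storage limit
</code></pre>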
<hr />
<h3 id="best-reflection-selection-process">Best Reflection Selection
Process</h3>
<figure>
<img
src="img/abstraccion_arquitecturas/retroformer/multi_reflection_sequence_diagram.png"
alt="UML sequence diagram of the best reflection selection process for the Reflector component of a Retroformer agent." />
<figcaption aria-hidden="true">UML sequence diagram of the best
reflection selection process for the <code>Reflector</code> component of
a <code>Retroformer</code> agent.</figcaption>
</figure>
<hr />
<h3 id="information-retrieval-process-for-reflection">Information
Retrieval Process for Reflection</h3>
<figure>
<img
src="img/abstraccion_arquitecturas/retroformer/retrieval_reflector_info_sequence_diagram.png"
alt="UML sequence diagram of the information retrieval process for reflection in the Retroformer agent." />
<figcaption aria-hidden="true">UML sequence diagram of the information
retrieval process for reflection in the <code>Retroformer</code>
agent.</figcaption>
</figure>
<p>In this flow, <code>Reflector</code> uses internal
<code>retrieval</code> actions, represented by the <em>get</em> methods
of the <code>episodic</code> and <code>procedural</code> memories, from
which examples for the LLM are obtained, along with the allowed actions
and the instructions that guide the behavior of the LLM. These, together
with the current task and trajectory retrieved from the
<code>short-term memory</code>, are used to generate the prompt. This
prompt, before being returned, is stored in the
<code>short-term memory</code>.</p>
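<p>A minimal sketch of this prompt composition, assuming Python; the
<em>get</em> accessors and the <code>ReflectorPrompt</code> container are
illustrative assumptions (in implementations the prompt is often a plain
text string or a token list):</p>
<pre><code>from collections import namedtuple

# Hypothetical container making the prompt composition explicit.
ReflectorPrompt = namedtuple(
    "ReflectorPrompt", ["examples", "actions", "instructions", "task", "trajectory"]
)


def build_reflection_prompt(agent):
    """Compose the Reflector prompt from the episodic, procedural, and
    short-term memories, storing it before returning it."""
    examples = agent.episodic_memory.get_examples()             # retrieval action
    actions = agent.procedural_memory.get_allowed_actions()     # retrieval action
    instructions = agent.procedural_memory.get_instructions()   # retrieval action
    task = agent.short_term_memory.get_task()
    trajectory = agent.short_term_memory.get_trajectory()
    prompt = ReflectorPrompt(examples, actions, instructions, task, trajectory)
    agent.short_term_memory.store_prompt(prompt)                # stored before being returned
    return prompt
</code></pre>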
<hr />
<h3 id="training-process-of-the-retroformer-agent">Training Process of
the Retroformer Agent</h3>
<figure>
<img
src="img/abstraccion_arquitecturas/retroformer/train_main_sequence_diagram.png"
alt="Main UML sequence diagram of the RLHF-based training process of a Retroformer agent." />
<figcaption aria-hidden="true">Main UML sequence diagram of the
RLHF-based training process of a <code>Retroformer</code>
agent.</figcaption>
</figure>
<p><strong>Training Process:</strong><br />
<code>Retroformer</code> employs an RLHF-based training method: it
trains a neural network model using supervised learning to evaluate
reflections, and then fine-tunes the LLM contained in
<code>Reflector</code>. The training process consists of the three RLHF
steps:</p>
<ol type="1">
<li><strong>Data Collection</strong>: Figure 4 illustrates this step.</li>
<li><strong>Training a Reward Model</strong>: Figure 5 illustrates this step.</li>
<li><strong>Fine-Tuning the Reflector</strong>: Figure 6 illustrates this step.</li>
</ol>
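<p>A high-level sketch of how these three steps chain together, assuming
Python; the helper callables are hypothetical placeholders for the flows
detailed in the following diagrams:</p>
<pre><code>def train_retroformer(agent, environments, collect_training_data, fine_tune_with_ppo):
    """The three RLHF steps, in order; the two helpers are passed in as callables."""
    # Step 1: collect labeled (prompt, reflection) data by running episodes.
    replay_buffer = collect_training_data(agent, environments)
    # Step 2: supervised training of the neural reward model on that data.
    agent.evaluator.reward_model.fit(replay_buffer)
    # Step 3: PPO fine-tuning of the Reflector LLM, guided by the reward model.
    fine_tune_with_ppo(agent.reflector, agent.evaluator, replay_buffer)
</code></pre>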
<hr />
<h3 id="data-collection-process">Data Collection Process</h3>
<figure>
<img
src="img/abstraccion_arquitecturas/retroformer/train_step_1_sequence_diagram.png"
alt="UML sequence diagram of Step 1: Data collection for a Retroformer agent." />
<figcaption aria-hidden="true">UML sequence diagram of Step 1: Data
collection for a <code>Retroformer</code> agent.</figcaption>
</figure>
<p>To avoid human intervention in labeling good and bad reflections, the
difference between the accumulated rewards of two successive episodes is
used as a signal to classify reflections into good and bad
categories.</p>
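<p>Under one natural reading of this rule (an assumption, since the text
above only calls the reward difference a signal), the labeling can be
sketched as:</p>
<pre><code>def label_reflection(reward_before: float, reward_after: float) -&gt; bool:
    """A reflection is labeled good when the accumulated reward of the
    episode that follows it improves on the previous episode's reward."""
    return reward_after - reward_before &gt; 0.0
</code></pre>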
<hr />
<h3 id="parallel-reflection-and-replay-buffer-storage">Parallel
Reflection and Replay Buffer Storage</h3>
<figure>
<img
src="img/abstraccion_arquitecturas/retroformer/train_parallel_reflect_sequence_diagram.png"
alt="UML sequence diagram of the flow that creates training data based on two reflections of a Retroformer agent." />
<figcaption aria-hidden="true">UML sequence diagram of the flow that
creates training data based on two reflections of a
<code>Retroformer</code> agent.</figcaption>
</figure>
<figure>
<img
src="img/abstraccion_arquitecturas/retroformer/train_fill_replay_buffer_sequence_diagram.png"
alt="UML sequence diagram showing the process of filling the replay buffer of the episodic memory in a Retroformer agent." />
<figcaption aria-hidden="true">UML sequence diagram showing the process
of filling the <em>replay buffer</em> of the
<code>episodic memory</code> in a <code>Retroformer</code>
agent.</figcaption>
</figure>
<hr />
<h3 id="reward-model-training-process">Reward Model Training
Process</h3>
<figure>
<img
src="img/abstraccion_arquitecturas/retroformer/train_step_2_sequence_diagram.png"
alt="UML sequence diagram of Step 2 in RLHF-based training of a neural network reward model for a Retroformer agent." />
<figcaption aria-hidden="true">UML sequence diagram of Step 2 in
RLHF-based training of a neural network reward model for a
<code>Retroformer</code> agent.</figcaption>
</figure>
<hr />
<h3 id="reflector-fine-tuning-process">Reflector Fine-Tuning
Process</h3>
<figure>
<img
src="img/abstraccion_arquitecturas/retroformer/train_step_3_sequence_diagram.png"
alt="UML sequence diagram of Step 3 in RLHF-based training of a Retroformer agent, where the LLM of the Reflector is trained using reinforcement learning." />
<figcaption aria-hidden="true">UML sequence diagram of Step 3 in
RLHF-based training of a <code>Retroformer</code> agent, where the LLM
of the <code>Reflector</code> is trained using reinforcement
learning.</figcaption>
</figure>
<hr />
<h3 id="reflector-fine-tuning-using-ppo-algorithm">Reflector Fine-Tuning
Using PPO Algorithm</h3>
<figure>
<img
src="img/abstraccion_arquitecturas/retroformer/train_reflector_sequence_diagram.png"
alt="UML sequence diagram where, given a reflection, it is evaluated using the reward model of Evaluator, and this score is then used to update the policy of the LLM in the Reflector component using the PPO algorithm." />
<figcaption aria-hidden="true">UML sequence diagram where, given a
reflection, it is evaluated using the reward model of
<code>Evaluator</code>, and this score is then used to update the policy
of the LLM in the <code>Reflector</code> component using the PPO
algorithm.</figcaption>
</figure>
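<p>Purely as an illustrative sketch of the policy update named in the
caption (not the original training code), the snippet below computes the
standard clipped PPO surrogate loss with plain PyTorch; the log-probabilities
are assumed to come from the <code>Reflector</code> LLM before and after the
update, and the advantages from the reward-model scores:</p>
<pre><code>import torch


def ppo_clip_loss(new_logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -&gt; torch.Tensor:
    """Clipped PPO surrogate loss (to be minimized) for the policy update."""
    ratio = torch.exp(new_logprobs - old_logprobs)  # importance ratio pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()    # negative sign: maximize the surrogate
</code></pre>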