IRL-Fundamentals

Implementation of the feasibility-based results from Andrew Ng and Stuart Russell's seminal IRL paper (https://ai.stanford.edu/~ang/papers/icml00-irl.pdf) for recovering an MDP's reward function given its optimal policy \pi*.

Note: add the "Helper functions" sub-folder to the MATLAB path.

Notation: X denotes the number of states, A the number of actions.

The simulations show how regularization can shrink the feasible set by up to 30% (and increase precision).

  • Main file: main_irl.m

  • Used full-backup policy iteration to compute the unique deterministic optimal MDP policy (policy_iteration.m); see the sketch after this list.

  • IRL analysis: (a) Vanilla implementation: Eq. 4 in Theorem 3 (Figure 1)

    (b) Intuitive discriminatory testing: ensuring V^{\pi*} >= V^{\pi} for all deterministic policies \pi (Figure 2). In words, finding all reward functions under which the value of \pi* is at least as good as that of any other deterministic policy \pi.

    • Same computational and memory complexity as (a)

    (c) One-state deviation testing: Eq. 4 in Theorem 3 in the paper, but restricted to policies that deviate from the optimal policy in only ONE state (Figure 3). It can be shown that (c) is equivalent to (a); see the constraint-matrix sketch after this list.

    • Massive drop in computational and memory complexity: the resulting constraint matrix has dimensions [X*(A-1) x X], compared to [(A^X - 1) x X], which grows exponentially with the number of states X.
    • Performance EQUIVALENT to (a)

    (d) Implementation of (c) with regularization: 3 vertically concatenated variants of the constraint matrix in (c). The first copy is the same as (c); the second copy is the constraint matrix of (c) + L*I (regularization term); the third copy is the constraint matrix of (c) - L*I (regularization term). The sketch after this list shows the stacking.

    • Intuition: by the smoothness and continuity of the optimal policy with respect to the reward function, one can expect the optimal policy to stay optimal even if the reward is tweaked slightly. (d) tests the rewards R, R + L, and R - L for which the optimal policy is \pi*.
    • 3x the memory and computational complexity of (c), but yields a feasible set up to 30% smaller (increased precision).
    • Be careful with L: too large an L will cause the true reward to fail the feasibility test.
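
A minimal MATLAB sketch of the full-backup policy iteration mentioned above. The function name, the X-by-X-by-A layout of the transition array P, the state-reward vector R, and gamma are illustrative assumptions and need not match the conventions in policy_iteration.m.

```matlab
% Full-backup policy iteration (illustrative sketch, not the repo's policy_iteration.m).
% Assumed conventions:
%   P       - X-by-X-by-A array, P(i,j,a) = Pr(next state j | state i, action a)
%   R       - X-by-1 state reward vector
%   gamma   - discount factor in (0,1)
%   pi_star - X-by-1 vector of optimal deterministic actions
function pi_star = policy_iteration_sketch(P, R, gamma)
    [X, ~, A] = size(P);
    pi_star = ones(X, 1);                 % arbitrary initial deterministic policy
    while true
        % Policy evaluation (full backup): solve (I - gamma * P_pi) * V = R exactly
        P_pi = zeros(X, X);
        for i = 1:X
            P_pi(i, :) = P(i, :, pi_star(i));
        end
        V = (eye(X) - gamma * P_pi) \ R;

        % Policy improvement: greedy one-step lookahead in every state
        Q = zeros(X, A);
        for a = 1:A
            Q(:, a) = R + gamma * P(:, :, a) * V;
        end
        [~, pi_new] = max(Q, [], 2);

        if isequal(pi_new, pi_star)       % policy stable => optimal
            break;
        end
        pi_star = pi_new;
    end
end
```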

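The feasibility condition behind (a) and (c) is Eq. 4 of Theorem 3 in the paper: (P_{\pi*}(i) - P_a(i)) (I - \gamma P_{\pi*})^{-1} R >= 0 for every state i and every non-optimal action a. The sketch below shows one way the [X*(A-1) x X] constraint matrix of (c) could be assembled, followed by the three-copy stacking of (d). The function name and input conventions are assumptions, and the +/- L*I stacking as literally described requires a square constraint matrix (e.g., A = 2); it is not a copy of main_irl.m.

```matlab
% One-state-deviation constraint matrix T of (c):
% feasible rewards satisfy T * R >= 0 elementwise (Eq. 4, Theorem 3).
% Assumed conventions (illustrative, may differ from main_irl.m):
%   P       - X-by-X-by-A transition array
%   gamma   - discount factor
%   pi_star - X-by-1 optimal deterministic policy
function T = build_constraints_sketch(P, gamma, pi_star)
    [X, ~, A] = size(P);
    P_opt = zeros(X, X);
    for i = 1:X
        P_opt(i, :) = P(i, :, pi_star(i));   % transition row under pi*
    end
    M = inv(eye(X) - gamma * P_opt);         % (I - gamma * P_pi*)^{-1}

    T = zeros(X * (A - 1), X);               % one row per (state, non-optimal action) pair
    row = 0;
    for i = 1:X
        for a = 1:A
            if a == pi_star(i), continue; end
            row = row + 1;
            T(row, :) = (P_opt(i, :) - P(i, :, a)) * M;
        end
    end
end

% Regularized stacking of (d), read literally from the description above.
% NOTE: T + L*eye(X) is conformable only when T is square (e.g., A = 2); for A > 2,
% a replicated identity such as kron(ones(A-1,1), eye(X)) would be one plausible substitute.
% L is the small regularization scalar discussed above.
% T_reg    = [T; T + L * eye(X); T - L * eye(X)];
% feasible = all(T_reg * R >= 0);            % feasibility test for a candidate reward R
```
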
Notes: I initially expected (b) to outperform (a), but simulations showed otherwise. Because of the beautiful infinite-horizon structure, the best one can do is ensure that a one-step sub-optimal action followed by the optimal policy is worse off than following the optimal policy throughout, for every sub-optimal deviation; implementation (a) tests exactly this condition. I realized (a) is a subset of (b), since a one-step sub-optimal choice being worse than the optimal choice from step 1 implies that infinitely many steps of sub-optimal choices are also worse than the optimal policy. A short derivation is written out below.
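
The note above is the standard policy-improvement argument; the following is my paraphrase of it, not text from the paper.

```latex
% Policy-improvement argument: the one-step deviation condition tested by (a)
% implies the all-policy condition tested by (b).
\[
  V^{\pi^*}(s) \;\ge\; R(s) + \gamma\, P_a(s)\, V^{\pi^*} \quad \forall s, a
  \;\;\Longleftrightarrow\;\;
  V^{\pi^*} \;\ge\; T_\pi V^{\pi^*} \ \text{for every deterministic policy } \pi .
\]
\[
  \text{Since each } T_\pi \text{ is monotone and a } \gamma\text{-contraction:}\quad
  V^{\pi^*} \;\ge\; T_\pi V^{\pi^*} \;\ge\; T_\pi^2 V^{\pi^*} \;\ge\; \dots \;\to\; V^{\pi},
  \ \text{hence } V^{\pi^*} \ge V^{\pi}.
\]
```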
