MDP_policy_functions

Functions for MDP Policies

Implements several functions useful for working with MDP policies.

q_values_MDP(model, U = NULL)

MDP_policy_evaluation(
  pi,
  model,
  U = NULL,
  k_backups = 1000,
  theta = 0.001,
  verbose = FALSE
)

greedy_MDP_action(s, Q, epsilon = 0, prob = FALSE)

random_MDP_policy(model, prob = NULL)

manual_MDP_policy(model, actions)

greedy_MDP_policy(Q)

Arguments

  • model: an MDP problem specification.
  • U: a vector with the value function, i.e., the state utilities (the expected sum of discounted rewards from that state on). If model is a solved model, then the state utilities are taken from the solution.
  • pi: a policy as a data.frame with at least columns for state and action.
  • k_backups: number of look-ahead steps used for approximate policy evaluation by the policy iteration method. Set k_backups to Inf to only use theta as the stopping criterion.
  • theta: stop when the largest change in a state value is less than theta.
  • verbose: logical; should progress and approximation errors be printed.
  • s: a state.
  • Q: an action value function with Q-values as a state by action matrix.
  • epsilon: an epsilon > 0 applies an epsilon-greedy policy.
  • prob: for random_MDP_policy(), a probability vector for the random actions; for greedy_MDP_action(), a logical indicating whether action probabilities should be returned.
  • actions: a vector with the action (either the action label or the numeric id) for each state.

Returns

q_values_MDP() returns a state by action matrix specifying the Q-function, i.e., the action value for executing each action in each state. The Q-values are calculated from the value function (U) and the transition model.

MDP_policy_evaluation() returns a vector with (approximate) state values (U).

greedy_MDP_action() returns the action with the highest Q-value for state s. If prob = TRUE, then a vector with the probability for each action is returned (a short from-scratch sketch of epsilon-greedy selection is shown at the end of this section).

random_MDP_policy() returns a data.frame with the columns state and action to define a policy.

manual_MDP_policy() returns a data.frame with the columns state and action to define a policy.

greedy_MDP_policy() returns the greedy policy given Q.
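To make the epsilon-greedy selection described for greedy_MDP_action() concrete, the following is a minimal from-scratch sketch of one common epsilon-greedy formulation; it is not the package's implementation, and the small Q matrix and the helper eps_greedy() are invented for illustration.

# a hypothetical 2-state x 3-action Q matrix (values are made up)
Q <- matrix(c(1.0, 0.5, 0.2,
              0.1, 0.9, 0.4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("s1", "s2"), c("up", "down", "stay")))

# epsilon-greedy selection: with probability 1 - epsilon choose the action
# with the largest Q-value, otherwise choose uniformly at random
eps_greedy <- function(s, Q, epsilon = 0, prob = FALSE) {
  p <- rep(epsilon / ncol(Q), ncol(Q))
  best <- which.max(Q[s, ])
  p[best] <- p[best] + (1 - epsilon)
  if (prob) return(setNames(p, colnames(Q)))
  sample(colnames(Q), size = 1, prob = p)
}

eps_greedy("s1", Q)                              # greedy action: "up"
eps_greedy("s1", Q, epsilon = .1, prob = TRUE)   # action probabilities

With epsilon = 0 the choice is purely greedy; with epsilon > 0 every action retains at least probability epsilon/ncol(Q) of being selected.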

Details

Implemented functions are:

  • q_values_MDP() calculates (approximates) Q-values for a given model using the Bellman optimality equation:

q(s,a) = \sum_{s'} T(s'|s,a) [R(s,a) + \gamma U(s')]

Q-values can be used as the input for several other functions.

  • MDP_policy_evaluation() evaluates a policy \pi for a model and returns (approximate) state values by applying the Bellman equation as an update rule for each state and iteration k:

U_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} T(s'|s,a) [R(s,a) + \gamma U_k(s')]

In each iteration, all states are updated. Updating is stopped after k_backups iterations or once the largest update satisfies ||U_{k+1} - U_k||_\infty < \theta. A from-scratch sketch of policy evaluation, Q-value computation, and greedy policy extraction is shown after this list.

  • greedy_MDP_action() returns the action with the largest Q-value given a state.

  • random_MDP_policy(), manual_MDP_policy(), and greedy_MDP_policy() generate different policies. These policies can be added to a problem using add_policy().
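The following is a minimal from-scratch sketch of the computations described above: iterative policy evaluation, Q-values via the Bellman optimality equation, and greedy policy extraction. The tiny two-state MDP (transition matrices Tr, reward matrix R) and the helpers evaluate_policy(), q_from_U(), and greedy_policy() are invented for illustration; they are not the package's implementation, which operates on an MDP object instead.

# a made-up MDP with two states and two actions
S <- c("s1", "s2"); A <- c("a1", "a2"); gamma <- 0.9

# transition model: Tr[[a]][s, s'] = T(s' | s, a), rows sum to 1
Tr <- list(
  a1 = matrix(c(0.8, 0.2,
                0.1, 0.9), nrow = 2, byrow = TRUE, dimnames = list(S, S)),
  a2 = matrix(c(0.5, 0.5,
                0.3, 0.7), nrow = 2, byrow = TRUE, dimnames = list(S, S)))

# reward model: R[s, a] = R(s, a)
R <- matrix(c(1, 0,
              0, 2), nrow = 2, byrow = TRUE, dimnames = list(S, A))

# iterative policy evaluation for a deterministic policy pi (state -> action);
# pi(a|s) puts all probability on pi[s], so the sum over a collapses
evaluate_policy <- function(pi, k_backups = 1000, theta = 1e-3) {
  U <- setNames(numeric(length(S)), S)
  for (k in seq_len(k_backups)) {
    U_new <- sapply(S, function(s) {
      a <- pi[s]
      sum(Tr[[a]][s, ] * (R[s, a] + gamma * U))
    })
    converged <- max(abs(U_new - U)) < theta
    U <- U_new
    if (converged) break
  }
  U
}

# Q-values from a value function U using the Bellman optimality equation
q_from_U <- function(U) {
  sapply(A, function(a) sapply(S, function(s)
    sum(Tr[[a]][s, ] * (R[s, a] + gamma * U))))
}

# greedy policy extraction: pick the action with the largest Q-value per state
greedy_policy <- function(Q) {
  data.frame(state = rownames(Q), action = colnames(Q)[apply(Q, 1, which.max)])
}

pi <- c(s1 = "a1", s2 = "a2")   # a fixed deterministic policy
U  <- evaluate_policy(pi)       # approximate state values under pi
Q  <- q_from_U(U)               # state-by-action Q matrix computed from U
greedy_policy(Q)                # one step of policy improvement

This is the same sequence (policy evaluation, Q-value computation, greedy improvement) that the example below performs with MDP_policy_evaluation(), q_values_MDP(), and greedy_MDP_policy() on the Maze problem.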

Examples

data(Maze)
Maze

# create several policies:
# 1. optimal policy using value iteration
maze_solved <- solve_MDP(Maze, method = "value_iteration")
maze_solved
pi_opt <- policy(maze_solved)
pi_opt
gridworld_plot_policy(add_policy(Maze, pi_opt), main = "Optimal Policy")

# 2. a manual policy (go up and in some squares to the right)
acts <- rep("up", times = length(Maze$states))
names(acts) <- Maze$states
acts[c("s(1,1)", "s(1,2)", "s(1,3)")] <- "right"
pi_manual <- manual_MDP_policy(Maze, acts)
pi_manual
gridworld_plot_policy(add_policy(Maze, pi_manual), main = "Manual Policy")

# 3. a random policy
set.seed(1234)
pi_random <- random_MDP_policy(Maze)
pi_random
gridworld_plot_policy(add_policy(Maze, pi_random), main = "Random Policy")

# 4. an improved policy based on one policy evaluation and
#    policy improvement step.
u <- MDP_policy_evaluation(pi_random, Maze)
q <- q_values_MDP(Maze, U = u)
pi_greedy <- greedy_MDP_policy(q)
pi_greedy
gridworld_plot_policy(add_policy(Maze, pi_greedy), main = "Greedy Policy")

# compare the approx. value functions for the policies (we restrict
# the number of backups for the random policy since it may not converge)
rbind(
  random = MDP_policy_evaluation(pi_random, Maze, k_backups = 100),
  manual = MDP_policy_evaluation(pi_manual, Maze),
  greedy = MDP_policy_evaluation(pi_greedy, Maze),
  optimal = MDP_policy_evaluation(pi_opt, Maze)
)

# For many functions, we first add the policy to the problem description
# to create a "solved" MDP
maze_random <- add_policy(Maze, pi_random)
maze_random

# plotting
plot_value_function(maze_random)
gridworld_plot_policy(maze_random)

# compare to a benchmark
regret(maze_random, benchmark = maze_solved)

# calculate greedy actions for state 1
q <- q_values_MDP(maze_random)
q
greedy_MDP_action(1, q, epsilon = 0, prob = FALSE)
greedy_MDP_action(1, q, epsilon = 0, prob = TRUE)
greedy_MDP_action(1, q, epsilon = .1, prob = TRUE)

References

Sutton, R. S., Barto, A. G. (2020). Reinforcement Learning: An Introduction. Second edition. The MIT Press.

See Also

Other MDP: MDP(), MDP2POMDP, accessors, actions(), add_policy(), gridworld, reachable_and_absorbing, regret(), simulate_MDP(), solve_MDP(), transition_graph(), value_function()

Author(s)

Michael Hahsler

  • Maintainer: Michael Hahsler
  • License: GPL (>= 3)
  • Last published: 2024-12-05