Implements several functions useful for working with MDP policies.
q_values_MDP(model, U = NULL)

MDP_policy_evaluation(pi, model, U = NULL, k_backups = 1000, theta = 0.001, verbose = FALSE)

greedy_MDP_action(s, Q, epsilon = 0, prob = FALSE)

random_MDP_policy(model, prob = NULL)

manual_MDP_policy(model, actions)

greedy_MDP_policy(Q)
Arguments
model: an MDP problem specification.
U: a vector with the value function, i.e., the state utilities (expected sum of discounted rewards from that point on). If model is a solved model, then the state utilities are taken from the solution.
pi: a policy as a data.frame with at least the columns state and action.
k_backups: number of look-ahead backups used for approximate policy evaluation in the policy iteration method. Set k_backups to Inf to use only theta as the stopping criterion.
theta: stop when the largest change in a state value is less than θ.
verbose: logical; should progress and approximation errors be printed.
s: a state.
Q: an action value function with Q-values as a state by action matrix.
epsilon: a value epsilon > 0 applies an epsilon-greedy policy, i.e., a random action is chosen with probability epsilon instead of the greedy action.
prob: for random_MDP_policy(): a probability vector used to sample random actions. For greedy_MDP_action(): a logical indicating whether action probabilities should be returned.
actions: a vector with the action (either the action label or the numeric id) for each state.
Returns
q_values_MDP() returns a state by action matrix specifying the Q-function, i.e., the action value for executing each action in each state. The Q-values are calculated from the value function (U) and the transition model.
MDP_policy_evaluation() returns a vector with (approximate) state values (U).
greedy_MDP_action() returns the action with the highest q-value for state s. If prob = TRUE, then a vector with the probability for each action is returned.
random_MDP_policy() returns a data.frame with the columns state and action to define a policy.
manual_MDP_policy() returns a data.frame with the columns state and action to define a policy.
greedy_MDP_policy() returns the greedy policy given Q.
Details
Implemented functions are:
q_values_MDP() calculates (approximates) Q-values for a given model using the Bellman optimality equation:
q(s, a) = \sum_{s'} T(s' \mid s, a) \, [R(s, a) + \gamma U(s')]
Q-values can be used as the input for several other functions.
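As a concrete illustration, here is a minimal sketch (not the package implementation) that computes Q-values for a hypothetical 2-state/2-action model from a transition array P[s', s, a], a reward matrix R[s, a], and a given value function U:

# Minimal sketch for a hypothetical toy model (not the package code):
# Q(s, a) = sum_{s'} P(s' | s, a) * (R(s, a) + gamma * U(s'))
P <- array(c(0.9, 0.1, 0.1, 0.9,      # action 1: columns are P(. | s, a)
             0.2, 0.8, 0.8, 0.2),     # action 2
           dim = c(2, 2, 2))          # dimensions: s' x s x a
R <- matrix(c(1, 0, 0, 2), nrow = 2)  # R[s, a]
U <- c(5, 6)                          # a given value function
gamma <- 0.9

Q <- sapply(1:2, function(a)
       sapply(1:2, function(s)
         sum(P[, s, a] * (R[s, a] + gamma * U))))
Q   # state by action matrix of Q-values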
MDP_policy_evaluation() evaluates a policy π for a model and returns (approximate) state values by applying the Bellman equation as an update rule for each state and iteration k:
U_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s'} T(s' \mid s, a) \, [R(s, a) + \gamma U_k(s')]
In each iteration, all states are updated. Updating is stopped after k_backups iterations or when the largest update ||U_{k+1} − U_k||_∞ falls below θ.
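The update loop can be sketched as follows (illustrative only, reusing the toy P, R, and gamma from the sketch above; the function name and arguments are made up for this example):

# Minimal sketch of iterative policy evaluation for a deterministic
# policy pi given as a vector of action ids (not the package implementation).
evaluate_policy_sketch <- function(pi, P, R, gamma,
                                   k_backups = 1000, theta = 0.001) {
  U <- numeric(nrow(R))                        # start with U_0 = 0
  for (k in seq_len(k_backups)) {
    U_new <- sapply(seq_along(U), function(s) {
      a <- pi[s]                               # action prescribed by pi in s
      sum(P[, s, a] * (R[s, a] + gamma * U))   # Bellman update for state s
    })
    delta <- max(abs(U_new - U))               # largest change in a state value
    U <- U_new
    if (delta < theta) break                   # theta stopping criterion
  }
  U
}

evaluate_policy_sketch(pi = c(1, 2), P, R, gamma = 0.9)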
greedy_MDP_action() returns the action with the largest Q-value given a state.
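For intuition, an epsilon-greedy choice from a Q matrix can be sketched like this (the helper name and the toy Q matrix are made up for illustration; the package function additionally supports returning action probabilities via prob = TRUE):

# Minimal sketch of epsilon-greedy action selection from a
# state-by-action Q matrix (not the package implementation).
egreedy_action_sketch <- function(s, Q, epsilon = 0) {
  if (epsilon > 0 && runif(1) < epsilon)
    sample(colnames(Q), 1)                # explore: pick a random action
  else
    colnames(Q)[which.max(Q[s, ])]        # exploit: action with highest Q-value
}

Q <- matrix(c(1, 0, 2, 3), nrow = 2,
            dimnames = list(c("s1", "s2"), c("up", "right")))
egreedy_action_sketch("s1", Q, epsilon = 0.1)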
random_MDP_policy(), manual_MDP_policy(), and greedy_MDP_policy() generate different policies. These policies can be added to a problem using add_policy().
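A greedy policy is essentially the row-wise argmax of the Q matrix; here is a minimal sketch (not the package implementation, using the toy Q from the sketch above) that produces the same state/action data.frame layout:

# Minimal sketch: greedy policy as the row-wise argmax of a
# state-by-action Q matrix (not the package implementation).
greedy_policy_sketch <- function(Q)
  data.frame(state  = rownames(Q),
             action = colnames(Q)[apply(Q, MARGIN = 1, which.max)])

greedy_policy_sketch(Q)   # Q from the epsilon-greedy sketch above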
Examples
data(Maze)
Maze

# create several policies:

# 1. optimal policy using value iteration
maze_solved <- solve_MDP(Maze, method = "value_iteration")
maze_solved

pi_opt <- policy(maze_solved)
pi_opt

gridworld_plot_policy(add_policy(Maze, pi_opt), main = "Optimal Policy")

# 2. a manual policy (go up and in some squares to the right)
acts <- rep("up", times = length(Maze$states))
names(acts) <- Maze$states
acts[c("s(1,1)", "s(1,2)", "s(1,3)")] <- "right"
pi_manual <- manual_MDP_policy(Maze, acts)
pi_manual

gridworld_plot_policy(add_policy(Maze, pi_manual), main = "Manual Policy")

# 3. a random policy
set.seed(1234)
pi_random <- random_MDP_policy(Maze)
pi_random

gridworld_plot_policy(add_policy(Maze, pi_random), main = "Random Policy")

# 4. an improved policy based on one policy evaluation and
# policy improvement step.
u <- MDP_policy_evaluation(pi_random, Maze)
q <- q_values_MDP(Maze, U = u)
pi_greedy <- greedy_MDP_policy(q)
pi_greedy
gridworld_plot_policy(add_policy(Maze, pi_greedy), main = "Greedy Policy")

# compare the approx. value functions for the policies (we restrict
# the number of backups for the random policy since it may not converge)
rbind(
  random  = MDP_policy_evaluation(pi_random, Maze, k_backups = 100),
  manual  = MDP_policy_evaluation(pi_manual, Maze),
  greedy  = MDP_policy_evaluation(pi_greedy, Maze),
  optimal = MDP_policy_evaluation(pi_opt, Maze)
)

# For many functions, we first add the policy to the problem description
# to create a "solved" MDP
maze_random <- add_policy(Maze, pi_random)
maze_random
# plotting
plot_value_function(maze_random)
gridworld_plot_policy(maze_random)

# compare to a benchmark
regret(maze_random, benchmark = maze_solved)

# calculate greedy actions for state 1
q <- q_values_MDP(maze_random)
q

greedy_MDP_action(1, q, epsilon = 0, prob = FALSE)
greedy_MDP_action(1, q, epsilon = 0, prob = TRUE)
greedy_MDP_action(1, q, epsilon = .1, prob = TRUE)
References
Sutton, R. S., Barto, A. G. (2020). Reinforcement Learning: An Introduction. Second edition. The MIT Press.