simulate_MDP function

Simulate Trajectories in an MDP

Simulate trajectories through an MDP. The start state for each trajectory is randomly chosen using the specified start distribution. Actions are then chosen following an epsilon-greedy policy, and the state is updated by sampling from the model's transition probabilities.
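Conceptually, each simulation step picks an epsilon-greedy action and samples the successor state from the transition model. The sketch below illustrates a single such step in base R; the objects pol (a policy with integer action indices), trans_prob (an array indexed [state, action, next state]), and rew (a matrix indexed [state, action]) are hypothetical and not part of the package API.

# one conceptual simulation step (illustrative only, not the package internals)
simulate_step <- function(s, pol, trans_prob, rew, epsilon = 0) {
  n_actions <- dim(trans_prob)[2]
  # epsilon-greedy: act randomly with probability epsilon, otherwise follow the policy
  a <- if (runif(1) < epsilon) sample.int(n_actions, 1) else pol$action[s]
  # sample the next state from the transition distribution for (s, a)
  s_prime <- sample.int(dim(trans_prob)[3], 1, prob = trans_prob[s, a, ])
  list(a = a, r = rew[s, a], s_prime = s_prime)
}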

simulate_MDP(
  model,
  n = 100,
  start = NULL,
  horizon = NULL,
  epsilon = NULL,
  delta_horizon = 0.001,
  return_trajectories = FALSE,
  engine = "cpp",
  verbose = FALSE,
  ...
)

Arguments

  • model: an MDP model.
  • n: number of trajectories.
  • start: probability distribution over the states used to choose the starting state for each trajectory. Defaults to "uniform" (a short call sketch follows this list).
  • horizon: a trajectory ends once an absorbing state is reached or after the maximal number of epochs specified via horizon. If NULL, then the horizon of the model is used.
  • epsilon: the probability of random actions used by the epsilon-greedy policy. The default is 0 for solved models and 1 for unsolved models.
  • delta_horizon: precision used to determine the horizon for infinite-horizon problems.
  • return_trajectories: logical; return the complete trajectories.
  • engine: 'cpp' or 'r' to perform the simulation using the faster C++ implementation or the native R implementation.
  • verbose: report used parameters.
  • ...: further arguments are ignored.
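For illustration, a call might pass an explicit start distribution and a small amount of exploration. The sketch below uses arbitrary probabilities and assumes that the state names of the bundled Maze model are available as Maze$states.

data(Maze)
sol <- solve_MDP(Maze, discount = 1)
# start in the first two states with equal probability, never elsewhere
start_dist <- setNames(rep(0, length(Maze$states)), Maze$states)
start_dist[1:2] <- 0.5
# simulate with 10% random (exploratory) actions
sim <- simulate_MDP(sol, n = 50, start = start_dist, epsilon = 0.1, horizon = 10)
sim$avg_reward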

Returns

A list with elements:

  • avg_reward: The average discounted reward.
  • reward: Reward for each trajectory.
  • action_cnt: Action counts.
  • state_cnt: State counts.
  • trajectories: A data.frame with the trajectories. Each row contains the episode id, the time step, the state s, the chosen action a, the reward r, and the next state s_prime. Trajectories are only returned for return_trajectories = TRUE (see the snippet after this list).
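A quick sketch of inspecting the returned list (assuming sim was created with return_trajectories = TRUE):

# average discounted reward and its spread over the individual trajectories
sim$avg_reward
summary(sim$reward)
# empirical state-visit distribution computed from the trajectories
prop.table(table(sim$trajectories$s))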

Details

A native R implementation is available (engine = 'r') and the default is a faster C++ implementation (engine = 'cpp').

Both implementations support parallel execution using the package foreach. To enable parallel execution, a parallel backend like doParallel needs to be available and registered (see doParallel::registerDoParallel()). Note that small simulations are slower when parallelization is used. Therefore, C++ simulations with n * horizon less than 100,000 are always executed using a single worker.
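A minimal sketch of running the simulation in parallel, assuming the doParallel package is installed (the number of workers and the simulation size are arbitrary choices):

# register a parallel backend with 2 workers
doParallel::registerDoParallel(2)
# n * horizon = 1,000,000 >= 100,000, so the work is distributed over the workers
sim <- simulate_MDP(sol, n = 10000, horizon = 100)
# switch back to sequential execution
foreach::registerDoSEQ()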

Examples

# enable parallel simulation
# doParallel::registerDoParallel()

data(Maze)

# solve the MDP with no discounting
sol <- solve_MDP(Maze, discount = 1)
sol

# U in the policy is an estimate of the utility of being in a state
# when following the optimal policy.
policy(sol)
gridworld_matrix(sol, what = "action")

## Example 1: simulate 100 trajectories following the policy;
## only summary statistics are returned
sim <- simulate_MDP(sol, n = 100, horizon = 10, verbose = TRUE)
sim

# Note that all simulations start at s_1 and that the simulated avg. reward
# is therefore an estimate of the U value for the start state s_1.
policy(sol)[1, ]

# calculate the proportion of actions taken in the simulation
round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2)

# reward distribution
hist(sim$reward)

## Example 2: simulate starting from a uniform distribution over all
## states and return all trajectories
sim <- simulate_MDP(sol, n = 100, start = "uniform", horizon = 10,
  return_trajectories = TRUE)
head(sim$trajectories)

# how often was each state visited?
table(sim$trajectories$s)

See Also

Other MDP: MDP(), MDP2POMDP, MDP_policy_functions, accessors, actions(), add_policy(), gridworld, reachable_and_absorbing, regret(), solve_MDP(), transition_graph(), value_function()

Author(s)

Michael Hahsler

  • Maintainer: Michael Hahsler
  • License: GPL (>= 3)
  • Last published: 2024-12-05