Simulate trajectories through an MDP. The start state for each trajectory is randomly drawn from the specified start distribution. In each state, an action is chosen following an epsilon-greedy policy and the state is then updated using the transition model.
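To make the mechanics concrete, here is a minimal, self-contained R sketch of one simulated trajectory for a tiny hand-coded two-state MDP. It does not use the package; the objects states, actions, P, R and greedy are made up purely for illustration.

# toy two-state, two-action MDP (not package code)
set.seed(42)
states  <- c("s1", "s2")
actions <- c("left", "right")

# transition probabilities P[[a]][s, ] and rewards R[s, a]
P <- list(
  left  = rbind(s1 = c(0.9, 0.1), s2 = c(0.5, 0.5)),
  right = rbind(s1 = c(0.1, 0.9), s2 = c(0.2, 0.8))
)
R <- rbind(s1 = c(left = 0, right = 1), s2 = c(left = 1, right = 2))

greedy  <- c(s1 = "right", s2 = "right")  # a fixed policy to follow
epsilon <- 0.1; horizon <- 10; discount <- 1

s <- sample(states, 1)                    # start state drawn uniformly
reward <- 0
for (t in seq_len(horizon)) {
  # epsilon-greedy: random action with probability epsilon, else the policy action
  a <- if (runif(1) < epsilon) sample(actions, 1) else greedy[[s]]
  reward <- reward + discount^(t - 1) * R[s, a]
  s <- sample(states, 1, prob = P[[a]][s, ])  # state update
}
reward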
start: probability distribution over the states for choosing the starting states for the trajectories. Defaults to "uniform".
horizon: a trajectory ends once an absorbing state is reached or after the maximal number of epochs specified via horizon. If NULL, then the horizon for the model is used.
epsilon: the probability of taking a random action when using an epsilon-greedy policy. The default is 0 for solved models and 1 for unsolved models (see the call sketch following this argument list).
delta_horizon: precision used to determine the horizon for infinite-horizon problems.
return_trajectories: logical; return the complete trajectories.
engine: 'cpp' or 'r' to perform simulation using a faster C++ or a native R implementation.
verbose: report used parameters.
...: further arguments are ignored.
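For orientation, a minimal call sketch using the arguments above (sol is assumed to be a model solved with solve_MDP(), as in the Examples below; the values shown are arbitrary):

# sol: a solved model, e.g., sol <- solve_MDP(Maze, discount = 1)
sim <- simulate_MDP(sol,
  n       = 50,         # number of simulated trajectories
  start   = "uniform",  # start states drawn uniformly over all states
  horizon = 20,         # stop after at most 20 epochs
  epsilon = 0,          # follow the policy greedily (default for solved models)
  engine  = "cpp",      # faster C++ implementation
  verbose = TRUE)       # report the used parameters
sim$avg_reward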
Returns
A list with elements:
avg_reward: The average discounted reward.
reward: Reward for each trajectory.
action_cnt: Action counts.
state_cnt: State counts.
trajectories: A data.frame with the trajectories. Each row contains the episode id, the time step, the state s, the chosen action a, the reward r, and the next state s_prime. Trajectories are only returned for return_trajectories = TRUE.
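When trajectories are requested, the returned data.frame can be summarized with standard R tools. A small sketch (sol is a solved model as in the Examples below; the columns s, a and r follow the description above, while the episode id column is assumed here to be named episode; check head(sim$trajectories) for the actual names):

sim <- simulate_MDP(sol, n = 10, horizon = 10, return_trajectories = TRUE)

# total reward collected in each episode (episode id column name assumed)
aggregate(r ~ episode, data = sim$trajectories, FUN = sum)

# how often was each action chosen in each state?
table(sim$trajectories$s, sim$trajectories$a)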
Details
A native R implementation is available (engine = 'r') and the default is a faster C++ implementation (engine = 'cpp').
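To get a rough feel for the speed difference between the two engines, a quick timing sketch (illustrative only; sol is a solved model as in the Examples below):

system.time(simulate_MDP(sol, n = 1000, horizon = 100, engine = "cpp"))
system.time(simulate_MDP(sol, n = 1000, horizon = 100, engine = "r"))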
Both implementations support parallel execution using the package foreach. To enable parallel execution, a parallel backend like doParallel needs to be available and registered (see doParallel::registerDoParallel()). Note that small simulations are slower when parallelization is used. Therefore, C++ simulations with n * horizon less than 100,000 are always executed using a single worker.
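A typical setup for parallel simulation might look as follows (a sketch; the packages doParallel and foreach must be installed, and the worker count is arbitrary):

# register a parallel backend with 2 workers
doParallel::registerDoParallel(cores = 2)

# n * horizon reaches 100,000 here, so the C++ engine can use the parallel workers
sim <- simulate_MDP(sol, n = 10000, horizon = 10)

# switch back to sequential execution
foreach::registerDoSEQ()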
Examples
# enable parallel simulation
# doParallel::registerDoParallel()

data(Maze)

# solve the MDP with no discounting
sol <- solve_MDP(Maze, discount = 1)
sol

# U in the policy is an estimate of the utility of being in a state
# when following the optimal policy.
policy(sol)
gridworld_matrix(sol, what = "action")

## Example 1: simulate 100 trajectories following the policy;
#             trajectories are not returned by default
sim <- simulate_MDP(sol, n = 100, horizon = 10, verbose = TRUE)
sim

# Note that all simulations start at s_1 and that the simulated avg. reward
# is therefore an estimate of the U value for the start state s_1.
policy(sol)[1, ]

# calculate the proportion of actions taken in the simulation
round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2)

# reward distribution
hist(sim$reward)

## Example 2: simulate starting from a uniform distribution over all
##            states and return all trajectories
sim <- simulate_MDP(sol, n = 100, start = "uniform", horizon = 10,
  return_trajectories = TRUE)
head(sim$trajectories)

# how often was each state visited?
table(sim$trajectories$s)