simulate_POMDP function

Simulate Trajectories Through a POMDP

Simulate trajectories through a POMDP. The start state for each trajectory is randomly chosen using the specified belief. The belief is used to choose actions from an epsilon-greedy policy and is then updated using the received observations.

simulate_POMDP(
  model,
  n = 1000,
  belief = NULL,
  horizon = NULL,
  epsilon = NULL,
  delta_horizon = 0.001,
  digits = 7L,
  return_beliefs = FALSE,
  return_trajectories = FALSE,
  engine = "cpp",
  verbose = FALSE,
  ...
)
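
The effect of the epsilon-greedy action choice described above can be seen by contrasting a purely greedy simulation with a fully randomized one. This is a minimal sketch assuming the Tiger model shipped with the pomdp package (also used in the Examples below); the reported rewards vary with the random seed:

library("pomdp")
data(Tiger)
sol <- solve_POMDP(Tiger, horizon = 5, discount = 1, method = "enum")

# epsilon = 0: always follow the solved policy (the default for solved models)
simulate_POMDP(sol, n = 10, epsilon = 0)$avg_reward

# epsilon = 1: ignore the policy and choose actions at random
simulate_POMDP(sol, n = 10, epsilon = 1)$avg_reward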

Arguments

  • model: a POMDP model.
  • n: number of trajectories.
  • belief: probability distribution over the states for choosing the starting states for the trajectories. Defaults to the start belief state specified in the model or "uniform".
  • horizon: number of epochs for the simulation. If NULL, then the horizon of a finite-horizon model is used. For infinite-horizon problems, a horizon is calculated using the discount factor (see Details).
  • epsilon: the probability of choosing a random action under the epsilon-greedy policy. The default is 0 for solved models and 1 for unsolved models.
  • delta_horizon: precision used to determine the horizon for infinite-horizon problems.
  • digits: the number of digits used to round the probabilities of the reported belief points.
  • return_beliefs: logical; return all visited belief states? This requires memory on the order of n x horizon.
  • return_trajectories: logical; Return the simulated trajectories as a data.frame?
  • engine: 'cpp' or 'r'; simulate using the faster C++ implementation or the native R implementation (a short call sketch follows this list).
  • verbose: report used parameters.
  • ...: further arguments are ignored.
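
A minimal sketch of how these arguments combine, reusing the solved Tiger model sol from the sketch under Usage; the argument values are arbitrary choices for illustration:

# start every trajectory from the belief c(0.5, 0.5), mix in 10% random actions,
# keep the visited belief points, and use the native R engine
sim <- simulate_POMDP(sol, n = 500, belief = c(0.5, 0.5), epsilon = 0.1,
                      return_beliefs = TRUE, engine = "r", verbose = TRUE)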

Returns

A list with elements:

  • avg_reward: The average discounted reward.
  • action_cnt: Action counts.
  • state_cnt: State counts.
  • reward: Reward for each trajectory.
  • belief_states: A matrix with the visited belief states as rows (only returned if return_beliefs = TRUE).
  • trajectories: A data.frame with the episode id, the time step, the simulated state (simulation_state), the id of the alpha vector used for the current belief (see belief_states above), the chosen action a, and the received reward r (only returned if return_trajectories = TRUE; a short access sketch follows this list).
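
The returned components can then be inspected directly. A brief sketch, continuing with the solved model sol from the sketches above (output depends on the random seed):

sim <- simulate_POMDP(sol, n = 100, return_beliefs = TRUE, return_trajectories = TRUE)
sim$avg_reward           # average discounted reward over all trajectories
sim$action_cnt           # how often each action was chosen
head(sim$belief_states)  # visited belief points, one per row
head(sim$trajectories)   # episode, time, state, alpha vector id, action, reward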

Details

Simulates n trajectories. If no simulation horizon is specified, the horizon of finite-horizon problems is used. For infinite-horizon problems with $\gamma < 1$, the simulation horizon $T$ is chosen such that the worst-case error is no more than $\delta_\text{horizon}$. That is

$$\gamma^T \frac{R_\text{max}}{\gamma} \le \delta_\text{horizon},$$

where $R_\text{max}$ is the largest possible absolute reward value used as a perpetuity starting after $T$.
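
Solving this inequality for $T$ gives the smallest horizon that satisfies the bound. The sketch below mirrors the documented bound and is not necessarily the package's internal computation; the values for gamma and R_max are placeholders:

gamma <- 0.95           # discount factor of the model
R_max <- 10             # largest absolute reward in the model
delta_horizon <- 0.001  # default precision

# gamma^T * R_max / gamma <= delta_horizon
#   =>  T >= log(delta_horizon * gamma / R_max) / log(gamma)
T_sim <- ceiling(log(delta_horizon * gamma / R_max) / log(gamma))
T_sim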

A native R implementation (engine = 'r') and a faster C++ implementation (engine = 'cpp') are available. Currently, only the R implementation supports multi-episode problems.

Both implementations support simulating trajectories in parallel using the package foreach. To enable parallel execution, a parallel backend like doParallel needs to be registered (see doParallel::registerDoParallel()). Note that parallelization adds overhead, so small simulations run slower in parallel; C++ simulations with n * horizon of less than 100,000 are therefore always executed by a single worker.
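
A minimal sketch of enabling parallel simulation, reusing the solved Tiger model sol from the sketches above (assumes the doParallel and foreach packages are installed; the number of workers depends on the machine):

# register a parallel backend and check how many workers are available
doParallel::registerDoParallel()
foreach::getDoParWorkers()

# a large simulation; with the C++ engine, n * horizon >= 100,000 is needed
# before the work is actually split across workers
sim <- simulate_POMDP(sol, n = 1e5, verbose = TRUE)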

Examples

data(Tiger)

# solve the POMDP for 5 epochs and no discounting
sol <- solve_POMDP(Tiger, horizon = 5, discount = 1, method = "enum")
sol
policy(sol)

# uncomment the following line to register a parallel backend for simulation
# (needs package doParallel installed)
# doParallel::registerDoParallel()
# foreach::getDoParWorkers()

## Example 1: simulate 100 trajectories
sim <- simulate_POMDP(sol, n = 100, verbose = TRUE)
sim

# calculate the percentage that each action is used in the simulation
round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2)

# reward distribution
hist(sim$reward)

## Example 2: look at the belief states and the trajectories starting with
#  an initial start belief.
sim <- simulate_POMDP(sol, n = 100, belief = c(.5, .5),
  return_beliefs = TRUE, return_trajectories = TRUE)
head(sim$belief_states)
head(sim$trajectories)

# plot with added density (the x-axis is the probability of the second belief state)
plot_belief_space(sol, sample = sim$belief_states, jitter = 2, ylim = c(0, 6))
lines(density(sim$belief_states[, 2], bw = .02)); axis(2); title(ylab = "Density")

## Example 3: simulate trajectories for an unsolved POMDP which uses an epsilon of 1
#  (i.e., all actions are randomized). The simulation horizon for the
#  infinite-horizon Tiger problem is calculated using delta_horizon.
sim <- simulate_POMDP(Tiger, return_beliefs = TRUE, verbose = TRUE)
sim$avg_reward
hist(sim$reward, breaks = 20)
plot_belief_space(sol, sample = sim$belief_states, jitter = 2, ylim = c(0, 6))
lines(density(sim$belief_states[, 1], bw = .05)); axis(2); title(ylab = "Density")

See Also

Other POMDP: MDP2POMDP, POMDP(), accessors, actions(), add_policy(), plot_belief_space(), projection(), reachable_and_absorbing, regret(), sample_belief_space(), solve_POMDP(), solve_SARSOP(), transition_graph(), update_belief(), value_function(), write_POMDP()

Author(s)

Michael Hahsler

  • Maintainer: Michael Hahsler
  • License: GPL (>= 3)
  • Last published: 2024-12-05