Partially Observable Markov Decision Processes (POMDPs)
Description
At (discrete) time step $t$, the environment is assumed to be in some state $X_t$. The agent then performs an action (control) $A_t$, whereupon the environment (stochastically) transitions to a new state $X_{t+1}$. The agent does not observe the environment state directly; instead it receives an observation $Y_t$, which is some (stochastic) function of $X_t$. (If $Y_t = X_t$, the POMDP reduces to a fully observed MDP.) In addition, the agent receives a special scalar signal called the reward, $R_t$. The POMDP is characterized by the state transition function $P(X_{t+1} \mid X_t, A_t)$, the observation function $P(Y_t \mid X_t, A_{t-1})$, and the reward function $E(R_t \mid X_t, A_{t-1})$. The goal of the agent is to learn a policy $\pi$ that maps the observation history (trajectory) to an action $A_t$ so as to maximize the policy's value, i.e., its expected cumulative reward.
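Because the agent never sees $X_t$ directly, it typically maintains a belief (a distribution over states) and updates it after each action–observation pair using the transition and observation functions above. The snippet below is a minimal sketch of this belief update, assuming generic array-based models; the names (`T`, `O`, `belief_update`) are illustrative and not tied to any particular library.

```python
# Sketch of the POMDP quantities from the description, using NumPy arrays.
# Conventions (illustrative):
#   T[a, x, x'] = P(X_{t+1} = x' | X_t = x, A_t = a)   -- transition function
#   O[a, x, y]  = P(Y_t = y | X_t = x, A_{t-1} = a)    -- observation function
import numpy as np

def belief_update(belief, action, observation, T, O):
    """Bayes filter: fold one (action, observation) pair into the belief."""
    predicted = belief @ T[action]                    # sum_x P(x'|x,a) b(x)
    updated = predicted * O[action][:, observation]   # weight by P(y|x',a)
    return updated / updated.sum()                    # renormalize

# Tiny 2-state, 2-action, 2-observation example with random models.
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(2), size=(2, 2))            # rows T[a, x, :] sum to 1
O = rng.dirichlet(np.ones(2), size=(2, 2))            # rows O[a, x, :] sum to 1
b = np.array([0.5, 0.5])                              # uniform initial belief
b = belief_update(b, action=0, observation=1, T=T, O=O)
print(b)
```

A policy for the POMDP can then be viewed as a mapping from such beliefs (or from the raw observation history) to actions, which is the object the algorithms in the table below compute or approximate.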
Parameters
No parameters found.
Table of Algorithms
Name | Year | Time | Space | Approximation Factor | Model | Reference |
---|---|---|---|---|---|---|
Hauskrecht | 2000 | | | | Deterministic | |
Pineau, Gordon & Thrun | 2003 | | | | Deterministic | |
Braziunas & Boutilier | 2004 | | | | Deterministic | |
Poupart | 2005 | | | | Deterministic | |
Smith & Simmons | 2005 | | | | Deterministic | |
Spaan & Vlassis | 2005 | | | | Deterministic | |
Satia & Lave | 1973 | | | | Deterministic | |
Washington | 1997 | | | | Deterministic | |
Barto, Bradtke & Singh | 1995 | | | | Deterministic | |
Paquet, Tobin & Chaib-draa | 2005 | | | | Deterministic | |
McAllester & Singh | 1999 | | | | Deterministic | |
Bertsekas & Castanon | 1999 | | | | Deterministic | |
Shani, Brafman & Shimony | 2005 | | | | Deterministic | |