POMDPs (Partially Observable Markov Decision Processes)

Description

At each (discrete) time step $t$, the environment is assumed to be in some state $X_t$. The agent performs an action (control) $A_t$, whereupon the environment (stochastically) transitions to a new state $X_{t+1}$. The agent does not observe the environment state directly; instead it receives an observation $Y_t$, which is some (stochastic) function of $X_t$. (If $Y_t = X_t$, the POMDP reduces to a fully observed MDP.) In addition, the agent receives a special observed signal called the reward, $R_t$. The POMDP is characterized by the state transition function $P(X_{t+1}|X_t, A_t)$, the observation function $P(Y_t|X_t, A_{t-1})$, and the reward function $E(R_t|X_t, A_{t-1})$. The goal of the agent is to learn a policy $\pi$ that maps the observation history (trajectory) to an action $A_t$ so as to maximize the value of $\pi$, i.e. the expected cumulative reward.
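In practice, a POMDP agent usually summarizes the observation history with a belief state, i.e. a posterior distribution over the hidden state, updated by a Bayes filter: $b'(x') \propto P(y|x',a) \sum_x P(x'|x,a)\, b(x)$. The sketch below illustrates this for a toy discrete POMDP; the arrays `T`, `O`, `R` and the 2-state / 2-action / 2-observation sizes are hypothetical, chosen only to show the shapes of the transition, observation, and reward functions defined above and the resulting belief update.

```python
import numpy as np

# Toy discrete POMDP (all numbers are illustrative/hypothetical):
#   T[a, x, x'] = P(X_{t+1} = x' | X_t = x, A_t = a)      -- transition function
#   O[a, x', y] = P(Y_{t+1} = y | X_{t+1} = x', A_t = a)   -- observation function
#   R[a, x]     = E(R_t | X_t = x, A_{t-1} = a)            -- reward function
T = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.5, 0.5], [0.5, 0.5]]])   # action 1
O = np.array([[[0.8, 0.2], [0.3, 0.7]],    # action 0
              [[0.5, 0.5], [0.5, 0.5]]])   # action 1
R = np.array([[1.0, -1.0],                 # action 0
              [0.0,  0.0]])                # action 1

def belief_update(b, a, y):
    """Bayes filter: b'(x') is proportional to O[a, x', y] * sum_x T[a, x, x'] * b(x)."""
    b_pred = T[a].T @ b            # predict: marginalize over the previous state
    b_post = O[a][:, y] * b_pred   # correct: weight by the observation likelihood
    return b_post / b_post.sum()   # normalize to a probability distribution

def expected_reward(b, a):
    """Expected immediate reward of taking action a under belief b."""
    return float(R[a] @ b)

# Usage: start from a uniform belief, take action 0, observe y = 1, update.
b = np.array([0.5, 0.5])
b = belief_update(b, a=0, y=1)
print(b, expected_reward(b, a=0))
```

Planning and learning algorithms for POMDPs, including those listed in the table below, typically operate on such beliefs (or other sufficient statistics of the history) rather than on raw observation sequences.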

Parameters

No parameters found.

Table of Algorithms

Name                       | Year | Time | Space | Approximation Factor | Model         | Reference
Hauskrecht                 | 2000 |      |       |                      | Deterministic |
Pineau, Gordon & Thrun     | 2003 |      |       |                      | Deterministic |
Braziunas & Boutilier      | 2004 |      |       |                      | Deterministic |
Poupart                    | 2005 |      |       |                      | Deterministic |
Smith & Simmons            | 2005 |      |       |                      | Deterministic |
Spaan & Vlassis            | 2005 |      |       |                      | Deterministic |
Satia & Lave               | 1973 |      |       |                      | Deterministic |
Washington                 | 1997 |      |       |                      | Deterministic |
Barto, Bradtke & Singh     | 1995 |      |       |                      | Deterministic |
Paquet, Tobin & Chaib-draa | 2005 |      |       |                      | Deterministic |
McAllester & Singh         | 1999 |      |       |                      | Deterministic |
Bertsekas & Castanon       | 1999 |      |       |                      | Deterministic |
Shani, Brafman & Shimony   | 2005 |      |       |                      | Deterministic |