136 R. Schöbi and E. Chatzi

13.2 The POMDP Framework

The POMDP is essentially a methodology describing the interaction between an agent and a system. The agent is initialized at a state $s$. It then executes an action $a$, shifting the system state from $s$ to $s'$, which results in an instantaneous reward $r$ and an associated observation $o$. Based on the received reward and the observation, the agent chooses the next action. This cycle of taking actions and receiving rewards and observations (sequential decision making) continues until either the agent reaches a goal state or the lifespan of the agent is exhausted. In a partially observable framework, the agent does not have perfect knowledge of its state or of the system behavior. Uncertainty is involved in both the system observations and in the agent's actions.

In the mathematical framework, the elements of the system are summarized in the tuple $\{S, A, T, \Omega, O, R\}$. $S$ is the set of system states. $A$ is the set of discrete actions available to the agent. $T: S \times A \to \Pi(S)$ is the transition model, which describes the evolution of the system to a future state $s'$, depending on the current state $s$ and the agent's action $a$. Bayes' rule is applied for updating the system state using the probability $p(s' \mid s, a)$. $\Omega$ is a set of discrete observations. $O: S \times A \to \Pi(\Omega)$ is the observation model defining the probability of obtaining an observation $o$ when the system is in state $s$; this probability is described by $p(o \mid s)$. $R$ is the reward function, which outputs a real cost/reward value $r_a(s) \in \mathbb{R}$. The reward can be modeled as dependent upon the current state, the agent's action, or a combination of both.

The agent's knowledge about its state is represented by the belief state $b(s)$, which is a probability distribution over the state space. Once the system evolves due to the agent's action, the belief state is updated using the previous belief state, the executed action, and the received observation.
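As a rough illustration of how the previous belief, the executed action, and the received observation combine into a new belief, the following minimal sketch works through one update for a small discrete model. The two-state model, the action labels, and all probabilities are hypothetical and chosen only for illustration; they are not taken from the paper. The observation model is assumed to depend only on the state reached after the action.

```python
# Hypothetical model with 2 states, 2 actions, 2 observations.
# All numbers are illustrative assumptions, not values from the paper.

# T[a][s][s_next] = p(s_next | s, a); each inner row sums to 1.
T = [
    [[0.9, 0.1], [0.0, 1.0]],  # action 0 (assumed label: "do nothing")
    [[1.0, 0.0], [0.8, 0.2]],  # action 1 (assumed label: "repair")
]

# O[s_next][o] = p(o | s_next); each row sums to 1.
O = [
    [0.8, 0.2],
    [0.3, 0.7],
]


def belief_update(b, a, o):
    """Return the belief after executing action a and receiving observation o."""
    n = len(b)
    # Prediction step: sum over s of p(s' | s, a) * b(s)
    predicted = [sum(T[a][s][s2] * b[s] for s in range(n)) for s2 in range(n)]
    # Correction step: weight each predicted state by p(o | s')
    unnormalized = [O[s2][o] * predicted[s2] for s2 in range(n)]
    # Normalizing constant, i.e. the probability of o given belief and action
    evidence = sum(unnormalized)
    if evidence == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return [u / evidence for u in unnormalized]


b0 = [0.5, 0.5]                     # uniform prior belief
b1 = belief_update(b0, a=0, o=1)    # act, observe, update
print(b1)
```

In this sketch, observation 1 is more likely in state 1 than in state 0, so the updated belief shifts probability mass toward state 1 while still summing to one.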
The belief update is performed by employing Bayes' rule. In the discrete state space this is:

$$ b^{a,o}(s') = \frac{p(o \mid s')}{p(o \mid b, a)} \sum_{s \in S} p(s' \mid s, a)\, b(s) \tag{13.1} $$

Correspondingly, in the continuous state space:

$$ b^{a,o}(s') = \frac{p(o \mid s')}{p(o \mid b, a)} \int_{s} p(s' \mid s, a)\, b(s)\, \mathrm{d}s \tag{13.2} $$

This paper will focus on the continuous-state POMDP, which relies on the work of Porta et al. [13]. For the derivation of the corresponding discrete-state POMDP, the interested reader is referred to the literature [5, 13–16]. In the continuous state problem, the received reward at any time step is then given as:

$$ dQ = \int_{s} r(s, a)\, b(s)\, \mathrm{d}s \tag{13.3} $$

The total reward over the entire lifetime of the agent is then the sum of the rewards of each individual time step, and possibly a terminal reward:

$$ Q_{tot} = Q_{term} + \sum_{i=1}^{T} dQ_i \tag{13.4} $$

In the context of planning, many algorithms are available for solving the discrete-state POMDP [14, 16]. They are based on recursive formulas, determining the optimal actions for each belief state backwards in time. The term horizon (denoted by $n$) is defined as the number of time steps (decision periods) left until the end of the agent's life or, respectively, the end of the lifetime of an engineering structure. The computation thus starts at horizon 1 and goes up to the number of intended decision periods:

$$ V_n = H V_{n-1} \tag{13.5} $$

The idea of the Markov process is that each calculation in a chain depends only on the previous chain element rather than on all chain elements. Thus, a recursive algorithm to calculate the optimal strategies in each decision horizon is possible. The optimal strategy in the planning is based on the maximum future expected reward, which is defined as: