Control Engineering Practice (journal homepage: www.elsevier.com/locate/conengprac)

Optimal learning control of oxygen saturation using a policy iteration algorithm and a proof-of-concept in an interconnecting three-tank system

Anake Pomprapa, Steffen Leonhardt, Berno J.E. Misgeld
Philips Chair for Medical Information Technology, RWTH Aachen University, Aachen, Germany

Article history: received 30 November 2015; received in revised form 18 June 2016; accepted 25 July 2016.

Abstract: In this work, the "policy iteration algorithm" (PIA) is applied to the control of arterial oxygen saturation without requiring a mathematical model of the plant. The technique is based on nonlinear optimal control and solves the Hamilton–Jacobi–Bellman equation. The controller is synthesized in a state feedback configuration on top of an unidentified model of the complex pathophysiology of the pulmonary system, in order to control gas exchange in ventilated patients: under some circumstances (such as emergency situations), a proper, individualized model for designing and tuning controllers may not be available in time. The simulation results demonstrate optimal control of oxygenation by the proposed PIA, which iteratively evaluates the Hamiltonian cost function and synthesizes the control action until the criteria for the converged optimum are met. Furthermore, as a practical example, we examine the performance of this control strategy on an interconnecting three-tank system as a real nonlinear system.

Keywords: policy iteration algorithm; optimal control; reinforcement learning; control of oxygen saturation; biomedical control system; closed-loop ventilation

© 2016 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.conengprac.2016.07.014

1. Introduction

Hypoxia, or oxygen deficiency, is a common consequence of respiratory insufficiency.
If a patient with hypoxia is not properly treated in time, the prolonged state of impaired oxygenation can lead to potentially lethal conditions, such as cerebral hypoxia, cardiac malfunction, or multiple organ failure. The most effective therapy is to supply the patient with an increased oxygen fraction in the ventilated air, which is referred to as oxygen therapy (Tarpy & Celli, 1995). A suitable control output for this therapy is a measure of the oxygen content in the body, such as peripheral oxygen saturation (SpO2), arterial oxygen saturation (SaO2), or the arterial partial pressure of oxygen (PaO2), which can be used in a negative feedback configuration. A single-input single-output (SISO) system can thus be formulated, and closed-loop control of oxygen therapy can be realized in clinical scenarios, which are unique, complex, and life-threatening.

To our knowledge, closed-loop ventilation for oxygen therapy developed gradually (Brunner, 2002) and was dependent on progress in control engineering. Early publications date back to 1975: that paper describes the control of SpO2 (measured at the ear) with 'bang–bang' control of the input fraction of inspired oxygen (FiO2), with limit cycles observed in anesthetized dogs (Mitamura, Mikami, & Yamamoto, 1975). In 1985, a linear quadratic regulator (LQR) was shown for the dual control of oxygen and carbon dioxide (Giard, Bertrand, Robert, & Pernier, 1985). In 1987, multiple-model adaptive control (MMAC) was applied to the control of SpO2 in mongrel dogs and the results were compared with a proportional–integral (PI) controller (Yu et al., 1987). In 1991, the multivariable inputs of FiO2 and positive end-expiratory pressure (PEEP) were applied to control PaO2 using a proportional–integral–derivative (PID) controller in four mongrel dogs (East, Tolle, McJames, Farrell, & Brunner, 1991).

Corresponding author: A. Pomprapa, [email protected].
In 2004, a nonlinear adaptive neuro-fuzzy inference system (ANFIS) model was used to estimate shunt in combination with dynamic changes of blood gases for controlling PaO2 (Kwok, Linkens, Mahfouf, & Mills, 2004). Further simulation of septic patients was carried out based on a hybrid knowledge/model-based advisory control. Recently, feedback-oriented oxygen therapy was presented in preterm infants, where the controlled variable was SpO2 (Claure & Bancalari, 2009). In addition, based on the work of our research group, a proportional–integral (PI) controller with gain scheduling (Walter et al., 2009), a Smith predictor with an internal PI controller (Lüpschen, Zhu, & Leonhardt, 2009), a knowledge-based controller (Pomprapa, Misgeld, Lachmann, & Leonhardt, 2013), and, potentially, a self-tuning adaptive controller (Pomprapa, Pikkemaat, Lüpschen, Lachmann, & Leonhardt, 2010) and a funnel controller (Pomprapa, Alfocea, Göbel, Misgeld, & Leonhardt, 2014) can be used for this particular control problem. Moreover, in 2014, SpO2 between 92% and 94% was targeted through automatic adjustment of FiO2 using a step-change approach (Saihi et al., 2014).

Based on a literature search, three different categories of control system design were found; their advantages and disadvantages are given below. A model-based approach is the traditional design procedure in the control community. Generally, it requires a mathematical description of the system behavior.
The difficulty of this approach lies in selecting an appropriate model structure for this particular problem and in dealing with model uncertainties arising from varying conditions of the system. Therefore, self-tuning adaptive control was introduced for this control problem (Pomprapa et al., 2010). With a model-based approach, system performance and further corrective actions can be analyzed in advance.

A model-free approach requires neither prior knowledge of the system nor system identification, which is a major advantage. The time-consuming process of determining an individual model can therefore be avoided, which is certainly beneficial for a patient with severe hypoxia, who requires immediate therapeutic action. Hence, this approach is advantageous in real clinical scenarios. Proposed control techniques of this type include a knowledge-based controller (Pomprapa, Misgeld, Lachmann et al., 2013) and a funnel controller (Pomprapa, Alfocea, et al., 2014). Nevertheless, the drawback is that experience is required to reach good performance.

An intelligent hybrid approach combining model-based and model-free methods (Kwok et al., 2004) uses a knowledge- and model-based advisory system for intensive care ventilation; it shares the strengths and weaknesses of both the model-based and the model-free approach. Its design was based on a neuro-fuzzy system (ANFIS), which is intended to closely resemble human reasoning. However, neither animal experiments nor clinical results have yet been reported.

For best patient benefit, we employ a model-free approach in this work. Therefore, we examine the application of a generally known control strategy to the control of oxygenation based on SaO2 measurement.
For the first time (first results on this new approach have been prepublished in Pomprapa, Mir Wais, Walter, Misgeld, & Leonhardt, 2015), we propose to use a reinforcement learning optimal control method, called the policy iteration algorithm (PIA) (Bhasin et al., 2013; Vamvoudakis & Lewis, 2010; Vrabie, Pastravanu, Abu-Khalaf, & Lewis, 2009), for this particular application. This control strategy should provide an optimal solution not only for treating hypoxia, but also for avoiding hyperoxia (excess oxygen in the lungs) (Clark, 1974; Kallet & Matthay, 2013; Mach, Thimmesch, & Pierce, 2011). The PIA is classified as reinforcement learning within the actor–critic architecture framework, which conventionally learns behavior through interactions with a dynamic environment; a survey of the development of reinforcement learning can be found in Kaelbling, Littman, and Moore (1996). This approach improves the effectiveness of learning by using a policy gradient for the stochastic search for an optimal result. The technique is based on a recursive two-step iteration, namely policy improvement and policy evaluation. These steps are repeated until the stopping criteria for the converged optimal solution are achieved (Vamvoudakis & Lewis, 2010; Vrabie et al., 2009). Practical implementations of PIA control in other areas include optimal load-frequency control in a power plant (Wang, Zhou, & Wen, 1993), a DC–DC converter (Wernrud, 2007), adaptive steering control of a tanker ship (Xu, Hu, & Lu, 2007), and a hybrid system of a jumping robot (Suda & Yamakita, 2013). This motivates the application of this modern control strategy to biomedical applications and, specifically, to closed-loop ventilation therapy. Computer simulations are used to investigate the feasibility of this controller in oxygen therapy.
Subsequently, a proof-of-concept is implemented in a real application, a nonlinear interconnecting three-tank system, in order to control the water level in the last tank. In fact, the dynamics of this three-tank system resemble those of a patient with respiratory deficiency, i.e. nonlinear and with time-delay behavior.

In this article, Sections 2.1 and 2.2 provide the statement of the medical perspective and the formulation of the optimal control problem. Section 3 presents a control system design using the PIA for nonlinear optimal control based on the Hamilton–Jacobi–Bellman (HJB) equation. The system identification of the cardiopulmonary system, which yields a mathematical model for the subsequent design of the PIA controllers, and a simulation of this control strategy are presented in Sections 4 and 5, respectively. A practical example of this controller is presented in Section 6, controlling the water level in a nonlinear interconnecting three-tank system. Section 7 presents a discussion, and our conclusions are given in Section 8.

2. Problem formulation

2.1. Statement of the medical perspective

The overall transfer function from the setting of the inspired oxygen fraction FiO2 to the oxygen saturation SaO2 measured in arterial blood depends on many factors, including lung function and blood transport (which in turn depends on cardiac output, state of circulation, etc.). Let G(s) = SaO2(s)/FiO2(s) be the transfer function describing the system under investigation. G(s) is certainly nonlinear (e.g. due to the nonlinear saturation function, see West, 2011) and depends on many unknown and time-variant conditions (e.g. fluid status, heart function, or various diseases). Often, for the individual patient requiring artificial ventilation, this model is not available.
Furthermore, as there usually is no time for individual model selection and parameter identification, this has motivated us to investigate the performance of control algorithms that do not require explicit models. As an illustrative example, and in order to roughly understand the dynamics, we provide the transfer function G(s) of a female domestic pig weighing 34 kg, which we obtained during a laboratory experiment (Lüpschen et al., 2009). In this special case, respiratory distress was induced by repeated lung lavage (ARDS model, acute respiratory distress syndrome; Ashbaugh, Bigelow, Petty, & Levine, 1967; Lachmann, Robertson, & Vogel, 1980).

2.2. Optimal control problem

We consider a specific class of nonlinear systems, namely affine nonlinear systems of the following form:

ẋ(t) = f(x) + g(x)·u(t),  (1)

where x(t) ∈ χ ⊆ ℝⁿ denotes the state vector of dimension n, u(t) ∈ υ ⊆ ℝ represents the control input (FiO2), and f: χ × υ → ℝⁿ is Lipschitz continuous on χ × υ, such that the state trajectory x(t) is unique for a given initial condition x0. Initially, it is assumed that the system is stabilizable. This type of model has been used to describe the dynamics of various plants, for example a robot manipulator (Sun, Sun, Li, & Zhang, 1999), a continuous stirred-tank reactor (CSTR) (Kamalabady & Salahshoor, 2009) and a non-interconnecting three-tank system (Orani, Pisano, Franceschelli, Giua, & Usai, 2011).

Let us consider the cost function V(x) given by Eq. (2), which is to be minimized:

V(x) = ∫₀^{T∞} r(x(τ), u(τ)) dτ,  (2)

where r(x, u) = Q(x) + uᵀRu. Here Q(x) ∈ ℝ is positive definite and continuously differentiable (i.e. Q(x) = 0 if x = 0, and Q(x) > 0 for all x ≠ 0), R ∈ ℝ represents a positive-definite penalty or weighting of the control input, and T∞ denotes a time at which the states are sufficiently close to zero.

The Hamiltonian of this control problem can be defined as

H(x, u, ∇ₓV*) = r(x(t), u(t)) + ∇ₓV*ᵀ(f(x(t)) + g(x(t))·u(t)),  (3)

where the gradient ∇ₓV* satisfies the following HJB equation, with ∇ₓ symbolizing the partial derivative with respect to x:

min_u H(x, u, ∇ₓV*) = 0.  (4)

The optimal solution is provided by

u* = −(1/2)·R⁻¹·gᵀ(x)·∇ₓV*(x),  (5)

which is an algorithm for policy improvement (Ohtake & Yamakita, 2010). For a linear continuous-time system, the HJB equation reduces to the well-known Riccati equation (Vamvoudakis, Vrabie, & Lewis, 2009), and the converged solution is equivalent to the result of a linear–quadratic regulator (LQR) controller. If the optimal control policy (Eq. (5)) is substituted into the nonlinear Lyapunov equation, the HJB equation can be derived in terms of the Hamiltonian cost function (Vamvoudakis & Lewis, 2010). For this particular control problem, we insert the optimal solution of Eq. (5) into Eq. (3) under the condition provided in Eq. (4). This yields

0 = Q(x) + u*ᵀRu* + ∇ₓV*ᵀ(x)·(f(x(t)) + g(x(t))·u*).  (6)

This equation can be rewritten as

0 = Q(x) + ∇ₓV*ᵀ(x)·f(x) − (1/4)·∇ₓV*ᵀ(x)·g(x)·R⁻¹·gᵀ(x)·∇ₓV*(x),  (7)

where V*(0) = 0; this is a formulation of the HJB equation. The aim of the optimal control implementation is to solve the HJB equation (Eq. (7)) in order to find the Hamiltonian cost function and substitute it into Eq. (5) for the control policy. However, given the nonlinear nature of the problem, it is generally difficult to find the solution of the HJB equation. An iterative process is therefore introduced for solving this optimal control problem (Vamvoudakis & Lewis, 2010; Vrabie et al., 2009). In this article, the policy iteration algorithm is applied to the optimal control of oxygen saturation based on the animal model of induced acute respiratory distress syndrome (ARDS), as described in the subsequent sections.

3. Policy iteration algorithm

The evaluation function of Eq. (2) can be rewritten in the following form:

Ṽ(x(t)) = ∫ₜ^{t+T} r(x(τ), u(τ)) dτ + Ṽ(x(t+T)),  (8)

where Ṽ(0) = 0 and T represents the sampling time used in the evaluation process. Hence, Ṽ(x(t)) can be solved in the time domain using Eq. (8), and the control signal u* can be updated based on Eq. (5). Subsequently, the function Ṽ(x(t)) can be approximated by either a linear or a nonlinear function of x(t), so that all unknown parameters can be computed using a simple least-squares algorithm, a gradient-descent algorithm, a recursive least-squares algorithm (Vrabie et al., 2009), or a neural network (Bhasin et al., 2013) for parameter estimation. The proof of stability and guaranteed convergence of the control policy has been provided by Vrabie et al. (2009) for a linear system and by Vamvoudakis and Lewis (2010) for a nonlinear system, under the assumptions that the system is stabilizable in the linear case and that the solution of the nonlinear Lyapunov equation is smooth and positive definite in the nonlinear case.

The block diagram of the PIA is shown in Fig. 1, with an internal actor and critic subsystem in a state feedback architecture.

Fig. 1. Block diagram of the policy iteration algorithm with full state feedback structure.

Based on Fig. 2, the procedure for PIA implementation is as follows:

1. Assign an initial controller that can stabilize the system.
2. Compute the evaluation function Ṽ(x(t)) numerically in the time domain, with Ṽ(0) = 0.
3. Based on the response of Ṽ(x(t)) in the previous step, estimate all unknown parameters in the evaluation function using the least-squares algorithm (Vrabie et al., 2009) or another parameter estimation technique.
4. Compare the norm of the difference between the updated and the previously estimated parameters against a predefined value ϵ (convergence test). The PIA is terminated if the norm criterion is satisfied, i.e. the estimated parameters have converged. Otherwise, update the controller based on Eq. (5) for policy improvement and return to Step 2 to compute the evaluation function Ṽ(x(t)) for further policy evaluation.

Fig. 2. Flowchart for the implementation of the policy iteration algorithm.

A flowchart of this algorithm is presented in Fig. 2. A prerequisite for this algorithm is that an initial controller that stabilizes the system exists. In practice, a simple proportional–integral (PI) controller can be used as the controller for the first evaluation of the system performance. The PIA then optimizes the Hamiltonian cost function at each iteration in order to achieve an extremal cost for the underlying nonlinear system. Using these steps, an online implementation can be realized, leading to an optimal controller for the nonlinear system. It is worth noting that the iteration of policy evaluation and policy improvement must be repeated until the estimated unknown parameters (ω) converge to the optimal result. Therefore, a repeated control implementation must remain active in order to update the new control policy and compute further values of the cost function, which is a basic characteristic of this control scheme.
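The two-step iteration above becomes especially transparent in the linear special case, where the evaluation function is quadratic, Ṽ(x) = xᵀPx, policy evaluation reduces to a Lyapunov equation, and policy improvement reduces to a gain update. The following is a minimal sketch of this discrete-time analogue (a Hewer-style iteration); the plant matrices A and B are purely illustrative and are not the identified porcine model:

```python
import numpy as np

# Illustrative discrete-time linear plant; NOT the identified porcine model.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)           # state penalty, Q(x) = x'Qx
R = np.array([[1.0]])   # control input penalty

def policy_evaluation(K):
    """Cost matrix P of the policy u = -Kx, i.e. the solution of the
    Lyapunov equation P = Q + K'RK + (A - BK)' P (A - BK)."""
    Acl = A - B @ K
    Qk = Q + K.T @ R @ K
    n = A.shape[0]
    # vectorized linear solve of the discrete Lyapunov equation
    M = np.eye(n * n) - np.kron(Acl.T, Acl.T)
    return np.linalg.solve(M, Qk.reshape(-1)).reshape(n, n)

def policy_improvement(P):
    """Greedy gain for the evaluated cost: K = (R + B'PB)^{-1} B'PA."""
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

K = np.zeros((1, 2))    # initial stabilizing policy (A itself is stable here)
for i in range(20):
    P = policy_evaluation(K)        # Steps 2-3: evaluate the current policy
    K_new = policy_improvement(P)   # policy improvement, cf. Eq. (5)
    if np.linalg.norm(K_new - K) < 1e-9:   # Step 4: convergence test
        break
    K = K_new
```

On convergence, K is the LQR gain and P solves the corresponding Riccati equation, mirroring the statement above that the converged PIA solution coincides with the LQR result in the linear case.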
The details of the complete algorithm can be found in Vrabie et al. (2009).

4. System identification of the oxygenation model

A model of the cardiopulmonary system is identified using a large-animal model of induced acute lung injury. The mathematical model was identified based on measurements from a female domestic pig weighing 34 kg. The animal received premedication and was then anesthetized and tracheotomized in supine position. A catheter for measurement of SaO2 (CeVOX, Pulsion Medical System SE, Feldkirchen, Germany) was inserted into the carotid artery. Under the condition of a stable ventilatory response, the animal underwent mechanical ventilation (SERVO 300, Maquet Critical Care AB, Solna, Sweden). The system configuration is shown in Fig. 3; a medical panel PC (POC-153, Advantech Co., Ltd, Taipei, Taiwan) was used for data acquisition. During baseline ventilation of the animal, acute respiratory distress syndrome (ARDS) was induced by repetitive lung lavage to wash out surfactant (Lachmann et al., 1980). Warm isotonic saline solution (0.9%) was used during the entire process of ARDS induction. All animal procedures were approved by the local ethical authorities. Surfactant depletion results in increased alveolar surface tension, which facilitates alveolar collapse or partial lung collapse (Ballard-Croft, Wang, Sumpter, & Zhou, 2012).

To implement system identification, multi-step FiO2 settings were applied to extract the underlying porcine dynamics during induced ARDS. The measured input–output relationship is presented in Fig. 4. We separated the data into two sets: the first, containing 75% of all measured data, was used for system identification, while the second was used for validation. System identification was initially carried out under the assumption of a first-order model with time delay (Lüpschen et al., 2009).
After estimating the unknown parameters and checking against the validation data set, the following plant transfer function was derived at a certain operating point (FiO2 = 0.59 and SaO2 = 93.3%). This transfer function, given in Eq. (9), incorporates the dynamics of the mechanical ventilator, the porcine dynamics, and the dynamics of the SaO2 sensor:

G(s) = SaO2(s)/FiO2(s) = 0.153/(17.29s + 1) · e^(−11s)  (9)

Fig. 4. Input–output response for system identification based on a porcine model of induced ARDS (Lüpschen et al., 2009).

It should be noted that a healthy subject normally has SaO2 > 97% in ambient air at standard altitude air pressure. A state of hypoxia can be observed in Fig. 4. The FiO2 value was set higher than that of ambient air (FiO2 = 0.21) in order to maintain the output SaO2 in the range of 85–100%. For this particular system, a time delay of 11 s is identified. For the subsequent control system design, the Padé approximation technique was applied to approximate the delay. This is in fact a nonlinear system, but we operate only in a small region and can thus roughly approximate the system by a linear model. The plant transfer function is simplified as follows:

G(s) = 0.153/(17.29s + 1) · (−s³ + 1.091s² − 0.4959s + 0.09016)/(s³ + 1.091s² + 0.4959s + 0.09016)  (10)

This mathematical model is used for the design of the PIA controller for controlling SaO2. Closed-loop ventilation is formed based on oxygen therapy in order to optimally adjust the controlling input FiO2 for this hypoxic cardiopulmonary system. The frequency responses of both the system identification data set and the validation data set are provided in Fig. 5. On the system identification data set, the model represents the measured data well in the least-squares sense of the parameter estimation. Alternative models up to higher order have been studied, with model evaluation based on the mean squared error (MSE). A summary of the model evaluation is given in Table B1.
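The delay polynomial in Eq. (10) can be checked against the closed-form coefficients of the third-order Padé approximant of e^(−Ls), which, after normalizing to a monic denominator, are 12/L, 60/L² and 120/L³. A short sketch, assuming only the identified delay L = 11 s:

```python
# Third-order Pade approximation of the transport delay e^{-Ls}:
#   e^{-Ls} ~ (-s^3 + a2*s^2 - a1*s + a0) / (s^3 + a2*s^2 + a1*s + a0),
# where, after normalizing to a monic denominator,
#   a2 = 12/L, a1 = 60/L^2, a0 = 120/L^3.
L = 11.0  # identified time delay in seconds
a2, a1, a0 = 12 / L, 60 / L**2, 120 / L**3
print(round(a2, 3), round(a1, 4), round(a0, 5))  # 1.091 0.4959 0.09016
```

The printed values reproduce the numerator and denominator coefficients of Eq. (10).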
According to the results given in Table B1 in the appendix, the first-order model with time delay based on the Padé approximation (Eq. (10)) gives the best prediction in terms of MSE on the validation data set. Therefore, this model will be used for the subsequent simulation of closed-loop oxygen therapy.

5. Simulation of the closed-loop oxygen therapy

Fig. 3. Block diagram of an oxygen therapy system based on a porcine model of induced ARDS.

Based on the identified model of Eq. (10), the derived transfer function serves as an 'unknown' model of the pathophysiology of the pulmonary system for this simulation of closed-loop oxygen therapy; it replaces the 'System' block shown previously in Fig. 3. The full procedure for implementing the PIA as given in the flowchart is then carried out with an initial proportional–integral (PI) controller with parameters Kp = 1 and Ki = 0.5 in order to stabilize the system.

Fig. 5. Comparison of the Bode diagrams of the identified system model and the measured data set.

Based on this predetermined control structure, the simulation can be set up, and all system states and the control input u(t) are recorded with a fixed sampling time of 100 ms. A cost function V(x) based on Eq. (2) can be computed by setting Q(x) = I and R = 1, regardless of the system model. The evaluation function Ṽ(x) can thereafter be calculated based on Eq. (8). Subsequently, ∇ₓV* can be computed for a control policy, as given in Eq. (5). The synthesized control action is iteratively implemented until the converged optimal solution is achieved.
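The policy evaluation step amounts to a least-squares regression based on Eq. (8): over each sampling interval the accumulated running cost must equal the decrease of the parameterized evaluation function. A minimal sketch with a quadratic basis in four states; the helper names (`quad_basis`, `estimate_weights`) and the synthetic, contracting state samples are illustrative, not the recorded experimental data:

```python
import numpy as np

def quad_basis(x):
    """Ten quadratic basis functions: all monomials x_i*x_j, 1 <= i <= j <= 4."""
    return np.array([x[i] * x[j] for i in range(4) for j in range(i, 4)])

def estimate_weights(x_t, x_tT, rho):
    """Least-squares solution of Eq. (8) in parametric form:
    w'(phi(x(t)) - phi(x(t+T))) = rho(t), where rho(t) is the running cost
    accumulated over the interval [t, t+T]."""
    Phi = np.vstack([quad_basis(a) - quad_basis(b) for a, b in zip(x_t, x_tT)])
    w, *_ = np.linalg.lstsq(Phi, rho, rcond=None)
    return w

# Synthetic consistency check with known weights (illustrative only)
rng = np.random.default_rng(1)
w_true = rng.uniform(1.0, 5.0, size=10)
x_t = rng.normal(size=(200, 4))
x_tT = 0.8 * x_t        # states contracting toward zero between samples
rho = np.array([w_true @ (quad_basis(a) - quad_basis(b))
                for a, b in zip(x_t, x_tT)])
w_hat = estimate_weights(x_t, x_tT, rho)
```

With noiseless synthetic data the regression recovers the generating weights exactly, which is the property the convergence test of the PIA relies on.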
For the control system design of the PIA, the evaluation function Ṽ(x) is set up to approximate Ṽ(x(t)) based on four states, namely x1, x2, x3, and x4, where the time delay is represented by a third-order Padé approximation. It is defined as follows:

Ṽ(x1, x2, x3, x4) = w1x1² + w2x1x2 + w3x1x3 + w4x1x4 + w5x2² + w6x2x3 + w7x2x4 + w8x3² + w9x3x4 + w10x4²,  (11)

where {wi : i ∈ I and i ≤ 10} is the set of unknown parameters, which multiply linearly independent basis functions. This equation is the simplest parametric form, a quadratic surface in the four states, for fitting the evaluation function, and it requires further mathematical manipulation for a policy update. A higher-order surface could alternatively be applied for a better fit, but this was not pursued due to the potential danger of overfitting.

Based on the animal model of induced acute lung injury provided in Eq. (9) (Lüpschen et al., 2009), we assumed that all internal states are accessible for the state feedback configuration of the PIA controller. This is a drawback: if the system states are not physically accessible, the additional design of a state observer is required (Lu & Ho, 2014). With this strong assumption for simulation, the stable converged solution of wi at the third iteration, using the initial PI controller, is w1 = 6.82, w2 = 8.45, w3 = 14.47, w4 = 8.35, w5 = 10.52, w6 = 18.30, w7 = 9.63, w8 = 14.18, w9 = 15.81, and w10 = 14.32. The evolution of wi is given in the Appendix (Fig. A1).

Fig. 6 shows the control performance of a step response (yref(t) = 1) at the initial iteration, with a stable result for the control of oxygenation. The control performance at the third iteration, with the optimal result for the control of oxygenation, is shown in Fig. 7: all states are driven to zero with a settling time of about 60 s, compared to about 600 s for the initial PI controller, as shown in Fig. 6.

Fig. 6.
Control performance at an initial iteration for the control of oxygenation.

Fig. 7. Comparative control performance of the policy iteration algorithm for the control of oxygenation.

This simulation shows non-minimum-phase behavior at the onset of control, where the output y(t), i.e. SaO2, initially responds in the reverse direction. This is due to the delay, which introduces a positive zero after implementation of the Padé approximation. Furthermore, the optimal response is achieved after application of the PIA at the third iteration, in terms of both the cost function and an improved settling behavior. From a practical perspective, in a real patient the step reference must be applied iteratively, multiple times, for policy improvement and policy evaluation until the stopping criteria for optimal control are met. In general, a patient with acute lung injury has poor oxygenation (SaO2 < 88%), and the oxygenation goal of standard ARDSNet ventilation therapy (Pomprapa, Schwaiberger et al., 2014) is to maintain SaO2 = 88–95%. Therefore, there are plenty of SaO2 targets available for a step reference during a treatment that is compliant with this standardized ventilation protocol.

6. Practical application

To implement the PIA controller in a real application, an interconnecting three-tank system was set up with the configuration shown in Fig. 8. Pressure sensors (RPUP001G2A, Sensortechnics GmbH, Germany) were installed at the bottom of the tanks to estimate the height of the fluid. A submersible pump (Barwig 0111, Barwig Wasserversorgung, Germany) was used as an actuator to draw the fluid from a reservoir into Tank 1 with a maximum flow of 18 l/min.
Due to the interconnection structure, fluid flows from Tank 1 to Tank 2 and Tank 3 successively. The fluid in Tank 3 is then drained back to the reservoir. The overall system is operated under a dSPACE platform (MicroAutoBox II, dSPACE GmbH, Germany) with three sensor inputs corresponding to the height of the fluid in each tank (x1, x2, x3) and one controllable current output to the pump, associated with the adjustable flow Q1 into Tank 1. A closed-loop system was formed with all accessible states, namely the heights of the fluid in the tanks (x1, x2, x3). The aim was to control the height x3 of Tank 3 with the adjustable flow Q1 as the controlling input to the system, where A0 represents the cross-sectional area of each tank. All states are used for the synthesis of the controller using the PIA for this nonlinear interconnecting three-tank system.

The three-tank system is proposed here as a practical example due to its equivalence to the blood circulation system in the human body. The submersible pump represents the heart as an energy source that supplies fluid, or blood, to the whole system. The pump is regulated by the dSPACE control unit, which takes the role of the brain. Tank 1, Tank 2 and Tank 3 symbolize the blood stored in tissue, the venous blood store and the arterial blood store, respectively (Kron, Böhm, & Leonhardt, 1999). This particular system has the special characteristic that all internal states, in terms of fluid levels, can be fully accessed using low-cost pressure sensors. Therefore, the PIA is applicable to this particular example. Nevertheless, height control in a three-tank model is much simpler than control of gas concentrations, and this particular model cannot fully represent or simulate the cardiopulmonary system under acute lung injury, which is unique. A real patient-in-the-loop system should be set up to test the performance of this controller.
Due to the long process for obtaining new ethical approval and a limited budget, we cannot at present evaluate this control algorithm in an animal trial. Instead, the interconnecting three-tank system is used as a real nonlinear system in order to demonstrate the tracking performance and the optimal solution of this control algorithm in a practical setup. The nonlinear system can be modeled using a mass balance equation for each tank, which yields the following relationship:

A0·(dx1/dt) = −Q12 + Q1
A0·(dx2/dt) = −Q23 + Q12
A0·(dx3/dt) = −Q30 + Q23,   (12)

where Q12, Q23, and Q30 represent the fluid flow between Tank 1 and Tank 2, the fluid flow between Tank 2 and Tank 3, and the fluid flow from Tank 3 back to the reservoir, respectively.

Fig. 8. System configuration and system realization for computerized control of the water level in the interconnecting three-tank system as a simulation of the blood circulation system.

Please cite this article as: Pomprapa, A., et al. Optimal learning control of oxygen saturation using a policy iteration algorithm and a proof-of-concept in an interconnecting three-tank system. Control Engineering Practice (2016), http://dx.doi.org/10.1016/j.conengprac.2016.07.014

With the assumption that x1 ≥ x2 ≥ x3, these equations can be rewritten as

A0·(dx1/dt) = −A12·√(2·g0·|x1 − x2|) + Q1
A0·(dx2/dt) = A12·√(2·g0·|x1 − x2|) − A23·√(2·g0·|x2 − x3|)
A0·(dx3/dt) = A23·√(2·g0·|x2 − x3|) − A30·√(2·g0·x3),   (13)

where g0 denotes the gravitational acceleration (= 981 cm/s²). They can be simplified to the state model

ẋ1 = f1(x1, x2) + Q1
ẋ2 = f2(x1, x2, x3)
ẋ3 = f3(x2, x3).   (14)

Based on this mathematical model, the system has a nonlinear coupling behavior. For the actual implementation, all initial states are zero (empty tanks) and neither leakage nor additional fluid input into any tank is allowed. For this application of the PIA controller, it is worth noting that no mathematical model of the nonlinear system is needed to control the height x3 of Tank 3; only the state information and the computed evaluation function are required at every iteration, under the assumption that the system is stabilizable. To control x3 = 10 cm using the PIA, the following evaluation function was used, with the three states corresponding to the heights of Tanks 1, 2 and 3, respectively:

V(x1, x2, x3) = w1·x1² + w2·x1·x2 + w3·x1·x3 + w4·x2² + w5·x2·x3 + w6·x3².   (15)

The updated control policy had the form

u* = −(1/2)·R⁻¹·gᵀ(x)·(∂V(x)/∂x) + u0,   (16)

where u0 represents the controlling input at the operating point. The control performance with an optimal result is presented in Fig. 9 for a sampling time of 100 ms, showing that the control objective of x3 = 10 cm was fulfilled with minimal knowledge of the system. A PID controller was used for the initial implementation in order to collect information from the system for policy evaluation. Thereafter, a new controller was designed based on Eq. (16) and applied to the system as a policy iteration. In the first iteration, the system reached steady state at t = 130 s, while the settling time improved significantly to 100 s in the second iteration. This shows that a PIA controller is realizable in actual practice and that results can be achieved which minimize the evaluation function V(x) in each iteration using the computed policy update. Consequently, all states return to their final values more quickly and the control target is reached faster over the series of iteration episodes. In this real application, u0 was introduced to reach the target reference point of x3 = 10 cm; the settling time was thereby remarkably improved for this particular interconnecting three-tank system. Therefore, a real implementation for oxygenation control should be feasible with this quasi non-identifier-based approach, which can provide a practical solution for treating patients with respiratory failure by means of closed-loop oxygen therapy.

7. Discussion

A number of potential controllers can be used for the control of oxygenation. In general, the settling time was approximately 300 s using a PI controller and 250 s using a Smith predictor for the control of oxygenation in an extracorporeal membrane oxygenator (ECMO) (Walter et al., 2009). Therefore, one of these controllers can be applied for the initial implementation in a PIA; such control strategies fulfill the assumed requirement of an initial controller that stabilizes the system. A PIA controller should then be realizable in actual clinical practice for optimal control of oxygenation with a settling time of less than 250 s, which should be of considerable benefit to patients suffering from respiratory insufficiency. In patients with ARDS, an acceptable range for SaO2 is 88–95% based on treatment using the ARDSNet protocol (ARDSNet, 2000; Pomprapa, Schwaiberger et al., 2014). In general, when looking at input–output dynamics and considering identifiability, a good choice for the sampling time is to set T equal to 1/10 or even 1/20 of the fastest time constant that needs to be identified (Isermann & Münchhof, 2010). When considering the process dynamics given in Eq. (9), this leads to a sampling time T < 1.7 s. Considering the repetitive nature of both natural and artificial respiration, another influencing factor is the respiratory rate (RR). In fact, the alveolar gas transport is modulated by the inhalation and exhalation process. Hence, higher frequency components of the respiratory rate will, at least theoretically, also be visible in the actual gas transport rates across the alveolar membrane.
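The policy evaluation and policy improvement cycle used throughout this paper (the quadratic value function of Eq. (15) fitted by least squares, followed by the policy update of Eq. (16)) can be sketched on a scalar linear-quadratic example for which the HJB equation reduces to a Riccati equation with the known solution w = √2 − 1. This is an illustrative sketch, not the paper's implementation: the plant parameters (a, b), the costs (q, R) and the sampling interval T are assumed values; as in Eq. (16), the input gain b is taken as known, while the drift a is only used to generate measurement data.

```python
a, b = -1.0, 1.0          # plant x' = a*x + b*u (assumed; the learner sees only data)
q, R = 1.0, 1.0           # stage cost q*x^2 + R*u^2
dt, T = 0.001, 0.05       # integration step and data-sampling interval [s]

def rollout(k, x0=1.0, n_intervals=150):
    """Simulate the closed loop u = -k*x and return, per sampling interval,
    (state at start, state at end, integrated stage cost)."""
    x, data = x0, []
    for _ in range(n_intervals):
        x_start, cost = x, 0.0
        for _ in range(int(T / dt)):
            u = -k * x
            cost += (q * x * x + R * u * u) * dt
            x += (a * x + b * u) * dt          # plant acts as a black box here
        data.append((x_start, x, cost))
    return data

k = 0.5                                        # initial stabilizing gain (assumed given)
for _ in range(5):
    data = rollout(k)
    # Policy evaluation (critic): fit V(x) = w*x^2 by least squares from
    # w*(x_start^2 - x_end^2) = integrated cost over each interval.
    num = sum((xs * xs - xe * xe) * c for xs, xe, c in data)
    den = sum((xs * xs - xe * xe) ** 2 for xs, xe, c in data)
    w = num / den
    # Policy improvement (actor): u* = -(1/2) R^-1 b dV/dx = -(b*w/R) x.
    k = b * w / R
# The Riccati solution for these parameters is w = sqrt(2) - 1.
```

With these parameters the learned gain converges to the Riccati-optimal value within two to three iterations, mirroring the rapid iteration-to-iteration improvement reported for the three-tank experiment.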
As the breathing rates of ARDS patients are additionally limited by therapeutic protocols, it is wise to consider these practical boundary conditions as well. Specifically, the maximum respiratory rate is 35 bpm (= 0.58 Hz) for the well-known ARDSNet protocol (ARDSNet, 2000) and 60 bpm (= 1 Hz) for the open lung management (Pomprapa et al., 2015). Finally, SaO2 is a signal that is measured based on blood pulsations in the skin (Wukitsch, Petterson, & Pologe, 1988). Hence, theoretically, we also have to consider the heart rate (HR) and Shannon's sampling theorem, which states that if a continuous-time signal contains no frequency components higher than f Hz, a sampling time T < 1/(2f) s guarantees a perfect reconstruction of the original frequency content. However, in our application we used a commercial device (the CeVOX) for saturation measurements, which handles the pulsatile nature of the saturation signal internally by using appropriate sampling times. Hence, as we only consider SaO2 as a quasi-continuous function of FiO2, it seems sufficient to select the sampling time T = 0.1 s.

Fig. 9. Control performance of a PIA controller for an interconnecting three-tank system with a sampling time of 100 ms.

A 34-kg pig model of acute lung injury was induced by repeated lavage with 3–6 l of saline solution in order to remove pulmonary surfactant (anti-inflammatory molecules), leading to partial lung collapse and poor oxygenation. The pig was selected because its pulmonary system structurally resembles that of a human in terms of size, structure and the mechanism of gas exchange.
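The sampling-time bounds discussed above can be checked with a few lines of arithmetic. The dominant time constant is not stated explicitly here; the value below is inferred from the quoted bound T < 1.7 s together with the 1/10 rule of thumb, so it should be read as an assumption.

```python
# Sampling-time bounds discussed in the text: T < tau/10 for identification
# of the dominant time constant, and T < 1/(2f) (Shannon) for the fastest
# periodic component. tau = 17 s is inferred from the quoted bound T < 1.7 s.
tau = 17.0                  # dominant process time constant [s] (inferred)
rr_ardsnet_hz = 35 / 60     # 35 breaths/min (ARDSNet protocol)
rr_olm_hz = 60 / 60         # 60 breaths/min (open lung management)

t_ident = tau / 10                  # identification rule of thumb -> 1.7 s
t_shannon = 1 / (2 * rr_olm_hz)     # Shannon bound for the fastest RR -> 0.5 s

# The chosen sampling time T = 0.1 s satisfies both bounds.
assert 0.1 < t_shannon < t_ident
```

The chosen T = 0.1 s therefore leaves a comfortable margin with respect to both the identification rule and the Shannon bound for the respiratory-rate components.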
Other animal alternatives, such as sheep or monkeys, could be considered for experiments on this particular lung disease. A pig of this weight can be used to predict the dynamics of acute lung injury; for our specific application (i.e. oxygen therapy), a pig of that body weight is an established animal model that can serve as a patient model for a human weighing two to three times as much. Note that many physiological parameters, such as blood pressure, resting heart rate or SaO2, are quite similar in humans and pigs. There are, however, other physiological variables that scale with weight, such as the tidal volume, which is usually given in ml/kg. Nonetheless, an individual patient model is unique, and system identification is required in order to understand the dynamical system. Hence, it is of great advantage that a PIA controller can provide optimal control performance with minimal knowledge of the system model. In the present study, a first-order model with time delay was used for simulation and evaluation of the PIA. The system had a time delay of 11 s, reflecting the transport effects in a change of SaO2 in response to a change in FiO2; this delay also depends on the dynamics of the mechanical ventilator and of the subject, as well as on the positioning of the SaO2 catheter. In this case, the sensor was placed in the carotid artery (Lüpschen et al., 2009), which supplies oxygenated blood to the brain. If the sensor is positioned elsewhere (e.g. finger, ear or tongue), a different time delay results, which complicates the design of the control system. Therefore, with the PIA controller for the control of oxygenation, all states of the system are assumed to be accessible; in practice, however, a state observer for a time-delay system may need to be designed (Lu & Ho, 2014).
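The first-order-plus-dead-time behavior described above can be reproduced with a simple delay-line simulation. The 11 s delay is from the text; the static gain and time constant below are illustrative assumptions, since Eq. (9) is not reproduced in this section.

```python
# First-order-plus-dead-time response of SaO2 to a step in FiO2.
# The 11 s transport delay is from the text; gain K and time constant tau
# are assumed illustrative values, not the identified porcine parameters.
from collections import deque

K, tau = 0.9, 17.0          # static gain and time constant (assumed)
delay, dt = 11.0, 0.1       # dead time [s] and solver step [s]
n_delay = int(delay / dt)
buf = deque([0.0] * n_delay, maxlen=n_delay)   # fixed-length delay line

y, history = 0.0, []
for step in range(int(300 / dt)):
    u = 10.0 if step * dt >= 1.0 else 0.0      # +10 % FiO2 step at t = 1 s
    u_delayed = buf[0]                         # input from 11 s ago
    buf.append(u)
    y += dt * (K * u_delayed - y) / tau        # first-order lag
    history.append(y)
# The response stays at zero for the dead time, then rises toward K * 10 = 9
# with time constant tau.
```

The dead time is what dominates the achievable closed-loop speed here, which is why a different sensor position (and hence a different delay) changes the control design.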
In the present study, a simple least-squares algorithm was used as the parameter estimation technique; this provided satisfactory results in the simulative control of oxygenation and in the experimental three-tank system. The algorithm estimates the unknown parameters (wi) in the cost function and seeks the best parametric model to describe the system. In this particular application, no high computational costs are involved. However, the data collection period should be long enough to capture all information on the system dynamics, so that the cost function can be approximated more accurately. Other parameter estimation techniques can be applied as alternatives in a real implementation, e.g. the recursive least-squares algorithm for online implementation of the control algorithm (Ljung, 1999), the Kalman filter (Kalman, 1960), the Volterra series (Volterra, 1959), block-oriented structures (Pottmann & Pearson, 1998) or a neural network (Abu-Khalaf & Lewis, 2005) to approximate the Hamiltonian function and eventually solve the HJB equation. A partially model-free algorithm based on the PIA was effectively applied in the simulation of the first-order model with time delay of oxygenation using porcine dynamics. A gradual improvement of the system performance was achieved iteratively, and the criterion to stop the algorithm was a negligible change in the estimated parameters of the cost function. The algorithm therefore converges to an optimal solution. This approach is based not only on reinforcement learning with an actor–critic architecture, but also on optimal control for long-term evaluation of the system performance. In this simulation, a satisfactory result was attained after the third iteration, with no steady-state error and a settling time of 60 s, compared to the initial settling time of 600 s with the initial PI controller.
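The recursive least-squares alternative mentioned above can be sketched in plain Python. This is a generic RLS update for a model d = wᵀφ, not the paper's implementation; the regressor and the demonstration data below are synthetic.

```python
import math

def rls_update(w, P, phi, d, lam=1.0):
    """One recursive least-squares step for the model d = w^T phi + noise;
    returns the updated estimate w and covariance P (forgetting factor lam)."""
    n = len(phi)
    Pphi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(phi[i] * Pphi[i] for i in range(n))
    gain = [v / denom for v in Pphi]                       # k = P phi / denom
    err = d - sum(w[i] * phi[i] for i in range(n))         # prediction error
    w = [w[i] + gain[i] * err for i in range(n)]
    P = [[(P[i][j] - gain[i] * Pphi[j]) / lam for j in range(n)] for i in range(n)]
    return w, P

# Synthetic demonstration: recover w = [1.0, -0.5] from noiseless data.
w, P = [0.0, 0.0], [[1000.0, 0.0], [0.0, 1000.0]]
for t in range(200):
    phi = [math.sin(0.1 * t), math.cos(0.3 * t)]
    w, P = rls_update(w, P, phi, 1.0 * phi[0] - 0.5 * phi[1])
```

For an online implementation, each new state sample would update the cost-function weights wi in this recursive fashion instead of refitting a whole data batch at every iteration.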
This clearly demonstrates optimal performance based on minimization of the predefined Hamiltonian cost function and improvement of the settling time, leading to optimal control of oxygenation and ultimately ameliorating the adverse effects of hypoxia. In principle, the actor–critic structure is a two-step process involving two different subsystems: policy improvement and policy evaluation. Policy improvement applies the computed control input (u*) to the plant; this input is obtained from Eq. (5) based on a weighting function of the control input (R), the gain of the input dynamics (g(x)), and the gradient of the estimated cost function. Therefore, no knowledge of the internal system dynamics is required, which is a clear advantage of this control strategy; however, the system structure, in terms of the system order, must be predetermined. Subsequently, the critic subsystem computes the infinite-horizon cost associated with the current controller and decides whether to continue with another iteration or to stop. This structure leads to an adaptive learning process that optimally controls the predefined output variable. A possible disadvantage, however, is an endless iteration (or a large number of iterations) before a converged solution is achieved, which may be undesirable in real practice for closed-loop oxygen therapy. Therefore, further implementation in animal experiments is required to test this control strategy in a real clinical scenario. For both the linear and the nonlinear case, stabilizability is assumed. Therefore, the first important issue for the implementation of this control strategy is to find an initial controller that can stabilize the system.
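The two-step actor–critic loop and its stopping criterion can be captured in a short skeleton. This is a generic sketch: `evaluate_policy` and `improve_policy` are placeholder names for the critic and actor subsystems, and the hard iteration cap addresses the endless-iteration disadvantage mentioned above.

```python
def policy_iteration(evaluate_policy, improve_policy, controller, tol=1e-3, max_iter=20):
    """Alternate critic and actor steps until the estimated cost parameters
    converge, or a hard cap is reached (guard against endless iteration)."""
    w_old = None
    for i in range(max_iter):
        w = evaluate_policy(controller)        # critic: estimate cost parameters w_i
        controller = improve_policy(w)         # actor: new u* from the gradient of V
        if w_old is not None and max(abs(p - q) for p, q in zip(w, w_old)) < tol:
            return controller, i + 1           # parameters settled: converged
        w_old = w
    return controller, max_iter                # stopped by the iteration cap

# Toy check on a scalar linear-quadratic example with closed-form critic/actor:
gain, n_iter = policy_iteration(
    lambda k: [(1 + k * k) / (2 * (k + 1))],   # Lyapunov value of policy u = -k*x
    lambda w: w[0],                            # improved gain b*w/R with b = R = 1
    controller=0.5)
# gain converges toward the Riccati-optimal value sqrt(2) - 1.
```

In a clinical deployment, the cap on iterations is exactly the kind of safeguard needed so that the learning phase cannot run indefinitely while the patient is in the loop.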
For this particular application of ventilation therapy, oxygen therapy in common practice can definitely improve oxygenation, and whether the system is stabilizable can be tested by supplying additional FiO2. Hence, the system is typically stabilizable for a treatable patient. However, for extreme pulmonary malfunction where oxygen therapy may not be sufficient, the PIA may not be applicable; in such situations, other therapeutic approaches, for example extracorporeal membrane oxygenation (Hill et al., 1972; Walter et al., 2009), are required in order to improve oxygenation. In our implementation on the interconnecting three-tank system, the PIA gives the optimal result for controlling the water level without requiring knowledge of the internal nonlinear dynamics of the system. The settling time was approximately 130 s. The approach works well in the presence of uncertainty regarding the diameter and length of the connecting tubes, noise from the pressure sensors, and disturbances related to water ripple in the tanks (especially in Tank 1). Based on this practical implementation, a PIA controller should be realizable with optimal performance to cope with the nonlinear time-varying cardiopulmonary system of patients with respiratory insufficiency. As stated, it is a considerable advantage of this control strategy that minimal knowledge of the system dynamics is required to achieve an optimal control solution; moreover, this should offer an optimal outcome for the treatment of patients requiring oxygen therapy. In addition, it should be noted that our experiment with the interconnecting three-tank system is implemented on a deterministic and time-invariant plant, which does not capture all challenges of the real unknown and time-variant system dynamics. To introduce such conditions, further system modifications should be made, for example installing control valves between the tanks to introduce variable flow and thereby induce unknown dynamics in the whole system. Further system assessment with respect to this setup would be useful for experimental validation. Based on the practical implementation of this optimal control technique, the benefits include the following: a quasi model-free implementation of the control strategy, smooth control performance, and an efficient control policy based on the gradient of the estimated cost function (Doya, 2000). During the iteration process, some system parameters might gradually change, in which case the algorithm adaptively tunes the control input to cope with this change. Therefore, if the plant dynamics change significantly over time and no stabilizing control input can be computed, the optimal solution may not be achieved; however, this was not experienced during testing of our practical model. For the application of this algorithm in a real clinical situation, only a small change of the patient dynamics (or time invariance) is assumed for each patient within a certain time frame. This assumption means that the patient's transfer function must not change faster than the convergence time of the PIA controller; the PIA controller should then be able to provide an optimal solution. It should be emphasized that the learning process requires an iteration procedure and, if the system is stabilizable at the initial iteration, performance with respect to patient safety is guaranteed, i.e. hypoxia will not appear during the learning process. Our plan is to implement this control strategy for the control of oxygenation in a patient-in-the-loop configuration using porcine dynamics.
For a feasibility study, both invasive measurement of SaO2 and noninvasive measurement of peripheral oxygen saturation (SpO2) should be implemented to stabilize and control oxygenation. Online parameter estimation should also be carried out during policy evaluation at every iteration. With this promising technique, system integration in various clinical applications can be realized, e.g. in home-based mechanical ventilation for patients with apnea, during anesthesia, in the intensive care unit, or during the transport of patients with severe hypoxia. In addition, a policy iteration algorithm with an H2- or H∞-optimal state feedback controller may be introduced to improve robustness and disturbance rejection (Al-Tamimi, Abu-Khalaf, & Lewis, 2007; Pomprapa, Misgeld, Sorgato et al., 2013).

8. Conclusion

A policy iteration algorithm (PIA) has been proposed for nonlinear control of oxygenation, which could be a potential candidate for closed-loop oxygen therapy. Based on reinforcement learning and optimal control theory, policy improvement and policy evaluation are implemented alternately and iterated until an optimal solution of the Hamilton–Jacobi–Bellman (HJB) equation is achieved, minimizing the Hamiltonian cost function. One advantage of this control strategy is that no knowledge of the system dynamics is required. Therefore, it should be suitable for this particular application, where the model of the patient dynamics is barely known and system identification to achieve accurate model dynamics is time-consuming and critical for patients with severe hypoxia. In addition, the algorithm was implemented on a real nonlinear plant, the interconnecting three-tank system, to demonstrate its control performance. The results acquired with this optimal controller indicate a promising control algorithm for possible implementation for the control of oxygenation.

Acknowledgment

The authors thank the three anonymous reviewers for their constructive comments that have significantly helped to improve this paper.

Appendix A. Evolution of unknown parameters wi

The unknown parameters wi are computed based on the least-squares algorithm. The evolution of the parameters in every iteration is presented on a natural logarithmic scale to show the converged, stable results of all parameters for the control of oxygenation (Fig. A1).

Fig. A1. Evolution of the unknown parameters (wi) for oxygenation control over three iterations on a natural logarithmic scale.

Appendix B. Evaluation of model selection

The evaluation of different model alternatives is carried out based on the input–output response given in Fig. 4. A model structure is defined by (np, nz, nd), where np, nz and nd are the number of poles, the number of zeros and the value of the time delay, respectively. For example, the identified first-order model with time delay of Eq. (9) can be represented by (1,0,11) (Table B1).

Table B1
Evaluation of different model alternatives.

Model (np, nz, nd) | MSE based on system identification (%) | MSE for validation (%)
(1,0,11) | 1.6379 | 0.5070
(2,0,11) | 1.6586 | 0.5167
(2,1,11) | 1.6631 | 0.5103
(3,0,11) | 1.6970 | 0.5377
(3,1,11) | 1.7040 | 0.5353
(3,2,11) | 1.6956 | 0.5326
(4,0,11) | 1.6944 | 0.5364
(4,1,11) | 1.7012 | 0.5337
(4,2,11) | 1.6880 | 0.5120
(4,3,11) | 1.6889 | 0.4929
(4,3,0) | 1.8310 | 0.5345

References

Abu-Khalaf, M., & Lewis, F. (2005). Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica, 41(5), 779–791.
Al-Tamimi, A., Abu-Khalaf, M., & Lewis, F. (2007).
Model-free Q-learning designs for discrete-time zero-sum games with application to H-infinity control. Automatica, 43(3), 473–482.
ARDSNet (2000). Ventilation with lower tidal volumes as compared with traditional tidal volumes for acute lung injury and the acute respiratory distress syndrome. New England Journal of Medicine, 342, 1301–1308.
Ashbaugh, D., Bigelow, D., Petty, T., & Levine, B. (1967). Acute respiratory distress in adults. Lancet, 2(7511), 319–323.
Ballard-Croft, C., Wang, D., Sumpter, R., Zhou, X., & Zwischenberger, J. B. (2012). Large-animal models of acute respiratory distress syndrome. The Annals of Thoracic Surgery, 93(4), 1331–1339.
Bhasin, S., Kamalapurkar, R., Johnson, M., Vamvoudakis, K., Lewis, F., & Dixon, W. (2013). A novel actor–critic–identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica, 49, 82–92.
Brunner, J. (2002). History and principles of closed-loop control applied to mechanical ventilation. Nederlandse Vereniging voor Intensive Care, 6, 6–9.
Clark, J. (1974). The toxicity of oxygen. American Review of Respiratory Disease, 110(6 Pt 2), 40–50.
Claure, N., & Bancalari, E. (2009). Automated respiratory support in newborn infants. Seminars in Fetal and Neonatal Medicine, 14, 35–41.
Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1), 219–245.
East, T., Tolle, C., McJames, S., Farrell, R., & Brunner, J. (1991). A non-linear closed-loop controller for oxygenation based on a clinically proven fifth dimensional quality surface. Anesthesiology, 75(3A), A468.
Giard, M., Bertrand, F., Robert, D., & Pernier, J. (1985). An algorithm for automatic control of O2 and CO2 in artificial ventilation. IEEE Transactions on Biomedical Engineering, BME-32, 658–667.
Hill, J., O'Brien, T., Murray, J., Dontigny, L., Bramson, M., Osborn, J., & Gerbode, F. (1972). Prolonged extracorporeal oxygenation for acute post-traumatic respiratory failure (shock-lung syndrome). Use of the Bramson membrane lung. New England Journal of Medicine, 286, 629–634.
Isermann, R., & Münchhof, M. (2010). Identification of dynamic systems: An introduction with applications. Springer.
Kaelbling, L., Littman, M., & Moore, A. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Kallet, R., & Matthay, M. (2013). Hyperoxic acute lung injury. Respiratory Care, 58(1), 123–141.
Kalman, R. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1), 35–45.
Kamalabady, A., & Salahshoor, K. (2009). Affine modeling of nonlinear multivariable processes using a new adaptive neural network-based approach. Journal of Process Control, 19(3), 380–393.
Kron, A., Böhm, S., & Leonhardt, S. (1999). Analytic model for the CO2 mass balance in mechanically ventilated model. In European medical and biomedical engineering conference.
Kwok, H., Linkens, D., Mahfouf, M., & Mills, G. (2004). Adaptive ventilator FiO2 advisor: Use of non-invasive estimations of shunt. Artificial Intelligence in Medicine, 32, 157–169.
Lachmann, B., Robertson, B., & Vogel, J. (1980). In vivo lung lavage as an experimental model of the respiratory distress syndrome. Acta Anaesthesiologica Scandinavica, 24, 231–236.
Ljung, L. (1999). System identification: Theory for the user. Prentice-Hall.
Lu, G., & Ho, D. (2014). Robust H∞ observer for nonlinear discrete systems with time delay and parameter uncertainties. IEE Proceedings—Control Theory and Applications, 151(4), 439–444.
Lüpschen, H., Zhu, L., & Leonhardt, S. (2009). Robust closed-loop control of the inspired fraction of oxygen for the online assessment of recruitment maneuvers. In Conference proceedings of the IEEE engineering in medicine and biology society (pp. 495–498).
Mach, W., Thimmesch, A., Pierce, J. T., & Pierce, J. D. (2011). Consequences of hyperoxia and the toxicity of oxygen in the lung. Nursing Research and Practice, 2011, 260482.
Mitamura, Y., Mikami, T., & Yamamoto, K. (1975). A dual control system for assisting respiration. Medical & Biological Engineering, 13(6), 846–853.
Ohtake, S., & Yamakita, M. (2010). Adaptive output optimal control algorithm for unknown system dynamics based on policy iteration. In American control conference (pp. 1671–1676).
Orani, N., Pisano, A., Franceschelli, M., Giua, A., & Usai, E. (2011). Robust reconstruction of the discrete state for a class of nonlinear uncertain switched systems. Nonlinear Analysis: Hybrid Systems, 5, 220–232.
Pomprapa, A., Alfocea, S., Göbel, C., Misgeld, B., & Leonhardt, S. (2014). Funnel control for oxygenation during artificial ventilation therapy. In The 19th IFAC world congress.
Pomprapa, A., Mir Wais, M., Walter, M., Misgeld, B., & Leonhardt, S. (2015). Policy iteration algorithm for the control of oxygenation. In The 9th IFAC symposium on biological and medical systems (pp. 517–522).
Pomprapa, A., Misgeld, B., Lachmann, B., & Leonhardt, S. (2013). Closed-loop ventilation of oxygenation and end-tidal CO2. In IEEE systems, man and cybernetics society.
Pomprapa, A., Misgeld, B., Sorgato, V., Stollenwerk, A., Walter, M., & Leonhardt, S. (2013). Robust control of end-tidal CO2 using the H∞ loop-shaping approach. Acta Polytechnica, 53(6), 895–900.
Pomprapa, A., Pikkemaat, R., Lüpschen, H., Lachmann, B., & Leonhardt, S. (2010). Self-tuning adaptive control of the arterial oxygen saturation for automated "Open Lung" maneuvers. In 3. Dresdner Medizintechnik-Symposium.
Pomprapa, A., Schwaiberger, D., Pickerodt, P., Tjarks, O., Lachmann, B., & Leonhardt, S. (2014). Automatic protective ventilation using the ARDSNet protocol with the additional monitoring of electrical impedance tomography. Critical Care, 18, R128.
Pottmann, M., & Pearson, R. (1998). Block-oriented NARMAX models with output multiplicities. AIChE Journal, 44, 131–140.
Saihi, K., Richard, J.-C., Gonin, X., Krüger, T., Dojat, M., & Brochard, L. (2014). Feasibility and reliability of an automated controller of inspired oxygen concentration during mechanical ventilation. Critical Care, 18, R35.
Suda, H., & Yamakita, M. (2013). On adaptive optimal control for hybrid systems. In IEEE multi-conference on systems and control (pp. 859–864).
Sun, F., Sun, Z., Li, N., & Zhang, L. (1999). Stable adaptive control for robot trajectory tracking using dynamic neural networks. Machine Intelligence & Robotic Control, 1(2), 71–78.
Tarpy, S., & Celli, B. (1995). Long-term oxygen therapy. New England Journal of Medicine, 333, 710–714.
Vamvoudakis, K., & Lewis, F. (2010). Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica, 46, 878–888.
Vamvoudakis, K., Vrabie, D., & Lewis, F. (2009). Online policy iteration based algorithms to solve the continuous-time infinite horizon optimal control problem. In IEEE symposium on adaptive dynamic programming and reinforcement learning (pp. 36–41).
Volterra, V. (1959). Theory of functionals and of integrals and integro-differential equations. Dover Publications.
Vrabie, D., Pastravanu, O., Abu-Khalaf, M., & Lewis, F. (2009). Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 45, 477–484.
Walter, M., Wartzek, T., Kopp, R., Stollenwerk, A., Kashefi, A., & Leonhardt, S. (2009). Automation of long term extracorporeal oxygenation systems. In European control conference (pp. 2482–2486).
Wang, Y., Zhou, R., & Wen, C. (1993). Robust load-frequency controller design for power systems. In IEE proceedings generation, transmission and distribution (pp. 11–16).
Wernrud, A. (2007). Strategies for computing switching feedback controllers. In American control conference (pp. 1389–1394).
West, J. (2012). Respiratory physiology: The essentials (9th ed.). LWW.
Wukitsch, M., Petterson, M. T., Tobler, D. R., & Pologe, J. A. (1988). Pulse oximetry: Analysis of theory, technology, and practice. Journal of Clinical Monitoring, 4(4), 290–301.
Xu, X., Hu, D., & Lu, X. (2007). Kernel-based least squares policy iteration for reinforcement learning. IEEE Transactions on Neural Networks, 18(4), 973–992.
Yu, C., He, W., So, J., Roy, R., Kaufman, H., & Newell, J. (1987). Improvement in arterial oxygen control using multiple model adaptive control procedures. IEEE Transactions on Biomedical Engineering, 34(8), 567–574.