Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Pontryagin Principle and Sequential Quadratic Hamiltonian Methods

University of Pisa

Overview

The previous lectures introduced nonsmooth PDE-constrained optimization, active-set methods, and Newton-type algorithms for nonlinear optimality systems. We now return to smooth nonlinear control problems and look at them from a slightly different angle: the Hamiltonian structure of optimal control.

The goal of this lecture is to explain the Sequential Quadratic Hamiltonian method, usually abbreviated as SQH. The method can be interpreted as a structure-preserving Newton or SQP strategy applied to the full state, adjoint, and control system.

The main ideas are:

The method is especially useful for nonlinear PDE-constrained optimization, where reduced gradients are easy to compute but reduced Hessians are expensive to form explicitly.

Throughout the lecture, the emphasis is conceptual and algebraic. The linear algebra and implementation issues are the same ones we have already met in KKT and Newton systems: block structure, saddle-point matrices, preconditioning, and globalization.


From Optimality Systems to Hamiltonians

Consider an abstract PDE-constrained optimization problem:

min(y,u)J(y,u)\min_{(y,u)} J(y,u)

subject to

E(y,u)=0.\mathcal E(y,u)=0.

Here:

For example, a semilinear distributed control problem has the form

Δy+d(y)=u+fin Ω,-\Delta y + d(y) = u+f \qquad\text{in }\Omega,

with homogeneous Dirichlet boundary conditions, and cost functional

J(y,u)=12yydL2(Ω)2+α2uL2(Ω)2.J(y,u) = \frac12\|y-y_d\|_{L^2(\Omega)}^2 + \frac\alpha2\|u\|_{L^2(\Omega)}^2.

Introduce an adjoint variable pp and define the Lagrangian

L(y,u,p):=J(y,u)+p,E(y,u).\mathcal L(y,u,p) := J(y,u)+\langle p,\mathcal E(y,u)\rangle.

The first-order optimality system is

Lp(y,u,p)=0,Ly(y,u,p)=0,Lu(y,u,p)=0.\mathcal L_p(y,u,p)=0, \qquad \mathcal L_y(y,u,p)=0, \qquad \mathcal L_u(y,u,p)=0.

These equations are, respectively:

In this abstract PDE setting, the Hamiltonian is the function on state-control-adjoint variables defined by

H(y,u,p):=J(y,u)+p,E(y,u).\mathcal H(y,u,p) := J(y,u)+\langle p,\mathcal E(y,u)\rangle.

If the objective and the PDE residual are written through local densities, for instance

J(y,u)=Ω(y,u)dx,E(y,u)(x)=e(y(x),u(x)),J(y,u)=\int_\Omega \ell(y,u)\,dx, \qquad \mathcal E(y,u)(x)=e(y(x),u(x)),

then the corresponding Hamiltonian density is formally

H(y,u,p):=(y,u)+pe(y,u).H(y,u,p) := \ell(y,u)+p\,e(y,u).

The equations

Hp=0,Hy=0,Hu=0\mathcal H_p=0, \qquad \mathcal H_y=0, \qquad \mathcal H_u=0

are precisely the state, adjoint, and control stationarity equations written in Hamiltonian form.


Pontryagin Principle for the Abstract PDE Problem

For the abstract problem introduced above, the Pontryagin principle can be read as a first-order optimality system in function spaces.

Let

minuUadJ(y,u)\min_{u\in U_{ad}} J(y,u)

subject to

E(y,u)=0,\mathcal E(y,u)=0,

where

E:Y×UP\mathcal E:Y\times U\to P^*

maps the state and control into the dual of the adjoint space. Assume for the moment that the problem is smooth and that a constraint qualification holds at a local solution (yˉ,uˉ)(\bar y,\bar u).

Then there exists an adjoint variable

pˉP\bar p\in P

such that the Lagrangian

L(y,u,p):=J(y,u)+p,E(y,u)P,P\mathcal L(y,u,p) := J(y,u)+\langle p,\mathcal E(y,u)\rangle_{P,P^*}

satisfies the following conditions.

The state equation is recovered by differentiating with respect to the adjoint variable:

Lp(yˉ,uˉ,pˉ)[q]=q,E(yˉ,uˉ)P,P=0qP.\mathcal L_p(\bar y,\bar u,\bar p)[q] = \langle q,\mathcal E(\bar y,\bar u)\rangle_{P,P^*} =0 \qquad \forall q\in P.

Equivalently,

E(yˉ,uˉ)=0.\mathcal E(\bar y,\bar u)=0.

The adjoint equation is obtained by differentiating with respect to the state:

Ly(yˉ,uˉ,pˉ)[z]=Jy(yˉ,uˉ)[z]+pˉ,Ey(yˉ,uˉ)zP,P=0zY.\mathcal L_y(\bar y,\bar u,\bar p)[z] = J_y(\bar y,\bar u)[z] + \langle \bar p,\mathcal E_y(\bar y,\bar u)z\rangle_{P,P^*} =0 \qquad \forall z\in Y.

In operator notation this is

Ey(yˉ,uˉ)pˉ+Jy(yˉ,uˉ)=0in Y.\mathcal E_y(\bar y,\bar u)^*\bar p + J_y(\bar y,\bar u) =0 \qquad\text{in }Y^*.

Finally, the control condition is

Lu(yˉ,uˉ,pˉ)[vuˉ]0vUad.\mathcal L_u(\bar y,\bar u,\bar p)[v-\bar u]\ge 0 \qquad \forall v\in U_{ad}.

If Uad=UU_{ad}=U, this reduces to the stationarity equation

Eu(yˉ,uˉ)pˉ+Ju(yˉ,uˉ)=0in U.\mathcal E_u(\bar y,\bar u)^*\bar p + J_u(\bar y,\bar u) =0 \qquad\text{in }U^*.

If UadU_{ad} is a closed convex set, the same condition is a variational inequality:

Ju(yˉ,uˉ)+Eu(yˉ,uˉ)pˉ,vuˉU,U0vUad.\left\langle J_u(\bar y,\bar u) + \mathcal E_u(\bar y,\bar u)^*\bar p, v-\bar u \right\rangle_{U^*,U} \ge 0 \qquad \forall v\in U_{ad}.

This is the infinite-dimensional version of the maximum principle. In the unconstrained case the Hamiltonian is stationary with respect to the control; with control constraints the optimal control minimizes the Hamiltonian over the admissible set.

More explicitly, if the PDE and the cost admit local densities, one may use the Hamiltonian density

H(y,u,p):=(y,u)+pe(y,u),H(y,u,p) := \ell(y,u)+p\,e(y,u),

and the Pontryagin condition becomes the pointwise or weak minimization condition

H(yˉ,uˉ,pˉ)=minvUadH(yˉ,v,pˉ),H(\bar y,\bar u,\bar p) = \min_{v\in U_{ad}} H(\bar y,v,\bar p),

or, in the smooth unconstrained case,

Hu(yˉ,uˉ,pˉ)=0.H_u(\bar y,\bar u,\bar p)=0.

For PDE-constrained optimization, this statement is usually interpreted weakly through the Lagrangian derivatives above. The adjoint variable is the dual Hamiltonian variable, and the triple

(yˉ,uˉ,pˉ)(\bar y,\bar u,\bar p)

solves a coupled state-adjoint-control system.


Hamiltonian Form of a PDE-Constrained Problem

Let YY, UU, and PP be Hilbert spaces for the state, control, and adjoint. Suppose the PDE residual is

E(y,u)=Ay+d(y)Buf.\mathcal E(y,u)=Ay+d(y)-Bu-f.

The corresponding Lagrangian is

L(y,u,p)=J(y,u)+p,Ay+d(y)Buf.\mathcal L(y,u,p) = J(y,u) + \langle p,Ay+d(y)-Bu-f\rangle.

For a quadratic tracking cost

J(y,u)=12yydY2+α2uU2,J(y,u) = \frac12\|y-y_d\|_Y^2 + \frac\alpha2\|u\|_U^2,

the first-order system reads

Ay+d(y)Buf=0,Ay+d(y)-Bu-f=0,
Ap+d(y)p+yyd=0,A^*p+d'(y)^*p+y-y_d=0,
αuBp=0.\alpha u-B^*p=0.

The nonlinear residual can be written compactly as

F(y,u,p)=(Ay+d(y)BufAp+d(y)p+yydαuBp)=0.F(y,u,p) = \begin{pmatrix} Ay+d(y)-Bu-f\\ A^*p+d'(y)^*p+y-y_d\\ \alpha u-B^*p \end{pmatrix} =0.

A standard full-space Newton method would solve

F(yk,uk,pk)(δyδuδp)=F(yk,uk,pk).F'(y_k,u_k,p_k) \begin{pmatrix} \delta y\\ \delta u\\ \delta p \end{pmatrix} = -F(y_k,u_k,p_k).

This Newton viewpoint is useful, but it hides the specific optimal-control idea behind SQH. The method comes from the Pontryagin principle and from successive pointwise Hamiltonian optimization. We now build that path in steps.


Pontryagin Maximum Principle

We write the controlled PDE in the abstract form

E(y,u)=0,\mathcal E(y,u)=0,

and assume that for every admissible control uUadu\in U_{ad} there is a unique state

yu=S(u).y_u=S(u).

The reduced cost is

j(u):=J(yu,u).j(u):=J(y_u,u).

To connect with the SQH algorithm, it is useful to write the PDE locally as

e(z,y(z),u(z))=0,e(z,y(z),u(z))=0,

where zz denotes the independent variable. Depending on the model, zz may be:

This notation suppresses derivatives and weak-form terms. For example, in an elliptic PDE the symbol ee contains the differential operator acting on yy, not only the point value of yy.

The Hamiltonian density is written as

H(z,y,v,u,p),H(z,y,v,u,p),

where:

For the unregularized Pontryagin principle the dependence on the reference control is absent, and we simply write

H(z,y,v,p).H(z,y,v,p).

The Pontryagin Maximum Principle states that, if uˉ\bar u is optimal and yˉ=yuˉ\bar y=y_{\bar u} is the corresponding state, then there exists an adjoint state pˉ\bar p such that:

  1. yˉ\bar y solves the state equation with control uˉ\bar u;

  2. pˉ\bar p solves the adjoint equation associated with (yˉ,uˉ)(\bar y,\bar u);

  3. the optimal control satisfies the pointwise Hamiltonian condition

H(z,yˉ(z),uˉ(z),pˉ(z))=maxvKad(z)H(z,yˉ(z),v,pˉ(z))H(z,\bar y(z),\bar u(z),\bar p(z)) = \max_{v\in K_{ad}(z)} H(z,\bar y(z),v,\bar p(z))

for almost every zz.

Here Kad(z)K_{ad}(z) is the set of pointwise admissible values. For example, for box constraints,

Kad(z)=[ua(z),ub(z)].K_{ad}(z)=[u_a(z),u_b(z)].

The sign convention is important but not essential. With the opposite sign in the adjoint equation, the maximum condition becomes an equivalent minimum condition. The algorithmic point is the same: after the state and adjoint are known, the control update is obtained from a local optimization problem in the variable vv.


Rozonoer Estimate

The reason the Pontryagin condition is algorithmically useful is that Hamiltonian improvement implies cost improvement, up to higher-order terms. This idea goes back to Rozonoer’s analysis of successive approximation methods for optimal control.

Let uu and vv be two admissible controls, with corresponding states

yu=S(u),yv=S(v).y_u=S(u), \qquad y_v=S(v).

Let pup_u be the adjoint associated with (yu,u)(y_u,u). Under the usual smoothness and stability assumptions, one can estimate the cost difference as

J(yv,v)J(yu,u)Z[H(z,yu(z),v(z),pu(z))H(z,yu(z),u(z),pu(z))]dz+CvuU2.J(y_v,v)-J(y_u,u) \le - \int_Z \left[ H(z,y_u(z),v(z),p_u(z)) - H(z,y_u(z),u(z),p_u(z)) \right]\,dz\\ \quad+ C\|v-u\|_U^2.

The domain ZZ is the variable domain of the control: it may be a time interval, a spatial domain, or a space-time cylinder.

The estimate should be read as follows:

Thus, if vv increases the Hamiltonian enough pointwise, then the total cost decreases.

This is the key estimate behind successive approximation schemes. It turns a global optimal-control problem into repeated local Hamiltonian optimizations, followed by a global state solve.


Successive Approximation Schemes

A basic successive approximation scheme starts from a control u0u^0 and then repeats the following operations.

Given uku^k:

  1. solve the state equation to obtain yk=yuky^k=y_{u^k};

  2. solve the adjoint equation associated with (yk,uk)(y^k,u^k) to obtain pkp^k;

  3. compute a new control by pointwise Hamiltonian maximization,

uk+1(z)arg maxwKad(z)H(z,yk(z),w,pk(z));u^{k+1}(z) \in \operatorname*{arg\,max}_{w\in K_{ad}(z)} H(z,y^k(z),w,p^k(z));
  1. solve the state equation again with control uk+1u^{k+1}.

The central feature is Step 3. It is not a PDE solve. It is a pointwise optimization problem. In many important cases it can be computed explicitly:

This explains why the PMP is attractive computationally. The expensive operations are the state and adjoint solves; the control update is local.

The weakness of the basic scheme is that the pointwise maximizer may be too aggressive. It can produce a control far from uku^k, so the quadratic remainder in the Rozonoer estimate may dominate the Hamiltonian gain. In that case the cost may fail to decrease.


Robust Successive Approximation

To stabilize the method, one modifies the Hamiltonian by penalizing large changes in the control. Given a parameter ε>0\varepsilon>0, define

Hε(z,y,w,u,p):=H(z,y,w,p)εwu2.H_\varepsilon(z,y,w,u,p) := H(z,y,w,p) - \varepsilon |w-u|^2.

The reference value uu is the current control. At iteration kk, the local control update becomes

uk+1(z)arg maxwKad(z)Hε(z,yk(z),w,uk(z),pk(z)).u^{k+1}(z) \in \operatorname*{arg\,max}_{w\in K_{ad}(z)} H_\varepsilon(z,y^k(z),w,u^k(z),p^k(z)).

The penalty term has two effects:

If ε\varepsilon is too small, the method may still be unstable. If ε\varepsilon is too large, the update is very small and convergence becomes slow. Therefore ε\varepsilon is adapted during the iteration.

The descent test is based on the actual cost decrease. Let

τk:=uk+1ukL2(Z)2.\tau_k:=\|u^{k+1}-u^k\|_{L^2(Z)}^2.

For a prescribed η>0\eta>0, accept the new control if

J(yk+1,uk+1)J(yk,uk)ητk.J(y^{k+1},u^{k+1})-J(y^k,u^k) \le - \eta \tau_k.

If the test fails, increase ε\varepsilon and solve the pointwise optimization problem again. If the test succeeds, decrease ε\varepsilon so that the next iteration can try a less conservative update.

This is the robust successive approximation mechanism that leads directly to the SQH algorithm.


The Sequential Quadratic Hamiltonian Algorithm

The Sequential Quadratic Hamiltonian method uses the robust Hamiltonian

Hε(z,y,w,u,p)=H(z,y,w,p)εwu2H_\varepsilon(z,y,w,u,p) = H(z,y,w,p)-\varepsilon |w-u|^2

inside a successive approximation loop.

The word “quadratic” refers to the stabilizing quadratic term in the control increment. The algorithm is still driven by the Pontryagin condition: at each iteration the new control is obtained by solving a pointwise Hamiltonian optimization problem.

Choose:

Set

τ>κ,k=0,\tau>\kappa, \qquad k=0,

and compute the initial state y0y^0 from the governing model with control u0u^0.

While

k<kmaxandτ>κ,k<k_{\max} \qquad\text{and}\qquad \tau>\kappa,

perform the following steps.

  1. Compute the adjoint pkp^k associated with the current pair (yk,uk)(y^k,u^k).

  2. Determine uk+1u^{k+1} by solving the pointwise optimization problem

Hε(z,yk(z),uk+1(z),uk(z),pk(z))=maxwKad(z)Hε(z,yk(z),w,uk(z),pk(z))H_\varepsilon \left( z,y^k(z),u^{k+1}(z),u^k(z),p^k(z) \right) = \max_{w\in K_{ad}(z)} H_\varepsilon \left( z,y^k(z),w,u^k(z),p^k(z) \right)

for almost every zZz\in Z.

This is the defining local step of SQH. The variable zz may be tt, xx, or (x,t)(x,t), depending on the control problem.

  1. Compute the new state yk+1y^{k+1} by solving the governing model with control uk+1u^{k+1}.

  2. Compute the update size

τ:=uk+1ukL2(Z)2.\tau:=\|u^{k+1}-u^k\|_{L^2(Z)}^2.
  1. Check the actual cost decrease.

If

J(yk+1,uk+1)J(yk,uk)>ητ,J(y^{k+1},u^{k+1})-J(y^k,u^k) > - \eta\tau,

then the decrease is not sufficient. Increase the regularization parameter,

ε:=σε,\varepsilon:=\sigma\varepsilon,

and return to Step 2 with the same (yk,uk,pk)(y^k,u^k,p^k).

If instead

J(yk+1,uk+1)J(yk,uk)ητ,J(y^{k+1},u^{k+1})-J(y^k,u^k) \le - \eta\tau,

then accept the update, decrease the regularization parameter,

ε:=ζε,\varepsilon:=\zeta\varepsilon,

and continue.

  1. Set

k:=k+1.k:=k+1.

The loop stops when the control update is smaller than the tolerance or when the maximum number of iterations is reached.


Convergence Theorem

The SQH method is a successive approximation scheme with an augmented Hamiltonian. The role of the adaptive parameter ε\varepsilon is to guarantee a sufficient decrease of the cost functional. In Step 2, an exact pointwise maximization is convenient, but the convergence mechanism only needs a sufficient, possibly partial, Hamiltonian improvement.

The following theorem records the key descent estimate. We state it without proof.

Theorem. Let

(yk,uk)and(yk+1,uk+1)(y^k,u^k) \qquad\text{and}\qquad (y^{k+1},u^{k+1})

be generated by the SQH algorithm, and assume that uku^k and uk+1u^{k+1} are measurable. Under appropriate smoothness, boundedness, and stability assumptions on the state equation and on the running cost \ell, there exists a constant

θ>0\theta>0

independent of ε>0\varepsilon>0 such that, for the value of ε\varepsilon currently chosen by the SQH algorithm,

J(yk+1,uk+1)J(yk,uk)(εθ)uk+1ukL2(Z)2.J(y^{k+1},u^{k+1}) - J(y^k,u^k) \le - (\varepsilon-\theta) \|u^{k+1}-u^k\|_{L^2(Z)}^2.

In particular, if

εθ+η,\varepsilon\ge \theta+\eta,

and

τ=uk+1ukL2(Z)2,\tau=\|u^{k+1}-u^k\|_{L^2(Z)}^2,

then

J(yk+1,uk+1)J(yk,uk)ητ.J(y^{k+1},u^{k+1}) - J(y^k,u^k) \le - \eta\tau.

This is exactly the sufficient-decrease condition used in Step 5 of the algorithm. Therefore, if a trial update does not decrease the cost enough, increasing ε\varepsilon eventually makes the Hamiltonian update conservative enough to satisfy the acceptance test.


The algorithm can therefore be summarized in one sentence:

SQH alternates global state-adjoint solves with a pointwise maximization of a quadratically regularized Hamiltonian.

This is the essential distinction from a generic Newton method. Newton linearizes the full KKT system; SQH uses the Pontryagin structure to turn the control step into a local Hamiltonian optimization problem, while the PDE coupling remains in the state and adjoint equations.


Course Summary: Algorithmic Map

We close the course with a compact map of the main algorithmic families we have seen. The table is deliberately practical: it compares what is solved at each iteration, the dominant computational cost, and the kind of admissible sets each method naturally handles.

Here “one PDE solve” means one elliptic solve in stationary problems, or one full forward or backward time march in parabolic problems.

MethodFormulationTypical cost per iterationAdvantagesLimitationsAdmissible sets and nonsmoothness
Reduced gradient descentReduced: optimize j(u)=J(S(u),u)j(u)=J(S(u),u)One state solve + one adjoint solve; line search adds extra state solvesSimple, robust, low memory, easy to implementOften slow; sensitive to scaling; first-order onlyBest for smooth convex Uad=UU_{ad}=U; can handle simple constraints only through projection or penalties
Armijo / Wolfe line searchGlobalization layer for reduced methodsSeveral trial cost evaluations, hence extra state solvesGives reliable decrease; stabilizes gradient, CG, BFGSCan dominate cost if each trial requires a PDE solveWorks for smooth problems; projected variants needed for constrained sets
Nonlinear conjugate gradientReducedOne gradient evaluation per accepted step, plus line searchBetter than steepest descent with little extra memoryLess robust on strongly nonlinear or nonsmooth problemsMostly smooth Uad=UU_{ad}=U or simple projected variants
BFGS / L-BFGSReduced quasi-NewtonOne gradient evaluation + line search; stores curvature pairsOften much faster than gradient descent; no exact HessianNeeds smoothness and good line search; curvature updates can fail near nonsmooth active setsSmooth problems; L-BFGS-B-style variants for boxes, but nonsmooth terms need splitting
Reduced Newton / trust regionReduced second-orderHessian or Hessian-vector products; each product may require incremental state/adjoint solvesFast local convergence; good for nonlinear smooth problemsMore complex; expensive linear algebra; needs globalizationSmooth nonconvex problems if Hessian model and globalization are adequate; constraints require SQP/TR machinery
All-at-once KKT solveSimultaneous state-adjoint-control systemOne large saddle-point linear solve for linear-quadratic problemsSolves linear-quadratic unconstrained problems in one shot; exposes block structureIndefinite systems; preconditioning is essentialNatural for equality-constrained smooth problems; inequalities need complementarity or active sets
Projected gradientReduced constrainedOne state + one adjoint + pointwise projection; line search may add state solvesVery simple for box constraints; active set visible through saturationFirst-order convergence; step-size dependentExcellent for closed convex simple sets such as boxes; not suitable for nonconvex UadU_{ad} without modifications
Primal-dual active set (PDAS)KKT/complementarityActive-set prediction + constrained KKT solve per iterationOften finite or very fast active-set convergence; natural for box constraintsRequires good active-set logic and linear solvers; can oscillate without safeguardsVery good for convex box constraints and complementarity systems; equivalent to semismooth Newton in many cases
Semismooth NewtonNonsmooth equation / generalized derivativeLinearized generalized KKT solve per iterationSuperlinear local convergence for structured nonsmoothnessMore technical; needs semismooth reformulation and active-set identificationExcellent for projections, max/min, L1L^1 terms, complementarity; usually assumes convex structure
Subgradient descentReduced nonsmoothOne state + one adjoint + choice of subgradientMost elementary nonsmooth method; conceptually robustVery slow; difficult step-size tuning; subgradient may be nonuniqueHandles convex nonsmooth functionals such as L1L^1, but rarely the best computational choice
Proximal gradient / forward-backward splittingReduced composite g(u)+ψ(u)g(u)+\psi(u)One gradient evaluation for gg + local proximal map for ψ\psiExploits exact nonsmooth structure; cheap local updates; natural for sparsityStill first-order; needs proximal map and step-size controlExcellent for convex nonsmooth terms such as L1L^1 and boxes when prox/projection is explicit
Sparse-control PDASSlack-variable or multiplier KKTActive-set update + linear solve; often few iterations near solutionCaptures sparsity pattern sharply; much faster than subgradient methodsMore implementation effort; active sets can be delicateStrong for convex L1L^1-type sparsity and box constraints; relies on complementarity structure
One-shot Newton KKT for nonlinear/inverse problemsSimultaneous nonlinear KKTAssemble residual/Jacobian + large Newton correction; line search/damping may add residual evaluationsTreats state, adjoint, and parameter together; powerful for inverse problemsNonlinear saddle-point systems; preconditioning and globalization are hardSmooth nonconvex problems possible, but only local; boxes require PDAS or projection safeguards
SQPFull-space or reduced constrained optimizationQuadratic subproblem with linearized constraints; cost depends on subproblem solveGeneral framework for smooth constrained nonlinear problemsHeavy machinery; globalization and Hessian approximation matterHandles smooth convex constraints well; nonconvex constraints possible but only with local guarantees
SQHPontryagin / Hamiltonian successive approximationOne adjoint solve + pointwise Hamiltonian maximization + one state solve; repeated local solve if ε\varepsilon changesUses PMP structure; control update is local; adaptive ε\varepsilon guarantees sufficient decreaseDepends on Hamiltonian structure; still local for nonconvex problems; theory needs smoothness/stability assumptionsVery flexible for pointwise admissible sets Kad(z)K_{ad}(z), including nonconvex or finite-valued sets, because Step 2 is a local optimization

The main dividing lines are these:

As a rule of thumb: