Pontryagin Principle and Sequential Quadratic Hamiltonian Methods - NMOPT

Overview

The previous lectures introduced nonsmooth PDE-constrained optimization, active-set methods, and Newton-type algorithms for nonlinear optimality systems. We now return to smooth nonlinear control problems and look at them from a slightly different angle: the Hamiltonian structure of optimal control.

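The goal of this lecture is to explain the Sequential Quadratic Hamiltonian method, usually abbreviated as SQH. The method is a successive approximation scheme built on the Pontryagin principle: it alternates state and adjoint solves with a pointwise optimization of a quadratically regularized Hamiltonian. Unlike Newton or SQP strategies applied to the full state, adjoint, and control system, it does not linearize the optimality system.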

The main ideas are:

optimal control problems naturally generate Hamiltonian systems;
the state and adjoint equations form a primal-dual pair;
second-order methods can be written without explicitly assembling a reduced Hessian;
SQH solves a sequence of pointwise optimization problems for a Hamiltonian augmented by a quadratic penalty on the control increment.

The method is especially useful for nonlinear PDE-constrained optimization, where reduced gradients are easy to compute but reduced Hessians are expensive to form explicitly.

Throughout the lecture, the emphasis is conceptual and algebraic. The linear algebra and implementation issues are the same ones we have already met in KKT and Newton systems: block structure, saddle-point matrices, preconditioning, and globalization.

From Optimality Systems to Hamiltonians

Consider an abstract PDE-constrained optimization problem:

\min_{(y,u)} J(y,u)

(1)

subject to

\mathcal E(y,u)=0.

(2)

Here:

$y$ is the state;
$u$ is the control;
$\mathcal E(y,u)=0$ is the state equation.

For example, a semilinear distributed control problem has the form

-\Delta y + d(y) = u+f \qquad\text{in }\Omega,

(3)

with homogeneous Dirichlet boundary conditions, and cost functional

J(y,u) = \frac12\|y-y_d\|_{L^2(\Omega)}^2 + \frac\alpha2\|u\|_{L^2(\Omega)}^2.

(4)

Introduce an adjoint variable $p$ and define the Lagrangian

\mathcal L(y,u,p) := J(y,u)+\langle p,\mathcal E(y,u)\rangle.

(5)

The first-order optimality system is

\mathcal L_p(y,u,p)=0, \qquad \mathcal L_y(y,u,p)=0, \qquad \mathcal L_u(y,u,p)=0.

(6)

These equations are, respectively:

the state equation;
the adjoint equation;
the stationarity equation with respect to the control.

In this abstract PDE setting, the Hamiltonian is the function on state-control-adjoint variables defined by

\mathcal H(y,u,p) := J(y,u)+\langle p,\mathcal E(y,u)\rangle.

(7)

If the objective and the PDE residual are written through local densities, for instance

J(y,u)=\int_\Omega \ell(y,u)\,dx, \qquad \mathcal E(y,u)(x)=e(y(x),u(x)),

(8)

then the corresponding Hamiltonian density is formally

H(y,u,p) := \ell(y,u)+p\,e(y,u).

(9)

The equations

\mathcal H_p=0, \qquad \mathcal H_y=0, \qquad \mathcal H_u=0

(10)

are precisely the state, adjoint, and control stationarity equations written in Hamiltonian form.
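As a worked instance, take the semilinear example (3)-(4). Assuming the residual is written with the sign $e=0 \Leftrightarrow -\Delta y+d(y)=u+f$ (a formal choice, since $e$ contains the Laplacian and is not a pure point function), the densities are

\ell(y,u)=\frac12(y-y_d)^2+\frac\alpha2u^2, \qquad e(y,u)=-\Delta y+d(y)-u-f,

so that

H(y,u,p)=\frac12(y-y_d)^2+\frac\alpha2u^2+p\,\bigl(-\Delta y+d(y)-u-f\bigr).

Differentiating pointwise with respect to $u$ gives the control condition

H_u=\alpha u-p=0,

and stationarity with respect to $y$, after integrating the term $p\,(-\Delta y)$ by parts with $p=0$ on the boundary, gives the adjoint equation

-\Delta p+d'(y)\,p+y-y_d=0.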

Pontryagin Principle for the Abstract PDE Problem

For the abstract problem introduced above, the Pontryagin principle can be read as a first-order optimality system in function spaces.

Let

\min_{u\in U_{ad}} J(y,u)

(11)

subject to

\mathcal E(y,u)=0,

(12)

where

\mathcal E:Y\times U\to P^*

(13)

maps the state and control into the dual of the adjoint space. Assume for the moment that the problem is smooth and that a constraint qualification holds at a local solution $(\bar y,\bar u)$.

Then there exists an adjoint variable

\bar p\in P

(14)

such that the Lagrangian

\mathcal L(y,u,p) := J(y,u)+\langle p,\mathcal E(y,u)\rangle_{P,P^*}

(15)

satisfies the following conditions.

The state equation is recovered by differentiating with respect to the adjoint variable:

\mathcal L_p(\bar y,\bar u,\bar p)[q] = \langle q,\mathcal E(\bar y,\bar u)\rangle_{P,P^*} =0 \qquad \forall q\in P.

(16)

Equivalently,

\mathcal E(\bar y,\bar u)=0.

(17)

The adjoint equation is obtained by differentiating with respect to the state:

\mathcal L_y(\bar y,\bar u,\bar p)[z] = J_y(\bar y,\bar u)[z] + \langle \bar p,\mathcal E_y(\bar y,\bar u)z\rangle_{P,P^*} =0 \qquad \forall z\in Y.

(18)

In operator notation this is

\mathcal E_y(\bar y,\bar u)^*\bar p + J_y(\bar y,\bar u) =0 \qquad\text{in }Y^*.

(19)

Finally, the control condition is

\mathcal L_u(\bar y,\bar u,\bar p)[v-\bar u]\ge 0 \qquad \forall v\in U_{ad}.

(20)

If $U_{ad}=U$, this reduces to the stationarity equation

\mathcal E_u(\bar y,\bar u)^*\bar p + J_u(\bar y,\bar u) =0 \qquad\text{in }U^*.

(21)

If $U_{ad}$ is a closed convex set, the same condition is a variational inequality:

\left\langle J_u(\bar y,\bar u) + \mathcal E_u(\bar y,\bar u)^*\bar p, v-\bar u \right\rangle_{U^*,U} \ge 0 \qquad \forall v\in U_{ad}.

(22)

This is the infinite-dimensional version of the maximum principle. In the unconstrained case the Hamiltonian is stationary with respect to the control; with control constraints the optimal control minimizes the Hamiltonian over the admissible set.

More explicitly, if the PDE and the cost admit local densities, one may use the Hamiltonian density

H(y,u,p) := \ell(y,u)+p\,e(y,u),

(23)

and the Pontryagin condition becomes the pointwise or weak minimization condition

H(\bar y,\bar u,\bar p) = \min_{v\in U_{ad}} H(\bar y,v,\bar p),

(24)

or, in the smooth unconstrained case,

H_u(\bar y,\bar u,\bar p)=0.

(25)

For PDE-constrained optimization, this statement is usually interpreted weakly through the Lagrangian derivatives above. The adjoint variable is the dual Hamiltonian variable, and the triple

(\bar y,\bar u,\bar p)

(26)

solves a coupled state-adjoint-control system.
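For box constraints, the variational inequality (22) can be checked pointwise. The sketch below is only an illustration under stated assumptions: a scalar control value, a frozen adjoint value `p`, and the special case $J_u=\alpha u$, $\mathcal E_u=-I$, so that the inequality reads $(\alpha\bar u-p)(v-\bar u)\ge 0$. The helper names `project` and `vi_holds` are hypothetical. It verifies on sampled admissible values that the projection of $p/\alpha$ onto the box satisfies the inequality.

```python
def project(value, lower, upper):
    """Projection of a scalar onto the interval [lower, upper]."""
    return min(upper, max(lower, value))


def vi_holds(u_bar, p, alpha, lower, upper, samples=201, tol=1e-12):
    """Check (alpha*u_bar - p) * (v - u_bar) >= 0 for sampled v in [lower, upper]."""
    grad = alpha * u_bar - p
    for i in range(samples):
        v = lower + (upper - lower) * i / (samples - 1)
        if grad * (v - u_bar) < -tol:
            return False
    return True


alpha, lower, upper = 0.1, -1.0, 1.0
results = []
for p in (-0.5, -0.05, 0.0, 0.03, 0.5):   # made-up adjoint values
    u_bar = project(p / alpha, lower, upper)
    results.append(vi_holds(u_bar, p, alpha, lower, upper))
    print(p, u_bar, results[-1])
```

When $p/\alpha$ lies inside the box the gradient vanishes; when it lies outside, the control sits on the nearest bound and the gradient has the sign that makes the inequality hold.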

Hamiltonian Form of a PDE-Constrained Problem

Let $Y$, $U$, and $P$ be Hilbert spaces for the state, control, and adjoint. Suppose the PDE residual is

\mathcal E(y,u)=Ay+d(y)-Bu-f.

(27)

The corresponding Lagrangian is

\mathcal L(y,u,p) = J(y,u) + \langle p,Ay+d(y)-Bu-f\rangle.

(28)

For a quadratic tracking cost

J(y,u) = \frac12\|y-y_d\|_Y^2 + \frac\alpha2\|u\|_U^2,

(29)

the first-order system reads

Ay+d(y)-Bu-f=0,

(30)

A^*p+d'(y)^*p+y-y_d=0,

(31)

\alpha u-B^*p=0.

(32)

The nonlinear residual can be written compactly as

F(y,u,p) = \begin{pmatrix} Ay+d(y)-Bu-f\\ A^*p+d'(y)^*p+y-y_d\\ \alpha u-B^*p \end{pmatrix} =0.

(33)

A standard full-space Newton method would solve

F'(y_k,u_k,p_k) \begin{pmatrix} \delta y\\ \delta u\\ \delta p \end{pmatrix} = -F(y_k,u_k,p_k).

(34)

This Newton viewpoint is useful, but it hides the specific optimal-control idea behind SQH. The method comes from the Pontryagin principle and from successive pointwise Hamiltonian optimization. We now build that path in steps.
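Before leaving the Newton viewpoint, the iteration (34) can be sketched on a scalar analogue of (33). The sketch assumes $A=a$, $B=1$, $d(y)=y^3$ and made-up data `a`, `f`, `y_d`, `alpha`; the helper `solve3` is a small Gaussian elimination written for this illustration, and no globalization is applied.

```python
def residual(y, u, p, a, f, y_d, alpha):
    """Scalar analogue of F(y,u,p): state, adjoint, and control equations."""
    return [a * y + y**3 - u - f,
            (a + 3.0 * y**2) * p + y - y_d,
            alpha * u - p]


def jacobian(y, u, p, a, alpha):
    """Derivative of the residual with respect to (y, u, p)."""
    c = a + 3.0 * y**2
    return [[c, -1.0, 0.0],
            [6.0 * y * p + 1.0, 0.0, c],
            [0.0, alpha, -1.0]]


def solve3(mat, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(mat, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def newton(a=1.0, f=0.0, y_d=1.0, alpha=0.1, tol=1e-12, max_iter=20):
    """Undamped full-space Newton iteration started from (0, 0, 0)."""
    y = u = p = 0.0
    for _ in range(max_iter):
        res = residual(y, u, p, a, f, y_d, alpha)
        if max(abs(r) for r in res) < tol:
            break
        dy, du, dp = solve3(jacobian(y, u, p, a, alpha), [-r for r in res])
        y, u, p = y + dy, u + du, p + dp
    return y, u, p


y, u, p = newton()
print(y, u, p)
```

All three unknowns are updated simultaneously from one linear solve, which is exactly the feature that SQH replaces by separate state, adjoint, and pointwise control steps.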

Pontryagin Maximum Principle

We write the controlled PDE in the abstract form

\mathcal E(y,u)=0,

(35)

and assume that for every admissible control $u\in U_{ad}$ there is a unique state

y_u=S(u).

(36)

The reduced cost is

j(u):=J(y_u,u).

(37)
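The reduced cost and its adjoint-based derivative can be illustrated on a scalar analogue. The sketch assumes the state equation $ay+y^3=u+f$ (so $d(y)=y^3$ and $B=1$), the tracking cost (29), and the adjoint sign convention of (31), for which the reduced derivative is $j'(u)=\alpha u-p$. All function names and data values are illustrative, and the result is compared with a central finite difference.

```python
def solve_state(u, a=1.0, f=0.0, tol=1e-14, max_iter=100):
    """Control-to-state map S(u): Newton's method for a*y + y**3 = u + f."""
    y = 0.0
    for _ in range(max_iter):
        g = a * y + y**3 - u - f
        if abs(g) < tol:
            break
        y -= g / (a + 3.0 * y**2)
    return y


def reduced_cost(u, y_d=1.0, alpha=0.1):
    """j(u) = J(S(u), u) for the scalar tracking cost."""
    y = solve_state(u)
    return 0.5 * (y - y_d) ** 2 + 0.5 * alpha * u**2


def reduced_gradient(u, a=1.0, y_d=1.0, alpha=0.1):
    """One state solve, one scalar adjoint solve, then alpha*u - p."""
    y = solve_state(u, a=a)
    p = -(y - y_d) / (a + 3.0 * y**2)   # adjoint: (a + d'(y)) p + y - y_d = 0
    return alpha * u - p


u = 0.5
h = 1e-6
fd = (reduced_cost(u + h) - reduced_cost(u - h)) / (2.0 * h)
print(reduced_gradient(u), fd)
```

The adjoint route needs one extra linear solve regardless of the dimension of the control, whereas finite differences would need one nonlinear state solve per control degree of freedom.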

To connect with the SQH algorithm, it is useful to write the PDE locally as

e(z,y(z),u(z))=0,

(38)

where $z$ denotes the independent variable. Depending on the model, $z$ may be:

a time variable $t$;
a space variable $x$;
a space-time variable $(x,t)$.

This notation suppresses derivatives and weak-form terms. For example, in an elliptic PDE the symbol $e$ contains the differential operator acting on $y$, not only the point value of $y$.

The Hamiltonian density is written as

H(z,y,v,u,p),

(39)

where:

$y$ is the state at the current point;
$v$ is the trial control value;
$u$ is the reference or current control value, included because the regularized SQH Hamiltonian below depends on it;
$p$ is the adjoint variable.

For the unregularized Pontryagin principle the dependence on the reference control is absent, and we simply write

H(z,y,v,p).

(40)

The Pontryagin Maximum Principle states that, if $\bar u$ is optimal and $\bar y=y_{\bar u}$ is the corresponding state, then there exists an adjoint state $\bar p$ such that:

$\bar y$ solves the state equation with control $\bar u$;
$\bar p$ solves the adjoint equation associated with $(\bar y,\bar u)$;
the optimal control satisfies the pointwise Hamiltonian condition

H(z,\bar y(z),\bar u(z),\bar p(z)) = \max_{v\in K_{ad}(z)} H(z,\bar y(z),v,\bar p(z))

(41)

for almost every $z$.

Here $K_{ad}(z)$ is the set of pointwise admissible values. For example, for box constraints,

K_{ad}(z)=[u_a(z),u_b(z)].

(42)

The sign convention must be fixed consistently, but the structure of the condition does not depend on it. With the opposite sign convention for the Hamiltonian and the adjoint equation, the maximum condition becomes an equivalent minimum condition; this is the form (24) obtained earlier from the density $H=\ell+p\,e$. The algorithmic point is the same: after the state and adjoint are known, the control update is obtained from a local optimization problem in the variable $v$.

Rozonoer Estimate

The reason the Pontryagin condition is algorithmically useful is that Hamiltonian improvement implies cost improvement, up to higher-order terms. This idea goes back to Rozonoer’s analysis of successive approximation methods for optimal control.

Let $u$ and $v$ be two admissible controls, with corresponding states

y_u=S(u), \qquad y_v=S(v).

(43)

Let $p_u$ be the adjoint associated with $(y_u,u)$. Under the usual smoothness and stability assumptions, one can estimate the cost difference as

J(y_v,v)-J(y_u,u) \le - \int_Z \left[ H(z,y_u(z),v(z),p_u(z)) - H(z,y_u(z),u(z),p_u(z)) \right]\,dz + C\|v-u\|_U^2.

(44)

The domain $Z$ is the variable domain of the control: it may be a time interval, a spatial domain, or a space-time cylinder.

The estimate should be read as follows:

the leading term is the Hamiltonian gain obtained by replacing $u$ with $v$ while freezing the state and adjoint at $(y_u,p_u)$;
the last term is the price paid for the fact that the true state changes from $y_u$ to $y_v$;
for small control changes, the quadratic remainder is dominated by the Hamiltonian gain.

Thus, if $v$ increases the Hamiltonian enough pointwise, then the total cost decreases.

This is the key estimate behind successive approximation schemes. It turns a global optimal-control problem into repeated local Hamiltonian optimizations, followed by a global state solve.
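For a linear state equation the estimate becomes an identity, which makes it easy to check numerically. The sketch assumes a scalar model $ay=u+f$, the quadratic tracking cost, the adjoint $ap+y-y_d=0$, and the maximum-convention Hamiltonian $H(y,v,p)=p\,(v+f-ay)-\tfrac12(y-y_d)^2-\tfrac\alpha2v^2$. Under these assumptions, which are made only for illustration, the remainder is exactly $(v-u)^2/(2a^2)$.

```python
def state(u, a, f):
    """Solve the toy state equation a*y = u + f."""
    return (u + f) / a


def adjoint(y, a, y_d):
    """Solve the toy adjoint equation a*p + y - y_d = 0."""
    return -(y - y_d) / a


def cost(y, u, y_d, alpha):
    return 0.5 * (y - y_d) ** 2 + 0.5 * alpha * u**2


def hamiltonian(y, v, p, a, f, y_d, alpha):
    """Maximum-convention Hamiltonian for the toy model."""
    return p * (v + f - a * y) - 0.5 * (y - y_d) ** 2 - 0.5 * alpha * v**2


a, f, y_d, alpha = 2.0, 0.3, 1.0, 0.1     # made-up data
u, v = 0.4, 1.1                           # old and new control

y_u, y_v = state(u, a, f), state(v, a, f)
p_u = adjoint(y_u, a, y_d)

cost_change = cost(y_v, v, y_d, alpha) - cost(y_u, u, y_d, alpha)
gain = (hamiltonian(y_u, v, p_u, a, f, y_d, alpha)
        - hamiltonian(y_u, u, p_u, a, f, y_d, alpha))
remainder = (v - u) ** 2 / (2.0 * a**2)

print(cost_change, -gain + remainder)
```

Here the Hamiltonian gain is computed with the state and adjoint frozen at $(y_u,p_u)$, and the remainder accounts for the state moving from $y_u$ to $y_v$.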

Successive Approximation Schemes

A basic successive approximation scheme starts from a control $u^0$ and then repeats the following operations.

Given $u^k$:

1. solve the state equation to obtain $y^k=y_{u^k}$;
2. solve the adjoint equation associated with $(y^k,u^k)$ to obtain $p^k$;
3. compute a new control by pointwise Hamiltonian maximization,

u^{k+1}(z) \in \operatorname*{arg\,max}_{w\in K_{ad}(z)} H(z,y^k(z),w,p^k(z));

(45)

4. solve the state equation again with control $u^{k+1}$.

The central feature is Step 3: it is not a PDE solve but a pointwise optimization problem. In many important cases it can be computed explicitly:

for box constraints, by checking endpoints or projecting a stationary point;
for finite-valued controls, by comparing finitely many Hamiltonian values;
for quadratic control costs, by a local scalar or vector quadratic optimization.

This explains why the PMP is attractive computationally. The expensive operations are the state and adjoint solves; the control update is local.
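As an illustration of the local step for a finite-valued control set, the sketch below compares finitely many Hamiltonian values at each grid point. It assumes that the control-dependent part of the Hamiltonian is $p\,w-\tfrac\alpha2w^2$ (quadratic control cost, control entering the state equation additively, maximum convention). The admissible values and the adjoint samples are made-up data, and the function names are hypothetical.

```python
def control_part(w, p, alpha):
    """Control-dependent part of the Hamiltonian at one point (maximum convention)."""
    return p * w - 0.5 * alpha * w**2


def pointwise_argmax(p_values, admissible, alpha):
    """For each adjoint sample, pick the admissible value with the largest Hamiltonian."""
    return [max(admissible, key=lambda w: control_part(w, p, alpha))
            for p in p_values]


p_values = [-0.8, -0.2, 0.05, 0.3, 0.9]   # hypothetical adjoint values on a grid
admissible = [-1.0, 0.0, 1.0]             # finite-valued control set
new_control = pointwise_argmax(p_values, admissible, alpha=0.5)
print(new_control)
```

No projection or derivative is needed here: the update only compares function values, which is why nonconvex pointwise sets pose no extra difficulty for this step.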

The weakness of the basic scheme is that the pointwise maximizer may be too aggressive. It can produce a control far from $u^k$, so the quadratic remainder in the Rozonoer estimate may dominate the Hamiltonian gain. In that case the cost may fail to decrease.
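This failure mode can be reproduced on a scalar toy problem. The sketch assumes the linear model $y=u$, the cost $\tfrac12(y-y_d)^2+\tfrac\alpha2u^2$ with $y_d=1$ and $\alpha=0.1$, the box $[-2,2]$, and the adjoint $p=-(y-y_d)$, so that the unregularized pointwise maximizer is the projection of $p/\alpha$ onto the box. The data are chosen only to make the effect visible.

```python
def cost(u, y_d=1.0, alpha=0.1):
    y = u                      # toy state equation y = u
    return 0.5 * (y - y_d) ** 2 + 0.5 * alpha * u**2


def basic_update(u, y_d=1.0, alpha=0.1, lower=-2.0, upper=2.0):
    y = u                      # state solve
    p = -(y - y_d)             # adjoint solve
    return min(upper, max(lower, p / alpha))   # pointwise Hamiltonian maximizer


controls = [0.0]
for _ in range(4):
    controls.append(basic_update(controls[-1]))
costs = [cost(u) for u in controls]
print(controls)
print(costs)
```

The iterates jump between the two bounds and the very first step already increases the cost, although the minimizer of this toy problem, $u=1/(1+\alpha)$, lies well inside the box.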

Robust Successive Approximation

To stabilize the method, one modifies the Hamiltonian by penalizing large changes in the control. Given a parameter $\varepsilon>0$, define

H_\varepsilon(z,y,w,u,p) := H(z,y,w,p) - \varepsilon |w-u|^2.

(46)

The reference value $u$ is the current control. At iteration $k$, the local control update becomes

u^{k+1}(z) \in \operatorname*{arg\,max}_{w\in K_{ad}(z)} H_\varepsilon(z,y^k(z),w,u^k(z),p^k(z)).

(47)

The penalty term has two effects:

it keeps the new control close to the old one;
for sufficiently large $\varepsilon$ it makes the local maximization problem strongly concave, even when the original Hamiltonian is not concave in the control variable (provided its curvature in $w$ is bounded).

If $\varepsilon$ is too small, the method may still be unstable. If $\varepsilon$ is too large, the update is very small and convergence becomes slow. Therefore $\varepsilon$ is adapted during the iteration.
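For a quadratic control cost the regularized local problem has a closed-form solution. Assume, as an illustrative special case and not as the general setting, that the control-dependent part of the Hamiltonian is $p\,w-\tfrac\alpha2w^2$ and that $K_{ad}(z)=[u_a(z),u_b(z)]$. Then $H_\varepsilon$ is a strictly concave quadratic in $w$, its stationary point solves

p-\alpha w-2\varepsilon\,(w-u)=0,

and the constrained maximizer is the projection of that point onto the box:

w^\ast(z)=\min\Bigl\{u_b(z),\,\max\Bigl\{u_a(z),\,\frac{p(z)+2\varepsilon\,u(z)}{\alpha+2\varepsilon}\Bigr\}\Bigr\}.

As $\varepsilon\to0$ this recovers the unregularized update, the projection of $p/\alpha$. As $\varepsilon\to\infty$ it tends to the current control $u(z)$, assuming $u(z)$ is itself admissible. The formula makes the trade-off explicit: $\varepsilon$ interpolates between the aggressive Pontryagin update and no update at all.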

The descent test is based on the actual cost decrease. Let

\tau_k:=\|u^{k+1}-u^k\|_{L^2(Z)}^2.

(48)

For a prescribed $\eta>0$, accept the new control if

J(y^{k+1},u^{k+1})-J(y^k,u^k) \le - \eta \tau_k.

(49)

If the test fails, increase $\varepsilon$ and solve the pointwise optimization problem again. If the test succeeds, decrease $\varepsilon$ so that the next iteration can try a less conservative update.

This is the robust successive approximation mechanism that leads directly to the SQH algorithm.

The Sequential Quadratic Hamiltonian Algorithm

The Sequential Quadratic Hamiltonian method uses the robust Hamiltonian

H_\varepsilon(z,y,w,u,p) = H(z,y,w,p)-\varepsilon |w-u|^2

(50)

inside a successive approximation loop.

The word “quadratic” refers to the stabilizing quadratic term in the control increment. The algorithm is still driven by the Pontryagin condition: at each iteration the new control is obtained by solving a pointwise Hamiltonian optimization problem.

Choose:

an initial control $u^0$;
a maximum number of iterations $k_{\max}$;
a stopping tolerance $\kappa>0$;
parameters $\varepsilon>0$, $\sigma>1$, $\eta>0$, and $\zeta\in(0,1)$.

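Initialize the iteration counter and choose any starting value of $\tau$ above the tolerance, so that the loop below is entered: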

\tau>\kappa, \qquad k=0,

(51)

and compute the initial state $y^0$ from the governing model with control $u^0$.

While

k<k_{\max} \qquad\text{and}\qquad \tau>\kappa,

(52)

perform the following steps.

1. Compute the adjoint $p^k$ associated with the current pair $(y^k,u^k)$.
2. Determine the trial control $u^{k+1}$ by solving the pointwise optimization problem

H_\varepsilon \left( z,y^k(z),u^{k+1}(z),u^k(z),p^k(z) \right) = \max_{w\in K_{ad}(z)} H_\varepsilon \left( z,y^k(z),w,u^k(z),p^k(z) \right)

(53)

for almost every $z\in Z$.

This is the defining local step of SQH. The variable $z$ may be $t$, $x$, or $(x,t)$, depending on the control problem.

3. Compute the trial state $y^{k+1}$ by solving the governing model with control $u^{k+1}$.
4. Compute the update size

\tau:=\|u^{k+1}-u^k\|_{L^2(Z)}^2.

(54)

5. Check the actual cost decrease. If

J(y^{k+1},u^{k+1})-J(y^k,u^k) > - \eta\tau,

(55)

then the decrease is not sufficient. Discard the trial pair, increase the regularization parameter,

\varepsilon:=\sigma\varepsilon,

(56)

and return to Step 2 with the same $(y^k,u^k,p^k)$.

If instead

J(y^{k+1},u^{k+1})-J(y^k,u^k) \le - \eta\tau,

(57)

then accept the update, decrease the regularization parameter,

\varepsilon:=\zeta\varepsilon,

(58)

and continue with Step 6.

6. Advance the iteration counter:

k:=k+1.

(59)

The loop stops when the control update is smaller than the tolerance or when the maximum number of iterations is reached.
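The loop can be sketched for a one-dimensional linear-quadratic model problem. Everything specific here is an assumption made for illustration: the state equation $-y''=u+f$ on $(0,1)$ with homogeneous Dirichlet conditions, a finite-difference discretization solved by the Thomas algorithm, the tracking cost (4), box constraints, and the closed-form pointwise maximizer $(p+2\varepsilon u)/(\alpha+2\varepsilon)$ projected onto the box, which is valid for this quadratic control cost with the adjoint convention $-p''=-(y-y_d)$. Function names and parameter values are hypothetical. The sketch also tests $\tau$ after rejected trials and adds a `max_trials` cap; both are safeguards that are not part of the algorithm as stated.

```python
import math


def solve_poisson(rhs, h):
    """Solve -y'' = rhs on interior nodes, y(0)=y(1)=0, by the Thomas algorithm."""
    n = len(rhs)
    diag, off = 2.0 / h**2, -1.0 / h**2
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = off / diag, rhs[0] / diag
    for i in range(1, n):
        m = diag - off * c[i - 1]
        c[i] = off / m
        d[i] = (rhs[i] - off * d[i - 1]) / m
    y = [0.0] * n
    y[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        y[i] = d[i] - c[i] * y[i + 1]
    return y


def cost(y, u, y_d, alpha, h):
    """Discrete version of J = 1/2 ||y - y_d||^2 + alpha/2 ||u||^2 in L2(0,1)."""
    track = sum((a - b) ** 2 for a, b in zip(y, y_d))
    return 0.5 * h * track + 0.5 * alpha * h * sum(v * v for v in u)


def sqh(y_d, f, alpha, u_a, u_b, h, eps=1.0, sigma=2.0, zeta=0.8,
        eta=1e-6, kappa=1e-10, k_max=200, max_trials=2000):
    n = len(y_d)
    u = [0.0] * n                                             # initial control u^0
    y = solve_poisson([a + b for a, b in zip(u, f)], h)       # initial state y^0
    p = solve_poisson([-(a - b) for a, b in zip(y, y_d)], h)  # adjoint p^0
    costs = [cost(y, u, y_d, alpha, h)]
    k, trials, tau = 0, 0, 2.0 * kappa
    while k < k_max and tau > kappa and trials < max_trials:
        trials += 1
        # Pointwise maximizer of the regularized Hamiltonian, projected onto the box.
        u_new = [min(u_b, max(u_a, (pi + 2.0 * eps * ui) / (alpha + 2.0 * eps)))
                 for pi, ui in zip(p, u)]
        y_new = solve_poisson([a + b for a, b in zip(u_new, f)], h)
        tau = h * sum((a - b) ** 2 for a, b in zip(u_new, u))
        j_new = cost(y_new, u_new, y_d, alpha, h)
        if j_new - costs[-1] > -eta * tau:
            eps *= sigma      # insufficient decrease: reject, be more conservative
        else:
            eps *= zeta       # accept, allow a less conservative update next time
            u, y = u_new, y_new
            p = solve_poisson([-(a - b) for a, b in zip(y, y_d)], h)
            costs.append(j_new)
            k += 1
    return u, y, costs


n = 49
h = 1.0 / (n + 1)
x = [(i + 1) * h for i in range(n)]
y_d = [math.sin(math.pi * xi) for xi in x]
u, y, costs = sqh(y_d, [0.0] * n, alpha=0.01, u_a=0.0, u_b=3.0, h=h)
print(len(costs) - 1, costs[0], costs[-1], max(u))
```

The adjoint is recomputed only after an accepted step, so a rejected trial costs one pointwise update and one state solve. By construction the recorded costs are nonincreasing and every iterate stays inside the box.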

Convergence Theorem

The SQH method is a successive approximation scheme with an augmented Hamiltonian. The role of the adaptive parameter $\varepsilon$ is to guarantee a sufficient decrease of the cost functional. In Step 2, an exact pointwise maximization is convenient, but the convergence mechanism only needs a sufficient, possibly partial, Hamiltonian improvement.

The following theorem records the key descent estimate. We state it without proof.

Theorem. Let

(y^k,u^k) \qquad\text{and}\qquad (y^{k+1},u^{k+1})

(60)

be generated by the SQH algorithm, and assume that $u^k$ and $u^{k+1}$ are measurable. Under appropriate smoothness, boundedness, and stability assumptions on the state equation and on the running cost $\ell$, there exists a constant

\theta>0

(61)

independent of $\varepsilon>0$ such that, for the value of $\varepsilon$ currently chosen by the SQH algorithm,

J(y^{k+1},u^{k+1}) - J(y^k,u^k) \le - (\varepsilon-\theta) \|u^{k+1}-u^k\|_{L^2(Z)}^2.

(62)

In particular, if

\varepsilon\ge \theta+\eta,

(63)

and

\tau=\|u^{k+1}-u^k\|_{L^2(Z)}^2,

(64)

then

J(y^{k+1},u^{k+1}) - J(y^k,u^k) \le - \eta\tau.

(65)

This is exactly the sufficient-decrease condition used in Step 5 of the algorithm. Therefore, if a trial update does not decrease the cost enough, increasing $\varepsilon$ eventually makes the Hamiltonian update conservative enough to satisfy the acceptance test.
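Although the theorem is stated without proof, the constant $\theta$ can be identified by hand in the linear-quadratic special case. Assume, for this illustration only, the state equation $Ay=Bu+f$ with $A$ invertible, the tracking cost (29) with $U=L^2(Z)$, the adjoint equation $A^*p_u+y_u-y_d=0$ (formally identifying spaces with their duals as in (31)), and a maximum-convention Hamiltonian whose control-dependent part is $\langle B^*p,w\rangle-\tfrac\alpha2|w|^2$. Expanding the quadratic cost and using $y_v-y_u=A^{-1}B(v-u)$ gives the exact identity

J(y_v,v)-J(y_u,u) = -\int_Z \bigl[ H(z,y_u,v,p_u)-H(z,y_u,u,p_u) \bigr]\,dz + \frac12\|A^{-1}B(v-u)\|_Y^2.

Since $u^{k+1}$ maximizes $H_\varepsilon$ pointwise and $w=u^k$ is admissible, one has $H(z,y^k,u^{k+1},p^k)-H(z,y^k,u^k,p^k)\ge\varepsilon|u^{k+1}-u^k|^2$ for almost every $z$, hence

J(y^{k+1},u^{k+1})-J(y^k,u^k) \le -\Bigl(\varepsilon-\tfrac12\|A^{-1}B\|_{\mathcal L(U,Y)}^2\Bigr)\|u^{k+1}-u^k\|_{L^2(Z)}^2.

In this special case one may therefore take $\theta=\tfrac12\|A^{-1}B\|_{\mathcal L(U,Y)}^2$, which depends on the state equation but not on $\varepsilon$.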

The algorithm can therefore be summarized in one sentence:

SQH alternates global state-adjoint solves with a pointwise maximization of a quadratically regularized Hamiltonian.

This is the essential distinction from a generic Newton method. Newton linearizes the full KKT system; SQH uses the Pontryagin structure to turn the control step into a local Hamiltonian optimization problem, while the PDE coupling remains in the state and adjoint equations.

Course Summary: Algorithmic Map

We close the course with a compact map of the main algorithmic families we have seen. The table is deliberately practical: it compares what is solved at each iteration, the dominant computational cost, and the kind of admissible sets each method naturally handles.

In the table, a “state solve” or “adjoint solve” means one PDE solve: one elliptic solve in stationary problems, or one full forward or backward time march in parabolic problems.

Method	Formulation	Typical cost per iteration	Advantages	Limitations	Admissible sets and nonsmoothness
Reduced gradient descent	Reduced: optimize $j(u)=J(S(u),u)$	One state solve + one adjoint solve; line search adds extra state solves	Simple, robust, low memory, easy to implement	Often slow; sensitive to scaling; first-order only	Best for smooth convex $U_{ad}=U$ ; can handle simple constraints only through projection or penalties
Armijo / Wolfe line search	Globalization layer for reduced methods	Several trial cost evaluations, hence extra state solves	Gives reliable decrease; stabilizes gradient, CG, BFGS	Can dominate cost if each trial requires a PDE solve	Works for smooth problems; projected variants needed for constrained sets
Nonlinear conjugate gradient	Reduced	One gradient evaluation per accepted step, plus line search	Better than steepest descent with little extra memory	Less robust on strongly nonlinear or nonsmooth problems	Mostly smooth $U_{ad}=U$ or simple projected variants
BFGS / L-BFGS	Reduced quasi-Newton	One gradient evaluation + line search; stores curvature pairs	Often much faster than gradient descent; no exact Hessian	Needs smoothness and good line search; curvature updates can fail near nonsmooth active sets	Smooth problems; L-BFGS-B-style variants for boxes, but nonsmooth terms need splitting
Reduced Newton / trust region	Reduced second-order	Hessian or Hessian-vector products; each product may require incremental state/adjoint solves	Fast local convergence; good for nonlinear smooth problems	More complex; expensive linear algebra; needs globalization	Smooth nonconvex problems if Hessian model and globalization are adequate; constraints require SQP/TR machinery
All-at-once KKT solve	Simultaneous state-adjoint-control system	One large saddle-point linear solve for linear-quadratic problems	Solves linear-quadratic unconstrained problems in one shot; exposes block structure	Indefinite systems; preconditioning is essential	Natural for equality-constrained smooth problems; inequalities need complementarity or active sets
Projected gradient	Reduced constrained	One state + one adjoint + pointwise projection; line search may add state solves	Very simple for box constraints; active set visible through saturation	First-order convergence; step-size dependent	Excellent for closed convex simple sets such as boxes; not suitable for nonconvex $U_{ad}$ without modifications
Primal-dual active set (PDAS)	KKT/complementarity	Active-set prediction + constrained KKT solve per iteration	Often finite or very fast active-set convergence; natural for box constraints	Requires good active-set logic and linear solvers; can oscillate without safeguards	Very good for convex box constraints and complementarity systems; equivalent to semismooth Newton in many cases
Semismooth Newton	Nonsmooth equation / generalized derivative	Linearized generalized KKT solve per iteration	Superlinear local convergence for structured nonsmoothness	More technical; needs semismooth reformulation and active-set identification	Excellent for projections, max/min, $L^1$ terms, complementarity; usually assumes convex structure
Subgradient descent	Reduced nonsmooth	One state + one adjoint + choice of subgradient	Most elementary nonsmooth method; conceptually robust	Very slow; difficult step-size tuning; subgradient may be nonunique	Handles convex nonsmooth functionals such as $L^1$ , but rarely the best computational choice
Proximal gradient / forward-backward splitting	Reduced composite $g(u)+\psi(u)$	One gradient evaluation for $g$ + local proximal map for $\psi$	Exploits exact nonsmooth structure; cheap local updates; natural for sparsity	Still first-order; needs proximal map and step-size control	Excellent for convex nonsmooth terms such as $L^1$ and boxes when prox/projection is explicit
Sparse-control PDAS	Slack-variable or multiplier KKT	Active-set update + linear solve; often few iterations near solution	Captures sparsity pattern sharply; much faster than subgradient methods	More implementation effort; active sets can be delicate	Strong for convex $L^1$ -type sparsity and box constraints; relies on complementarity structure
One-shot Newton KKT for nonlinear/inverse problems	Simultaneous nonlinear KKT	Assemble residual/Jacobian + large Newton correction; line search/damping may add residual evaluations	Treats state, adjoint, and parameter together; powerful for inverse problems	Nonlinear saddle-point systems; preconditioning and globalization are hard	Smooth nonconvex problems possible, but only local; boxes require PDAS or projection safeguards
SQP	Full-space or reduced constrained optimization	Quadratic subproblem with linearized constraints; cost depends on subproblem solve	General framework for smooth constrained nonlinear problems	Heavy machinery; globalization and Hessian approximation matter	Handles smooth convex constraints well; nonconvex constraints possible but only with local guarantees
SQH	Pontryagin / Hamiltonian successive approximation	One adjoint solve + pointwise Hamiltonian maximization + one state solve; repeated local solve if $\varepsilon$ changes	Uses PMP structure; control update is local; adaptive $\varepsilon$ guarantees sufficient decrease	Depends on Hamiltonian structure; still local for nonconvex problems; theory needs smoothness/stability assumptions	Very flexible for pointwise admissible sets $K_{ad}(z)$ , including nonconvex or finite-valued sets, because Step 2 is a local optimization

The main dividing lines are these:

Reduced vs all-at-once. Reduced methods make every iteration look like an optimization step in the control space, but every gradient hides state and adjoint solves. All-at-once methods expose the KKT structure directly and shift the difficulty to saddle-point linear algebra.
First-order vs second-order. First-order methods are robust and cheap per iteration; second-order methods are expensive per iteration but can be much faster near the solution.
Smooth vs nonsmooth. If nonsmoothness is simple and convex, proximal and active-set methods exploit it directly. If nonsmoothness comes from complementarity, semismooth Newton and PDAS are the natural tools.
Convex vs nonconvex admissible sets. Projection and proximal methods are cleanest for convex sets. SQH is unusual because its control step is a pointwise optimization over $K_{ad}(z)$, so finite-valued or nonconvex pointwise admissible sets can be handled locally, although only local convergence and descent guarantees should be expected for the full problem.

As a rule of thumb:

start with a reduced gradient or projected-gradient method when building a new code;
move to proximal methods when the objective contains a known convex nonsmooth term;
use PDAS or semismooth Newton when the active set is the main structure;
use Newton, SQP, or one-shot KKT methods when second-order information is worth the linear algebra cost;
use SQH when the Pontryagin Hamiltonian gives a cheap and meaningful pointwise control update.