Nonsmooth PDE-Constrained Optimization

In the previous lectures we focused mainly on smooth PDE-constrained optimization, where the reduced functional is differentiable and the optimality system can be handled with standard Newton or SQP techniques.

Many relevant applications, however, are inherently nonsmooth. The nonsmoothness may enter through the cost functional, through inequality constraints, or through the state equation itself.

Typical examples include:

sparse optimal control with an $L^1$ penalty;
pointwise state constraints;
obstacle-type systems and variational inequalities;
complementarity systems and active-set formulations.

The goal of this lecture is to introduce the mathematical structure and the numerical treatment of these nonsmooth PDE-constrained optimization problems.

We focus on three model classes:

sparse optimal control with $L^1$ regularization;
pointwise state constraints;
variational inequality constraints.

The presentation follows Chapter 6 of De los Reyes and connects naturally with the KKT, PDAS, and semismooth Newton ideas developed in the previous lectures.

Throughout the lecture, let $\Omega\subset\mathbb R^d$ be a bounded Lipschitz polyhedral domain.

Sources of Nonsmoothness¶

There are three main sources of nonsmoothness in PDE-constrained optimization.

Functional nonsmoothness¶

The objective functional may contain nonsmooth terms.

Typical examples are

\|u\|_{L^1(\Omega)}

(1)

for sparse control, or

\|y-y_d\|_{L^1(\Omega)}

(2)

for robust data fitting.

Constraint nonsmoothness¶

The admissible set may contain inequality constraints such as

u_a\le u\le u_b

(3)

y\le y_b.

(4)

These constraints generate complementarity systems.

Structural nonsmoothness¶

The PDE itself may be nonsmooth.

Examples include:

obstacle problems;
contact mechanics;
friction laws;
variational inequalities;
upwind discretizations.

Why Nonsmooth Optimization?¶

In classical linear-quadratic control we minimize

J(y,u) = \frac12\|y-y_d\|_{L^2(\Omega)}^2 + \frac\alpha2\|u\|_{L^2(\Omega)}^2

(5)

subject to

Ay=u.

(6)

The quadratic control term is smooth and strictly convex.

As a consequence:

the reduced functional is differentiable;
the optimality system is smooth;
Newton methods are natural.

However, many applications require structural properties that are not promoted by an $L^2$ penalty.

For instance:

controls acting only on small regions;
bang-bang controls;
switching devices;
actuators that can be turned on and off.

A standard mechanism for promoting sparsity is to replace the quadratic control penalty by an $L^1$ term, or to add such a term to it:

\beta \|u\|_{L^1(\Omega)}.

(7)

The $L^1$ norm is convex but not differentiable.

Already for the scalar absolute value, the subdifferential is set-valued at the origin:

\partial |x| = \begin{cases} \{1\}, & x>0,\\ [-1,1], & x=0,\\ \{-1\}, & x<0. \end{cases}

(8)

The loss of smoothness fundamentally changes:

the optimality system;
the interpretation of multipliers;
the numerical algorithms.

Another source of nonsmoothness comes from variational inequalities.

Obstacle problems naturally lead to complementarity systems of the form

y\ge \psi, \qquad Ay-f\ge 0, \qquad (y-\psi)(Ay-f)=0.

(9)

The active set changes discontinuously.

Consequently the solution operator is typically not Fréchet differentiable.

Nevertheless, these problems still possess strong structure that can be exploited numerically.

Sparse Optimal Control¶

Problem formulation¶

Consider the elliptic control problem

\min_{(y,u)} J(y,u) := \frac12\|y-y_d\|_{L^2(\Omega)}^2 + \frac\alpha2\|u\|_{L^2(\Omega)}^2 + \beta\|u\|_{L^1(\Omega)}

(10)

subject to

\begin{cases} -\Delta y = u+f & \text{in }\Omega,\\ y=0 & \text{on }\partial\Omega. \end{cases}

(11)

Here:

$\alpha>0$ is the quadratic regularization parameter;
$\beta>0$ controls sparsity.

The admissible set may additionally include box constraints:

U_{ad} = \{u\in L^2(\Omega): u_a\le u\le u_b\}.

(12)

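The state equation defines a continuous affine control-to-state map, which is linear when $f=0$; the $H^2$ regularity stated below additionally requires $\Omega$ to be convex.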

S:L^2(\Omega)\to H_0^1(\Omega)\cap H^2(\Omega), \qquad y=S(u).

(13)

The reduced functional becomes

f(u) = \frac12\|S(u)-y_d\|_{L^2(\Omega)}^2 + \frac\alpha2\|u\|_{L^2(\Omega)}^2 + \beta\|u\|_{L^1(\Omega)}.

(14)

The nonsmooth term is

\|u\|_{L^1(\Omega)} = \int_\Omega |u(x)|\,dx.

(15)

Existence of solutions¶

The existence proof follows the direct method of the calculus of variations.

The functional is:

proper;
convex;
coercive on $L^2(\Omega)$;
weakly lower semicontinuous.

Indeed,

f(u) \ge \frac\alpha2\|u\|_{L^2(\Omega)}^2.

(16)

Thus every minimizing sequence is bounded in $L^2(\Omega)$.

By weak compactness of bounded sets in $L^2(\Omega)$, a subsequence (not relabeled) satisfies

u_n \rightharpoonup \bar u \qquad \text{in }L^2(\Omega).

(17)

Since the control-to-state map is affine and continuous, it is also weakly continuous, so

S(u_n)\rightharpoonup S(\bar u).

(18)

Convexity and lower semicontinuity of the $L^1$ norm imply

\|\bar u\|_{L^1} \le \liminf_{n\to\infty} \|u_n\|_{L^1}.

(19)

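Together with the weak lower semicontinuity of the two quadratic terms, this gives $f(\bar u)\le \liminf_{n\to\infty} f(u_n)=\inf f$. Hence $\bar u$ is optimal.

Since $\alpha>0$, the quadratic control term is strictly convex, so $f$ is strictly convex and the minimizer is unique.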

Subdifferentials¶

The classical derivative is no longer sufficient and we replace it with the notion of subdifferential, which we recall here:

Let $X$ be a Banach space and let

\Phi:X\to \mathbb R

(20)

be convex.

The subdifferential of $\Phi$ at $u$ is

\partial \Phi(u) = \left\{ \lambda\in X^*: \Phi(v)\ge \Phi(u)+\langle \lambda,v-u\rangle \ \forall v\in X \right\}.

(21)

Elements of $\partial\Phi(u)$ are called subgradients.

If $\Phi$ is differentiable, then

\partial\Phi(u)=\{\Phi'(u)\}.

(22)

Thus the subdifferential generalizes the derivative.

Subdifferential of the $L^1$ norm¶

Define

\Phi(u)=\|u\|_{L^1(\Omega)}.

(23)

Then

\lambda\in \partial \Phi(u)

(24)

if and only if

\lambda(x) \in \begin{cases} \{1\}, & u(x)>0,\\ [-1,1], & u(x)=0,\\ \{-1\}, & u(x)<0, \end{cases}

(25)

almost everywhere.

Equivalently,

\lambda\in L^\infty(\Omega), \qquad \|\lambda\|_{L^\infty}\le 1,

(26)

and

\lambda(x)=\operatorname{sign}(u(x)) \quad \text{where }u(x)\ne 0.

(27)

The subdifferential is therefore set-valued exactly at points where $u=0$.

This is the mathematical mechanism that promotes sparsity.
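The pointwise characterization can be checked numerically for a discretized control. The following is a minimal Python sketch: the helper name `is_l1_subgradient`, the tolerance, and the use of nodal vectors in place of functions are assumptions made for illustration and are not part of the lecture.

```python
import numpy as np

def is_l1_subgradient(u, lam, tol=1e-12):
    """Check |lam| <= 1 everywhere and lam = sign(u) wherever u != 0."""
    u = np.asarray(u, dtype=float)
    lam = np.asarray(lam, dtype=float)
    bounded = np.all(np.abs(lam) <= 1.0 + tol)
    nonzero = u != 0.0
    matches_sign = np.all(np.abs(lam[nonzero] - np.sign(u[nonzero])) <= tol)
    return bool(bounded and matches_sign)

u = np.array([1.5, 0.0, -2.0, 0.0])
lam_ok = np.array([1.0, 0.3, -1.0, -1.0])   # any value in [-1, 1] is allowed where u = 0
lam_bad = np.array([1.0, 1.2, -1.0, 0.0])   # violates |lam| <= 1 at the second node

print(is_l1_subgradient(u, lam_ok), is_l1_subgradient(u, lam_bad))
```

The freedom in choosing `lam` on the zero set of `u` is exactly the set-valued part of the subdifferential.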

Optimality Conditions for Sparse Control¶

Let

\bar u

(28)

be an optimal control and

\bar y=S(\bar u).

(29)

The adjoint state satisfies

\begin{cases} -\Delta \bar p = \bar y-y_d & \text{in }\Omega,\\ \bar p=0 & \text{on }\partial\Omega. \end{cases}

(30)

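The smooth part $f_s$ of the reduced functional (tracking term plus quadratic control term) has the derivative below, where $p$ denotes the adjoint state associated with $u$: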

f_s'(u)v = (\alpha u+p,v)_{L^2(\Omega)}.

(31)

The optimality condition becomes

0 \in \alpha \bar u + \bar p + \beta\partial \|\bar u\|_{L^1}.

(32)

Hence there exists

\bar \lambda\in \partial\|\bar u\|_{L^1}

(33)

such that

\alpha \bar u + \bar p + \beta \bar\lambda =0.

(34)

Pointwise,

\bar u(x)=0

(35)

whenever

|\bar p(x)|\le \beta.

(36)

This is the key sparsity relation. Notice that $\alpha$ does not appear in this threshold condition: when $\bar u(x)=0$, the term $\alpha \bar u(x)$ vanishes, and the inclusion reduces to the requirement that $-\bar p(x)\in \beta[-1,1]$.

The adjoint variable therefore directly determines the region where the optimal control vanishes.

A Hierarchy of Numerical Methods¶

The optimality condition

0 \in \alpha \bar u + \bar p + \beta\partial \|\bar u\|_{L^1}

(37)

already suggests a hierarchy of numerical methods of increasing complexity.

At the most basic level, one can work directly with subgradients and obtain globally defined first-order iterations. A more structured strategy is to exploit the splitting between the smooth reduced functional and the nonsmooth $L^1$ term, leading to proximal algorithms. Finally, if one fully exploits the piecewise smooth structure of the optimality system, one arrives at semismooth Newton and primal-dual active set methods, which are more involved but locally much faster.

This gives the progression:

subgradient descent: simplest and most robust, but slowest;
proximal methods: still first-order, but much more effective for $L^1$ terms;
semismooth Newton / PDAS: most structured and locally most efficient.

Subgradient Descent¶

The simplest way to attack the reduced problem is to use the subdifferential directly.

Let

f(u)=g(u)+\beta\|u\|_{L^1(\Omega)},

(38)

where

g(u) = \frac12\|S(u)-y_d\|_{L^2(\Omega)}^2 + \frac\alpha2\|u\|_{L^2(\Omega)}^2.

(39)

Since $g$ is differentiable and $\|u\|_{L^1}$ is convex, a subgradient of $f$ at $u_k$ is

\xi_k = \nabla g(u_k)+\beta\lambda_k, \qquad \lambda_k\in \partial \|u_k\|_{L^1}.

(40)

The subgradient iteration reads

u_{k+1}=u_k-\tau_k \xi_k.

(41)

For the sparse control problem this becomes

u_{k+1} = u_k-\tau_k\bigl(\alpha u_k+p_k+\beta\lambda_k\bigr),

(42)

where $p_k$ is the adjoint state associated with $u_k$.

The method is easy to state and, for convex problems, converges globally under suitable diminishing step-size rules, but it has two important drawbacks:

the subgradient is not unique when $u_k$ vanishes on a set of positive measure;
the convergence is typically slow, since one only expects first-order behavior and usually sublinear rates.

For this reason, subgradient descent is conceptually important, but in sparse PDE-constrained optimization it is often outperformed by proximal methods.
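The iteration can be sketched on a one-dimensional finite-difference model. Everything specific below is an assumption made for illustration: the 1D domain $(0,1)$, the grid size, the data $y_d=\sin(\pi x)$, $f=0$, the parameter values, and the step rule $\tau_k=\tau_0/(k+1)$. Integrals are replaced by $h$-weighted sums, so that the discrete gradient of the smooth part is again $\alpha u+p$.

```python
import numpy as np

def reduced_cost(u, A, h, y_d, f, alpha, beta):
    """Discrete reduced functional f(u) with h-weighted sums replacing integrals."""
    y = np.linalg.solve(A, u + f)
    return h * (0.5 * np.sum((y - y_d) ** 2)
                + 0.5 * alpha * np.sum(u ** 2)
                + beta * np.sum(np.abs(u)))

def subgradient_descent(A, h, y_d, f, alpha, beta, steps=200, tau0=1.0):
    u = np.zeros_like(y_d)
    best_u, best_val = u.copy(), reduced_cost(u, A, h, y_d, f, alpha, beta)
    history = [best_val]
    for k in range(steps):
        y = np.linalg.solve(A, u + f)        # state equation
        p = np.linalg.solve(A, y - y_d)      # adjoint equation (A is symmetric)
        lam = np.sign(u)                     # one subgradient; picks 0 where u = 0
        xi = alpha * u + p + beta * lam
        u = u - tau0 / (k + 1) * xi          # diminishing step size
        val = reduced_cost(u, A, h, y_d, f, alpha, beta)
        if val < best_val:                   # not a descent method: track the best iterate
            best_u, best_val = u.copy(), val
        history.append(best_val)
    return best_u, history

n = 49
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)
A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2   # -d^2/dx^2, Dirichlet
u_best, history = subgradient_descent(A, h, np.sin(np.pi * x), np.zeros(n),
                                      alpha=0.1, beta=0.02)
print("initial cost:", history[0], "best cost:", history[-1])
```

Because a subgradient step need not decrease the functional, the sketch records the best value seen so far. The iterates are generally not exactly sparse, which is one practical reason to prefer the proximal methods discussed next.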

Proximal Point Method¶

Instead of following an arbitrary subgradient, one can regularize the nonsmooth problem at each step by adding a quadratic term. This is the idea of the proximal point method.

Consider a proper, convex, lower semicontinuous functional on a Hilbert space $U$,

\Phi:U\to \mathbb R\cup\{+\infty\}.

(43)

Given $\tau_k>0$ and an iterate $u_k$, the proximal point method defines $u_{k+1}$ by

u_{k+1} \in \operatorname*{argmin}_{u\in U} \left\{ \Phi(u) + \frac{1}{2\tau_k}\|u-u_k\|_U^2 \right\}.

(44)

The quadratic term makes the subproblem strongly convex and stabilizes the iteration.

The optimality condition for the subproblem is

0\in \partial \Phi(u_{k+1}) + \frac{1}{\tau_k}(u_{k+1}-u_k),

(45)

or equivalently

u_k-u_{k+1}\in \tau_k \partial \Phi(u_{k+1}).

(46)

This motivates the definition of the proximal map:

\operatorname{prox}_{\tau \Phi}(v) := \operatorname*{argmin}_{u\in U} \left\{ \Phi(u)+\frac{1}{2\tau}\|u-v\|_U^2 \right\}.

(47)

With this notation, the proximal point iteration reads

u_{k+1}=\operatorname{prox}_{\tau_k\Phi}(u_k).

(48)

Compared with subgradient descent, this is more implicit and usually more stable. However, if applied directly to the full reduced functional, each step may still be expensive because it involves a nontrivial nonsmooth minimization subproblem.
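As a toy illustration, consider the proximal point iteration applied only to $\Phi(u)=\beta\sum_i|u_i|$ on a vector, where the proximal map is known in closed form (componentwise soft-thresholding). The function names, the starting vector, and the parameter values are assumptions for this sketch; it does not treat the full reduced functional.

```python
import numpy as np

def soft(s, eta):
    """Soft-thresholding: shrink s toward zero by eta, mapping [-eta, eta] to 0."""
    return np.sign(s) * np.maximum(np.abs(s) - eta, 0.0)

def proximal_point_l1(u0, beta, tau, steps):
    """Proximal point iteration u_{k+1} = prox_{tau*Phi}(u_k) for Phi = beta * ||.||_1."""
    u = np.asarray(u0, dtype=float)
    iterates = [u.copy()]
    for _ in range(steps):
        u = soft(u, tau * beta)       # closed-form solution of the proximal subproblem
        iterates.append(u.copy())
    return iterates

iterates = proximal_point_l1([1.0, -0.35, 0.0], beta=1.0, tau=0.25, steps=5)
for k, v in enumerate(iterates):
    print(k, v)
```

Each step moves every component toward zero by at most $\tau\beta$, so the minimizer $u=0$ is reached after finitely many steps in this example.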

Proximal Gradient Method¶

The sparse control problem has the structure

f(u)=g(u)+h(u),

(49)

where

g(u) = \frac12\|S(u)-y_d\|_{L^2(\Omega)}^2 + \frac\alpha2\|u\|_{L^2(\Omega)}^2

(50)

is smooth, while

h(u)=\beta\|u\|_{L^1(\Omega)}

(51)

is convex but nonsmooth.

This splitting leads naturally to the proximal gradient method, also known as forward-backward splitting.

Given a step size $\tau_k>0$, one first takes a gradient step for the smooth part and then a proximal step for the nonsmooth part:

u_{k+1} = \operatorname{prox}_{\tau_k h} \bigl( u_k-\tau_k \nabla g(u_k) \bigr).

(52)

For the reduced sparse control problem, the gradient of the smooth part is

\nabla g(u)=\alpha u + p(u),

(53)

where $p(u)$ is the adjoint state associated with $u$.

Hence the iteration becomes

u_{k+1} = \operatorname{prox}_{\tau_k \beta \|\cdot\|_{L^1}} \bigl( u_k-\tau_k(\alpha u_k+p_k) \bigr).

(54)

The crucial point is that the proximal map of the $L^1$ norm is explicit and coincides with soft-thresholding:

\operatorname{prox}_{\tau\beta \|\cdot\|_{L^1}}(v)(x) = \operatorname{soft}(v(x),\tau\beta),

(55)

where

\operatorname{soft}(s,\eta) = \begin{cases} s-\eta, & s>\eta,\\ 0, & |s|\le \eta,\\ s+\eta, & s<-\eta. \end{cases}

(56)

Therefore each proximal gradient step consists of:

solving the state equation and the adjoint equation to compute $\nabla g(u_k)$;
taking a gradient step for the smooth reduced functional;
applying pointwise soft-thresholding.

Unlike plain subgradient descent, proximal gradient methods exploit the exact structure of the nonsmooth term and are therefore much more effective for sparse control problems. They remain first-order methods, so they are robust and relatively easy to implement: a constant step size $\tau_k\le 1/L$, with $L$ the Lipschitz constant of $\nabla g$, suffices for convergence. Their convergence is nevertheless slower than that of Newton-type methods.
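The three steps above can be sketched for a one-dimensional finite-difference model. The 1D domain $(0,1)$, the grid size, the data $y_d=\sin(\pi x)$, $f=0$, and the parameter values are assumptions chosen for illustration; integrals are replaced by $h$-weighted sums, in which case the discrete proximal map is still componentwise soft-thresholding with threshold $\tau\beta$.

```python
import numpy as np

def soft(s, eta):
    """Soft-thresholding: shrink s toward zero by eta, mapping [-eta, eta] to 0."""
    return np.sign(s) * np.maximum(np.abs(s) - eta, 0.0)

def adjoint(u, A, y_d, f):
    y = np.linalg.solve(A, u + f)          # state:   A y = u + f
    return np.linalg.solve(A, y - y_d)     # adjoint: A p = y - y_d (A symmetric)

def proximal_gradient(A, y_d, f, alpha, beta, tau, iters=300):
    u = np.zeros_like(y_d)
    for _ in range(iters):
        p = adjoint(u, A, y_d, f)
        # forward (gradient) step on g, then backward (proximal) step on beta*||.||_1
        u = soft(u - tau * (alpha * u + p), tau * beta)
    return u, adjoint(u, A, y_d, f)

n = 49
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)
A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2   # -d^2/dx^2, Dirichlet
alpha, beta = 0.1, 0.02

# Lipschitz constant of grad g: alpha + ||A^{-2}|| = alpha + 1 / lambda_min(A)^2
lip = alpha + 1.0 / np.linalg.eigvalsh(A)[0] ** 2
u, p = proximal_gradient(A, np.sin(np.pi * x), np.zeros(n), alpha, beta, tau=1.0 / lip)
print("nonzero control nodes:", np.count_nonzero(u), "of", n)
```

In contrast to subgradient descent, the thresholding step produces exact zeros, so the sparsity pattern of the control is visible directly in the iterates.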

Projection Formula and Soft Thresholding¶

Suppose there are no box constraints.

Then the optimality condition yields the explicit formula

\bar u(x) = -\frac1\alpha \operatorname{soft}(\bar p(x),\beta),

(57)

where

\operatorname{soft}(p,\beta) = \begin{cases} p-\beta, & p>\beta,\\ 0, & |p|\le \beta,\\ p+\beta, & p<-\beta. \end{cases}

(58)

This is called the soft-thresholding operator.

Observe the contrast with classical quadratic control:

\bar u=-\frac1\alpha \bar p.

(59)

The $L^1$ term creates a dead zone:

|p|\le \beta \implies u=0.

(60)

The parameter $\alpha$ acts outside this dead zone: it does not determine whether the control vanishes, but it scales the magnitude of $u$ when $|p|>\beta$.

Hence large portions of the domain may contain exactly vanishing controls.
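The formula can be sanity-checked pointwise: for a fixed adjoint value $p$, the inclusion $0\in\alpha u+p+\beta\partial|u|$ is the optimality condition of the scalar problem $\min_t \tfrac\alpha2 t^2+pt+\beta|t|$. The sketch below compares the closed-form expression with a brute-force grid search; the sample values of $p$, $\alpha$, $\beta$ and the grid are arbitrary choices for illustration.

```python
import numpy as np

def soft(s, eta):
    """Soft-thresholding: shrink s toward zero by eta, mapping [-eta, eta] to 0."""
    return np.sign(s) * np.maximum(np.abs(s) - eta, 0.0)

def pointwise_control(p, alpha, beta):
    """Evaluate u = -(1/alpha) * soft(p, beta) (no box constraints)."""
    return -soft(p, beta) / alpha

alpha, beta = 0.5, 0.1
p = np.array([-0.5, -0.1, 0.0, 0.05, 0.3])
u = pointwise_control(p, alpha, beta)

# Brute force: minimize alpha/2 * t^2 + p * t + beta * |t| over a fine grid, one row per p.
t = np.linspace(-2.0, 2.0, 40001)
phi = 0.5 * alpha * t[None, :] ** 2 + p[:, None] * t[None, :] + beta * np.abs(t[None, :])
u_grid = t[np.argmin(phi, axis=1)]

print("closed form:", u)
print("grid search:", u_grid)
```

The three middle entries of `p` lie in the dead zone $|p|\le\beta$ and give a vanishing control; the outer two give controls of opposite sign to $p$.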

Semismoothness¶

The mapping

\operatorname{soft}(p,\beta)

(61)

is not differentiable in the classical sense.

Nevertheless, it is semismooth.

Semismoothness is weaker than continuous Fréchet differentiability but strong enough to obtain locally superlinear convergence of Newton's method. For a nonsmooth equation $F(u)=0$, the semismooth Newton iteration reads:

G_k \delta u_k = -F(u_k),

(62)

where $\partial_C F$ denotes the generalized (Clarke) derivative of $F$ and

G_k\in \partial_C F(u_k).

(63)

Then

u_{k+1}=u_k+\delta u_k.

(64)

This is a much more sophisticated strategy than subgradient or proximal methods. It requires more structure, but in return it offers local superlinear convergence once the iteration enters the correct regime.
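For the sparse control problem without box constraints, one possible choice of $F$ is obtained directly from the soft-thresholding formula. The display below is an illustrative sketch under that assumption, not a construction quoted from the cited references; $\chi$ denotes an indicator function and $p_k=p(u_k)$.

```latex
F(u) := \alpha u + \operatorname{soft}(p(u),\beta) = 0,
\qquad
G_k\,\delta u = \alpha\,\delta u + \chi_{\{|p_k|>\beta\}}\,\delta p,
\qquad
\begin{cases}
-\Delta\,\delta y = \delta u & \text{in }\Omega,\\
-\Delta\,\delta p = \delta y & \text{in }\Omega,\\
\delta y = \delta p = 0 & \text{on }\partial\Omega.
\end{cases}
```

Here the scalar map $s\mapsto\operatorname{soft}(s,\beta)$ has slope $1$ for $|s|>\beta$ and slope $0$ for $|s|<\beta$; at the kinks $|s|=\beta$ any value in $[0,1]$ is admissible, and the indicator above corresponds to choosing $0$ there.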

For sparse control problems, semismooth Newton methods are most naturally derived from an equivalent slack-variable reformulation of the $L^1$ term.

Slack-Variable Reformulation¶

Introduce an auxiliary variable $w\in L^2(\Omega)$ and rewrite the sparse control problem as

\min_{(y,u,w)} \frac12\|y-y_d\|_{L^2(\Omega)}^2 + \frac\alpha2\|u\|_{L^2(\Omega)}^2 + \beta\int_\Omega w\,dx

(65)

subject to

\begin{cases} -\Delta y = u+f & \text{in }\Omega,\\ y=0 & \text{on }\partial\Omega, \end{cases}

(66)

and the pointwise inequalities

u-w\le 0, \qquad -u-w\le 0.

(67)

These constraints are equivalent to $|u|\le w$, so at the optimum one must have $w=|u|$ and the reformulated problem is equivalent to the original one. Indeed, for any fixed pair $(y,u)$ satisfying the state equation, the variable $w$ appears in the objective only through the linear term

\beta\int_\Omega w\,dx,

(68)

with $\beta>0$. Therefore the minimization always tries to make $w$ as small as possible. But the constraints require

w(x)\ge |u(x)| \qquad \text{for a.e. }x\in\Omega.

(69)

Hence the smallest admissible choice is precisely

w(x)=|u(x)| \qquad \text{for a.e. }x\in\Omega.

(70)

If on a set of positive measure one had $w>|u|$, then one could decrease $w$ on that set without violating the constraints, strictly reduce the objective, and thus contradict optimality.

The advantage is that the nonsmooth $L^1$ term has disappeared from the objective and has been replaced by a linear term together with linear inequality constraints, whose multipliers satisfy complementarity conditions.

KKT System for the Slack Formulation¶

For the reformulated problem, the Lagrangian is

\begin{split} \mathcal L(y,u,w,p,\lambda^+,\lambda^-) = & \frac12\|y-y_d\|_{L^2(\Omega)}^2 + \frac\alpha2\|u\|_{L^2(\Omega)}^2 + \beta\int_\Omega w\,dx \\ & - \int_\Omega \nabla y\cdot\nabla p\,dx + \int_\Omega (u+f)p\,dx \\ & + \int_\Omega \lambda^+(u-w)\,dx + \int_\Omega \lambda^-(-u-w)\,dx . \end{split}

(71)

Here:

$p$ is the adjoint variable for the state equation;
$\lambda^+\ge 0$ and $\lambda^-\ge 0$ are the multipliers associated with the inequalities $u-w\le 0$ and $-u-w\le 0$.

The KKT conditions are

\begin{cases} -\Delta y = u+f \ \text{in }\Omega,\quad y=0 \ \text{on }\partial\Omega,\\ -\Delta p = y-y_d \ \text{in }\Omega,\quad p=0 \ \text{on }\partial\Omega,\\ \alpha u + p + \lambda^+ - \lambda^- = 0,\\ \beta - \lambda^+ - \lambda^- = 0,\\ \lambda^+\ge 0,\quad \lambda^-\ge 0,\\ u-w\le 0,\quad -u-w\le 0,\\ \lambda^+(u-w)=0,\quad \lambda^-(-u-w)=0. \end{cases}

(72)

This is now a genuine complementarity system. In particular, the control stationarity condition is an equality, and the multipliers are classical KKT multipliers.
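The three stationarity equations follow by differentiating the Lagrangian in the directions $\delta y\in H_0^1(\Omega)$, $\delta u\in L^2(\Omega)$, and $\delta w\in L^2(\Omega)$. The short derivation below is a sketch that assumes $p\in H_0^1(\Omega)$ and interprets the adjoint equation in the weak sense.

```latex
\begin{aligned}
\partial_y\mathcal L\,\delta y
  &= (y-y_d,\delta y)_{L^2(\Omega)} - \int_\Omega \nabla\delta y\cdot\nabla p\,dx = 0
  \quad \forall\,\delta y\in H_0^1(\Omega)
  &&\Longrightarrow\quad -\Delta p = y-y_d,\\
\partial_u\mathcal L\,\delta u
  &= (\alpha u + p + \lambda^+ - \lambda^-,\delta u)_{L^2(\Omega)} = 0
  \quad \forall\,\delta u\in L^2(\Omega)
  &&\Longrightarrow\quad \alpha u + p + \lambda^+ - \lambda^- = 0,\\
\partial_w\mathcal L\,\delta w
  &= (\beta - \lambda^+ - \lambda^-,\delta w)_{L^2(\Omega)} = 0
  \quad \forall\,\delta w\in L^2(\Omega)
  &&\Longrightarrow\quad \beta - \lambda^+ - \lambda^- = 0.
\end{aligned}
```

The remaining lines of the KKT system are the sign conditions on the multipliers, primal feasibility, and complementarity for the two inequality constraints.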

Interpretation of the Complementarity Conditions¶

The slack formulation makes the structure of the sparse solution transparent.

If $u>0$, then necessarily $w=u$, the constraint $u-w\le 0$ is active, while $-u-w<0$ is inactive. Hence

\lambda^+=\beta, \qquad \lambda^-=0,

(73)

and therefore

\alpha u + p + \beta = 0.

(74)

If $u<0$, then $w=-u$, the second constraint is active, and one gets

\lambda^+=0, \qquad \lambda^-=\beta,

(75)

so that

\alpha u + p - \beta = 0.

(76)

If $u=0$, then $w=0$ and both constraints are active. In this case

\lambda^+ + \lambda^- = \beta,

(77)

and

p = -(\lambda^+ - \lambda^-),

(78)

which implies

|p|\le \beta.

(79)

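Conversely, the cases $u>0$ and $u<0$ force $p=-\alpha u-\beta<-\beta$ and $p=-\alpha u+\beta>\beta$, respectively. Thus the slack-variable KKT system recovers exactly the same threshold condition as the subdifferential formulation: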

u=0 \qquad \Longleftrightarrow \qquad |p|\le \beta.

(80)

Semismooth Newton and PDAS for the Slack Formulation¶

The KKT system above is well suited for semismooth Newton and primal-dual active set methods because the nonsmoothness is now entirely encoded in the complementarity relations.

It is convenient to work with the three regions

\mathcal P := \{x:u(x)>0\}, \qquad \mathcal N := \{x:u(x)<0\}, \qquad \mathcal Z := \{x:u(x)=0\}.

(81)

On these regions, the KKT conditions reduce to simple relations:

\lambda^+=\beta,\ \lambda^-=0 \quad \text{on }\mathcal P,

(82)

\lambda^+=0,\ \lambda^-=\beta \quad \text{on }\mathcal N,

(83)

and

u=0,\ w=0,\ \lambda^+ + \lambda^-=\beta \quad \text{on }\mathcal Z.

(84)

A PDAS iteration then proceeds by:

identifying the current positive, negative, and zero regions;
freezing the corresponding complementarity relations on each region;
solving the resulting linear state-adjoint-control-multiplier system.

This is exactly the active-set counterpart of a semismooth Newton step for the slack-variable KKT system. The method is more elaborate than proximal methods, but it exploits the complementarity structure directly and therefore achieves local superlinear convergence under suitable assumptions.
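The three steps above can be sketched on a one-dimensional finite-difference model. Several choices below are assumptions made for illustration and are not prescribed by the lecture: the regions are predicted from the current adjoint ($p<-\beta$ for $\mathcal P$, $p>\beta$ for $\mathcal N$, $|p|\le\beta$ for $\mathcal Z$), the state and adjoint are eliminated so that only a reduced system for $u$ on $\mathcal P\cup\mathcal N$ is solved, a dense inverse is formed (acceptable only for a tiny demo), and the data and parameters are arbitrary.

```python
import numpy as np

def pdas_sparse_control(A, y_d, f, alpha, beta, max_iter=30):
    """Active-set iteration for alpha*u + p(u) + lam = 0, lam in beta * d|u|."""
    n = y_d.size
    A_inv = np.linalg.inv(A)               # dense inverse: small 1D demo only
    K = A_inv @ A_inv                      # u-dependent part of the adjoint: p = K u + p0
    p0 = A_inv @ (A_inv @ f - y_d)         # adjoint state for u = 0
    u = np.zeros(n)
    old_pattern = None
    for iteration in range(1, max_iter + 1):
        p = K @ u + p0
        # multiplier sign: +1 on P (lam = beta), -1 on N (lam = -beta), 0 on Z (u = 0)
        pattern = np.where(p < -beta, 1.0, np.where(p > beta, -1.0, 0.0))
        if old_pattern is not None and np.array_equal(pattern, old_pattern):
            return u, p, iteration - 1     # regions unchanged: KKT system is satisfied
        idx = np.flatnonzero(pattern)      # nodes in P or N
        u = np.zeros(n)                    # u = 0 on Z
        if idx.size > 0:
            # on P and N: alpha*u + (K u) + p0 + beta*pattern = 0
            M = alpha * np.eye(idx.size) + K[np.ix_(idx, idx)]
            u[idx] = np.linalg.solve(M, -p0[idx] - beta * pattern[idx])
        old_pattern = pattern
    return u, K @ u + p0, max_iter

n = 49
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)
A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2   # -d^2/dx^2, Dirichlet
alpha, beta = 0.1, 0.02
u, p, iterations = pdas_sparse_control(A, np.sin(np.pi * x), np.zeros(n), alpha, beta)
print("active-set iterations:", iterations, "nonzero nodes:", np.count_nonzero(u))
```

When the predicted regions stop changing, the computed control satisfies the stationarity equation on $\mathcal P\cup\mathcal N$ and vanishes on $\mathcal Z$, which is the discrete counterpart of the threshold relation. No globalization is included, so the sketch inherits the purely local convergence character of Newton-type methods.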

Direct Dual Multiplier Formulation¶

There is another convenient way to write the sparse optimality system, closer to the earlier subdifferential formulation and more compact than the slack-variable KKT system.

Starting from

0\in \alpha u + p + \beta \partial \|u\|_{L^1},

(85)

we introduce the dual variable

\lambda \in \beta \partial \|u\|_{L^1}.

(86)

Then the control stationarity condition becomes simply

\alpha u + p + \lambda = 0.

(87)

Pointwise, this means

\lambda(x)\in \beta \partial |u(x)| = \begin{cases} \{\beta\}, & u(x)>0,\\ [-\beta,\beta], & u(x)=0,\\ \{-\beta\}, & u(x)<0. \end{cases}

(88)

Equivalently,

|\lambda|\le \beta,

(89)

and

u>0 \Longrightarrow \lambda=\beta, \qquad u<0 \Longrightarrow \lambda=-\beta, \qquad |\lambda|<\beta \Longrightarrow u=0.

(90)

This formulation carries exactly the same information as the earlier ones:

in the subdifferential formulation, one writes $0\in \alpha u + p + \beta\partial\|u\|_{L^1}$;
here one absorbs the factor $\beta$ into the dual variable and writes $\alpha u + p + \lambda = 0$ with $\lambda\in \beta\partial\|u\|_{L^1}$;
in the soft-thresholding formula, the condition $u=0$ for $|p|\le \beta$ is recovered by combining $\alpha u + p + \lambda=0$ with $|\lambda|\le \beta$;
in the slack-variable formulation, the same dual variable is represented by the difference of the two nonnegative multipliers:

\lambda = \lambda^+ - \lambda^-,

(91)

while the stationarity condition

\beta - \lambda^+ - \lambda^- = 0

(92)

implies

|\lambda| = |\lambda^+ - \lambda^-| \le \lambda^+ + \lambda^- = \beta.

(93)

Hence the direct dual formulation is a compact bridge between the two other descriptions:

the variational picture based on subdifferentials;
the complementarity picture based on slack variables and KKT multipliers.
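The correspondence between the two multiplier descriptions can be made explicit: solving $\lambda=\lambda^+-\lambda^-$ and $\lambda^++\lambda^-=\beta$ gives $\lambda^\pm=(\beta\pm\lambda)/2$. The sketch below uses this on nodal values; the function name and the sample data are hypothetical and chosen so that $u=-\operatorname{soft}(p,\beta)/\alpha$ holds.

```python
import numpy as np

def slack_multipliers(u, p, alpha, beta):
    """Recover lam, lam_plus, lam_minus from a control/adjoint pair."""
    lam = -(alpha * u + p)             # direct dual variable: alpha*u + p + lam = 0
    lam_plus = 0.5 * (beta + lam)      # from lam_plus - lam_minus = lam
    lam_minus = 0.5 * (beta - lam)     # and  lam_plus + lam_minus = beta
    return lam, lam_plus, lam_minus

alpha, beta = 0.1, 0.02
p = np.array([-0.05, 0.01, 0.04])      # |p| > beta at the first and last node
u = np.array([0.3, 0.0, -0.2])         # equals -soft(p, beta) / alpha
w = np.abs(u)                          # optimal slack variable

lam, lam_plus, lam_minus = slack_multipliers(u, p, alpha, beta)
print("lam      :", lam)
print("lam_plus :", lam_plus)
print("lam_minus:", lam_minus)
print("complementarity:", lam_plus * (u - w), lam_minus * (-u - w))
```

On the node with $u=0$ both multipliers are strictly positive and sum to $\beta$, while on the nodes with $u\ne 0$ one of them carries the full weight $\beta$ and the other vanishes.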

For semismooth Newton methods, one can combine

\alpha u + p + \lambda = 0

(94)

with a semismooth characterization of the graph of $\beta\partial|\cdot|$, for instance through projection formulas. For PDAS, the slack-variable formulation is often more convenient because the complementarity structure is completely explicit.
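One such characterization, given here as an illustrative example rather than as the specific formula used in the cited references, expresses the inclusion through pointwise $\max$ and $\min$ operations with an arbitrary fixed constant $c>0$:

```latex
\lambda\in\beta\,\partial|u|
\quad\Longleftrightarrow\quad
u = \max\bigl(0,\,u + c(\lambda-\beta)\bigr) + \min\bigl(0,\,u + c(\lambda+\beta)\bigr).
```

The three cases can be checked directly. If $u>0$ and $\lambda=\beta$, the right-hand side is $\max(0,u)+\min(0,u+2c\beta)=u$. If $u<0$ and $\lambda=-\beta$, it is $\max(0,u-2c\beta)+\min(0,u)=u$. If $u=0$ and $|\lambda|\le\beta$, both terms vanish. Since $\max$ and $\min$ are semismooth, the resulting equation fits the framework of the Newton iteration recalled earlier.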

References¶

J.C. De los Reyes, Numerical PDE-Constrained Optimization, Springer, 2015.
F. Tröltzsch, Optimal Control of Partial Differential Equations, AMS, 2010.
M. Hintermüller, K. Ito, K. Kunisch, The primal-dual active set strategy as a semismooth Newton method, SIAM Journal on Optimization 13(3), 2002.
M. Hinze, R. Pinnau, M. Ulbrich, S. Ulbrich, Optimization with PDE Constraints, Springer, 2009.
A. Manzoni, A. Quarteroni, S. Salsa, Optimal Control of Partial Differential Equations, Springer, 2021.
