Notes on a paper about global optimization of Lipschitz continuous functions. Algorithms are provided; they are useful when evaluation of the function is expensive.

Definition of Lipschitz continuity: a function $f:X \mapsto Y$ is Lipschitz continuous with constant $k\ge 0$ if

$d_Y(f(x), f(x')) \le k\, d_X(x, x') \quad \forall x,x'\in X$

where $d_X$ and $d_Y$ are metrics on $X$ and $Y$ respectively. For example, $f:\mathbb{R}^2\mapsto\mathbb{R}$ is Lipschitz continuous with constant $k$ if

$|f(x)-f(x')| \le k\,||x-x'||_2 \quad \forall x,x'\in\mathbb{R}^2$
## Set up

• $f: \mathcal{X}\mapsto\mathbb{R}$ defined on a compact convex set $\mathcal{X}\subset\mathbb{R}^d$
• Global optimization $X^{\ast} \in \arg\max_{X\in\mathcal{X}} f(X)$
• Search algorithm: produces sequence $(X_1,f(X_1)), \ldots, (X_t,f(X_t))$ and selection of $X_{t+1}$ depends only on $(X_i,f(X_i))\ \forall i\le t$
• Performance metric of an algorithm: how fast the optimization error $\max_{X\in\mathcal{X}} f(X) - \max_{i=1,\ldots,t} f(X_i)$ converges to zero
• Function has a finite Lipschitz constant $k$:
$|f(X)-f(X')| \le k\,||X-X'||_2 \quad \forall X,X'\in\mathcal{X}$
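The smallest such $k$ can be lower-bounded empirically from pairwise slopes of sampled evaluations (the same idea drives the adaptive algorithm below). A minimal sketch; the linear test function and sample size are my own choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_lipschitz(f, X):
    """Largest observed slope |f(x_i)-f(x_j)| / ||x_i-x_j||_2 over all sampled pairs.
    This is always a lower bound on the true Lipschitz constant of f."""
    y = np.array([f(x) for x in X])
    best = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            dist = np.linalg.norm(X[i] - X[j])
            if dist > 0:
                best = max(best, abs(y[i] - y[j]) / dist)
    return best

# Example: f(x) = 3*x1 + 4*x2 has Lipschitz constant ||(3, 4)||_2 = 5 w.r.t. the L2 norm
X = rng.uniform(-1.0, 1.0, size=(200, 2))
k_hat = empirical_lipschitz(lambda x: 3 * x[0] + 4 * x[1], X)
# k_hat never exceeds 5 (Cauchy-Schwarz) and approaches it as the sample grows
```

Since the observed slope of a linear function never exceeds its gradient norm, the estimate converges to the true constant from below.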

## Notations

• $X = (x_1, x_2, \ldots, x_d) \in \mathbb{R}^d$, i.e. dimension $d$
• Standard $L_2$-norm is used:
$||X||_2^2 = \sum_{i=1}^d x_i^2$
• Ball centered at $X$ with radius $r$:
$B(X,r) = \{X'\in\mathbb{R}^d: ||X-X'||_2 \le r\}$
• Inner radius of $\mathcal{X}\subset \mathbb{R}^d$:
$\mathrm{rad}(\mathcal{X}) = \max\{r>0: \exists x\in\mathcal{X}\ \textrm{s.t.}\ B(x,r)\subseteq\mathcal{X}\}$
• Diameter of $\mathcal{X}\subset\mathbb{R}^d$:
$\mathrm{diam}(\mathcal{X}) = \max_{x,x'\in\mathcal{X}} ||x-x'||_2$
• Volume of $\mathcal{X}$ defined by the Lebesgue measure
• Set of $k$-Lipschitz functions defined on $\mathcal{X}$:
$\mathrm{Lip}(k) = \{f: \mathcal{X}\mapsto\mathbb{R}\quad\textrm{s.t.}\quad |f(x)-f(x')|\le k||x-x'||_2\quad\forall x,x'\in\mathcal{X}\}$
• Set of all Lipschitz continuous functions: $\bigcup_{k\ge 0}\textrm{Lip}(k)$
• Uniform distribution: $\mathcal{U}(\mathcal{X})$

## Definitions and propositions

Optimization consistency of algorithm $A$, which generates $X_1, X_2, \ldots, X_n$, over a set of functions $\mathcal{F}$: $\forall f\in\mathcal{F}$,

$\max_{i=1,\ldots,n} f(X_i) \xrightarrow{\ p\ } \max_{X\in\mathcal{X}} f(X) \quad\textrm{as}\quad n\to\infty$
Necessary and sufficient condition for consistency over the set of Lipschitz functions (which also guarantees approaching a global maximizer): the evaluation points become dense in $\mathcal{X}$, i.e.

$\sup_{X\in\mathcal{X}}\ \min_{i=1,\ldots,n} ||X_i - X||_2 \xrightarrow{\ p\ } 0 \quad\textrm{as}\quad n\to\infty$
Convergence of pure random search: with probability no less than $1-\delta$,

$\max_{X\in\mathcal{X}} f(X) - \max_{i=1,\ldots,n} f(X_i) \le k\cdot\mathrm{diam}(\mathcal{X})\cdot\left(\frac{\ln(1/\delta)}{n}\right)^{1/d}$
Minimax rate over $k$-Lipschitz functions according to Bull (2011): $\forall n\in\mathbb{N}$,

$c_1\, k\, n^{-1/d} \le \inf_{A} \sup_{f\in\mathrm{Lip}(k)} \mathbb{E}\left[\max_{X\in\mathcal{X}} f(X) - \max_{i=1,\ldots,n} f(X_i)\right] \le c_2\, k\, n^{-1/d}$

where $c_1, c_2 > 0$ are constants depending only on $\mathcal{X}$, and $\mathbb{E}[\ ]$ is the expectation over the $n$ evaluation points $X_1,\ldots,X_n$. This suggests a convergence rate of $\Theta(n^{-1/d})$.
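Pure random search, the baseline these rates are stated against, is trivial to sketch (the test function and box domain are my own choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def pure_random_search(f, lower, upper, n):
    """Draw n points i.i.d. uniformly from the box [lower, upper] and keep the best."""
    X = rng.uniform(lower, upper, size=(n, len(lower)))
    values = np.array([f(x) for x in X])
    best = int(np.argmax(values))
    return X[best], values[best]

# 1-Lipschitz test function with global maximum 0 at (0.5, 0.5)
f = lambda x: -np.linalg.norm(x - 0.5)
x_best, f_best = pure_random_search(f, np.array([0.0, 0.0]), np.array([1.0, 1.0]), 1000)
# f_best approaches the maximum 0, but only at the slow Theta(n^{-1/d}) rate
```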

## Algorithms

LIPO algorithm, for a function whose Lipschitz constant $k$ is known:

• Input: $n\in\mathbb{N}, k\ge 0, \mathcal{X}\subset\mathbb{R}^d, f\in\mathrm{Lip}(k)$
• Procedure
1. Let $X_1 \sim \mathcal{U}(\mathcal{X})$
2. Set $t:=1$, evaluate $f(X_1)$
3. While $t < n$:
1. Let $X_{t+1} \sim \mathcal{U}(\mathcal{X})$
2. If $\min_{i=1,\ldots,t}\left(f(X_i)+k||X_{t+1}-X_i||_2\right) \ge \max_{i=1,\ldots,t} f(X_i)$
1. Evaluate $f(X_{t+1})$
2. Update $t := t+1$
• Output: $X_{\hat{i}}$ where $\hat{i}\in\arg\max_{i=1,\ldots,n} f(X_i)$

The if-statement inside the while loop is the decision rule: it accepts only those candidates $X_{t+1}$ that have the potential to improve on the current maximum. The RHS is the best value encountered so far, and the LHS is the tightest upper bound on $f(X_{t+1})$ implied by the Lipschitz condition and the evaluations so far. A lemma in the paper shows that $X_{t+1}$ can be a maximizer of some function in $\mathrm{Lip}(k)$ consistent with the observations if and only if the inequality holds.
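A minimal LIPO sketch under the definitions above, for a box domain; the test function, domain, and names are my own choices, not the paper's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def lipo(f, k, lower, upper, n):
    """LIPO with known Lipschitz constant k on the box [lower, upper]:
    draw uniform candidates, but only evaluate those passing the decision rule."""
    X = [rng.uniform(lower, upper, size=len(lower))]
    y = [f(X[0])]
    while len(X) < n:
        x = rng.uniform(lower, upper, size=len(lower))  # candidate X_{t+1}
        # LHS of the decision rule: Lipschitz upper bound on f(x) given past evaluations
        ub = min(yi + k * np.linalg.norm(x - xi) for xi, yi in zip(X, y))
        if ub >= max(y):  # x can still be a maximizer -> spend an evaluation on it
            X.append(x)
            y.append(f(x))
    best = int(np.argmax(y))
    return X[best], y[best]

# 1-Lipschitz test function with global maximum 0 at (0.25, 0.25)
f = lambda x: -np.linalg.norm(x - 0.25)
x_best, f_best = lipo(f, 1.0, np.array([0.0, 0.0]), np.array([1.0, 1.0]), 100)
```

Rejected candidates cost no evaluation of $f$, which is where the savings over pure random search come from when evaluations are expensive.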

Adaptive LIPO (AdaLIPO) for an unknown Lipschitz constant:

• Input: $n\in\mathbb{N}$, $p\in(0,1)$, a meshgrid of candidate Lipschitz constants $(k_i)_{i\in\mathbb{Z}}$, $\mathcal{X}\subset\mathbb{R}^d$, $f\in\bigcup_{k\ge 0}\mathrm{Lip}(k)$
• Procedure
1. Let $X_1 \sim \mathcal{U}(\mathcal{X})$
2. Set $k_1:=0$ and $t:=1$, evaluate $f(X_1)$
3. While $t < n$:
1. Draw a Bernoulli random variable $B_{t+1}\in\{0,1\}$ with probability $p$
2. If $B_{t+1} = 1$, it is the exploration, $X_{t+1}\sim\mathcal{U}(\mathcal{X})$
3. If $B_{t+1} = 0$, it is the exploitation, $X_{t+1}\sim\mathcal{U}(\mathcal{X}_{k_t,t})$
4. Evaluate $f(X_{t+1})$
5. Update $t := t+1$
6. Update the Lipschitz constant estimate: $k_t := \inf\{k_i : i\in\mathbb{Z},\ \max_{l\neq j,\ l,j\le t} \frac{|f(X_l)-f(X_j)|}{||X_l-X_j||_2} \le k_i\}$
• Output: $X_{\hat{i}}$ where $\hat{i}\in\arg\max_{i=1,\ldots,n} f(X_i)$

The Lipschitz constant estimate at each step is $k_t$. The set $\mathcal{X}_{k,t}$ is the set of potential maximizers, i.e., those $x\in\mathcal{X}$ that satisfy the LIPO decision rule:

$\mathcal{X}_{k,t} = \left\{x\in\mathcal{X}: \min_{i=1,\ldots,t}\left(f(X_i)+k||x-X_i||_2\right) \ge \max_{i=1,\ldots,t} f(X_i)\right\}$
## Note

A naive implementation would be no faster than Monte Carlo search if the inequality has to be checked on every candidate draw. A more efficient approach would be to modify the random generator to draw exactly those $X\in\mathcal{X}_{k,t}$; how to do that depends on the shape of the domain and of the acceptance region.
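An AdaLIPO sketch in the same setting, using exactly the naive rejection-sampling draw described above for the exploitation step; the meshgrid $k_i = 2^i$, the forced first exploration, and all names are my own choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def adalipo(f, lower, upper, n, p=0.1):
    """AdaLIPO sketch on the box [lower, upper]: adapt the Lipschitz estimate k_t
    on the meshgrid k_i = 2^i and mix exploration with exploitation over X_{k,t}."""
    X = [rng.uniform(lower, upper, size=len(lower))]
    y = [f(X[0])]
    k_t = 0.0
    while len(X) < n:
        if rng.random() < p or k_t == 0.0:  # exploration (forced until k_t exists)
            x = rng.uniform(lower, upper, size=len(lower))
        else:  # exploitation: naive rejection sampling from X_{k,t}
            while True:
                x = rng.uniform(lower, upper, size=len(lower))
                ub = min(yi + k_t * np.linalg.norm(x - xi) for xi, yi in zip(X, y))
                if ub >= max(y):
                    break
        X.append(x)
        y.append(f(x))
        # smallest meshgrid value 2^i dominating every observed pairwise slope
        slopes = [abs(yi - yj) / np.linalg.norm(xi - xj)
                  for i, (xi, yi) in enumerate(zip(X, y))
                  for xj, yj in zip(X[:i], y[:i])]
        k_t = 2.0 ** np.ceil(np.log2(max(slopes))) if max(slopes) > 0 else 0.0
    best = int(np.argmax(y))
    return X[best], y[best]

# 1-Lipschitz test function with global maximum 0 at (0.25, 0.25)
f = lambda x: -np.linalg.norm(x - 0.25)
x_best, f_best = adalipo(f, np.array([0.0, 0.0]), np.array([1.0, 1.0]), 100)
```

Each rejection round still checks the inequality against all past points, which is exactly the inefficiency the note above refers to.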


## Bibliographic data

@inproceedings{malherbe2017global,
  title = "Global Optimization of Lipschitz Functions",
  author = "Cedric Malherbe and Nicolas Vayatis",
  year = "2017",
  booktitle = "Proceedings of the 34th International Conference on Machine Learning",
  address = "Sydney, Australia",
}