Recently I encountered a mixed integer programming problem with linear constraints but a non-linear convex objective function. There does not seem to be any nice solver package for this available in Python, and even looking for anything with a C/C++ API, there are not many open-source solutions, AFAIK. Therefore my second attempt is to solve the problem heuristically. It turns out the problem structure may allow a simple hill climbing (i.e., greedy search) algorithm. However, while the objective function is convex, I cannot prove the solution domain is contiguous in all cases, hence I cannot guarantee there will not be any local minima. So playing safe and using simulated annealing can be a good move.

Simulated annealing is a metaheuristic, which means it depends on a handful of other parameters to work. A simple skeleton of the algorithm is as follows:

import random

def simulated_annealing(s0, k_max):
    s = s0
    for k in range(k_max):
        T = temperature(k/k_max)   # annealing schedule: maps progress in [0,1) to a temperature
        s_new = neighbour(s)       # randomly chosen neighbour of the current solution
        if P(E(s), E(s_new), T) >= random.random():
            s = s_new              # accept the move with probability P
    return s

It starts searching from an initial solution s0 for k_max steps, and depends on an annealing schedule temperature(r) that maps the fraction of annealing time elapsed to a temperature. At each step it picks a randomly chosen neighbour(s) and accepts that neighbour with a probability given by the function P(), which in turn depends on the energy level E() of the solutions. The acceptance probability function (e.g., in the Metropolis-Hastings algorithm) is exponential, \(P(e, e', T) = \exp(-\frac{e'-e}{T})\), for \(e,e'\) the energy levels of s and its neighbour respectively. The function P() may rise above 1 if the neighbour of s has lower energy (i.e., is better), in which case the neighbour is accepted deterministically.
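
As a concrete sketch, here is a Metropolis-style acceptance function matching the formula above, with the deterministic acceptance of better neighbours made explicit:

import math

def P(e, e_new, T):
    """Metropolis acceptance probability for the skeleton above."""
    if e_new <= e:
        return 1.0                       # lower energy: accept deterministically
    return math.exp(-(e_new - e) / T)    # higher energy: accept with prob exp(-dE/T)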

Variations of the algorithm

Because simulated annealing is a metaheuristic algorithm, there are many variations available. The paper Adaptive simulated annealing (ASA): Lessons learned gives a very concise summary.

The original simulated annealing is modelled after the Boltzmann equation, and thus named Boltzmann Annealing (BA). It has a solution modelled as a \(d\)-vector \(x=(x^1, \cdots, x^d)\) with the annealing schedule \(T(k)=T_0/\log k\), for \(k\) denoting the step count. It searches the solution space by the probability density

\[g_T(\Delta x) = (2\pi T)^{-d/2}\exp\left(-\frac{\Delta x^2}{2T}\right)\]

where \(\Delta x = x_{k+1} - x_k\), with \(x_k\) usually the current solution and \(x_{k+1}\) the neighbour. This models the class of physical systems called Gaussian-Markovian systems. The acceptance probability function is

\[h_T(\Delta E) = \frac{1}{1+\exp(\Delta E/T)} \approx \exp(-\Delta E/T)\]

where \(\Delta E = E_{k+1}-E_k\); this is also the form of the probability function mentioned above. Boltzmann Annealing is characterized by the annealing schedule \(T(k)\). It is proven that with such a schedule, any point in the solution space can be sampled infinitely often in annealing time, i.e., \(\sum_k g_k = \infty\); in other words, the sampling is ergodic.

If we switch to use the exponential schedule

\[T(k) = cT(k-1) = T_0 c^k \approx T_0 e^{(c-1)k}\]

for a constant \(0 < c < 1\) close to 1, this makes the annealing faster but violates the assumption of ergodic sampling in BA. People nevertheless find it useful and name it simulated quenching (SQ). Such a schedule is also called the exponential cooling scheme (ECS).

Another variation is to use the Cauchy distribution for solution probability density:

\[g_T(\Delta x) = \frac{T}{(\Delta x^2 + T^2)^{(d+1)/2}}\]

which has a fatter tail than the Gaussian distribution in BA. Therefore, it is easier to get around local minima while searching for the global minimum. Moreover, such a density attains statistically the same ergodicity as BA with the faster annealing schedule \(T(k) = T_0/k\), due to

\[\sum_k g_k \approx \frac{T_0}{|\Delta x|^{d+1}}\sum_k\frac{1}{k} = \infty\]

Therefore, it is faster while similarly effective. This one is called fast annealing (FA).
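
For illustration, a common way to approximate the Cauchy visiting distribution is to perturb each component independently with a one-dimensional Cauchy step scaled by the temperature (a sketch; the exact multivariate density above is not separable):

import numpy as np

def fa_neighbour(x, T):
    """Fast-annealing move: heavy-tailed Cauchy perturbation of scale T.
    x is a numpy array representing the current solution."""
    return x + T * np.random.standard_cauchy(size=x.shape)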

Adaptive Simulated Annealing (ASA) is a further extension that allows each component \(x^i\) of the solution vector \(x\) to take a different domain, with a different density function, and undergo a different annealing schedule. This expands the number of parameters of the algorithm many-fold.

Constrained optimization

The literature on simulated annealing usually does not explicitly concern itself with constrained optimization. One way to apply SA to constrained optimization is to implement the logic in the neighbour function so that every neighbour generated is always within the constraints. Another way is to implement the constraints in the objective function as penalties, e.g.,

\[E^{\ast}(x) = E(x) + \frac{1}{T}\sum_i w_i c_i(x)\]

for each \(c_i(x)\) the magnitude of violation of constraint \(i\) and \(w_i\) the corresponding penalty weight. This approach may not be a good one, especially if the valid solutions are sparse.
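
A minimal sketch of this penalty approach, assuming the objective E(), the violation functions c_i, and the weights w_i are supplied by the caller:

def penalized_energy(x, T, E, constraints, weights):
    """E*(x): objective plus temperature-scaled penalties for violations."""
    penalty = sum(w * c(x) for w, c in zip(weights, constraints))
    return E(x) + penalty / T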

Parameter estimation

The SA algorithm is mostly controlled by the temperature and its annealing schedule.

A well-received belief is that the initial temperature should be set to such a level that there is at least an 80% probability of accepting an increase in the objective function, and that during the last 40% of annealing time, the algorithm should search in the proximity of the optimal solution. One way to find the initial temperature \(T_0\) is to estimate (or observe) the average objective increase \(\Delta E^+\) and then set \(T_0 = -\frac{\Delta E^+}{\log p_0}\), for \(p_0\) the desired initial probability of acceptance (e.g., \(p_0 = 0.8\)).
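
A sketch of this estimation using a short random walk, assuming problem-specific E() and neighbour() functions and that at least some moves go uphill:

import math

def estimate_T0(E, neighbour, s0, p0=0.8, trials=100):
    """Set T0 so that an average uphill move is accepted with probability p0."""
    increases = []
    s = s0
    for _ in range(trials):
        s_new = neighbour(s)
        dE = E(s_new) - E(s)
        if dE > 0:
            increases.append(dE)   # record uphill (worsening) moves only
        s = s_new                  # random walk: always move on
    avg_increase = sum(increases) / len(increases)
    return -avg_increase / math.log(p0)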

The annealing schedule can also be fine-tuned so that more than one neighbour is searched in every temperature step \(k\). Indeed, for a fixed temperature, the search and transition between neighbours forms a Markov chain. We can designate a maximum length \(L_k\) for the Markov chain at step \(k\) such that we either check a total of \(L_k\) solutions or accept \(N_{min} \le L_k\) solutions, and then decrease the temperature by one step.
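
A sketch of this Markov-chain-per-temperature loop, again with problem-specific E() and neighbour() functions and an exponential cooling factor alpha:

import math
import random

def anneal_with_chains(s0, T0, alpha, L, N_min, k_max):
    """At each temperature step, try up to L moves or N_min acceptances, then cool."""
    s, T = s0, T0
    for _ in range(k_max):
        accepted = 0
        for _ in range(L):
            s_new = neighbour(s)
            dE = E(s_new) - E(s)
            if dE <= 0 or math.exp(-dE / T) >= random.random():
                s = s_new
                accepted += 1
            if accepted >= N_min:
                break              # enough acceptances at this temperature
        T *= alpha                 # decrease the temperature by one step
    return s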

Sudoku

I found a Python package, simanneal, for simulated annealing. The SA algorithm is simple enough that we could code our own, so this package is not an absolute necessity; however, it is simple to use. Let's start with the code:

#!/usr/bin/env python

import random
from simanneal import Annealer

# from https://neos-guide.org/content/sudoku
_ = 0
PROBLEM = [
    1, _, _,  _, _, 6,  3, _, 8,
    _, _, 2,  3, _, _,  _, 9, _,
    _, _, _,  _, _, _,  7, 1, 6,

    7, _, 8,  9, 4, _,  _, _, 2,
    _, _, 4,  _, _, _,  9, _, _,
    9, _, _,  _, 2, 5,  1, _, 4,

    6, 2, 9,  _, _, _,  _, _, _,
    _, 4, _,  _, _, 7,  6, _, _,
    5, _, 7,  6, _, _,  _, _, 3,
]

def print_sudoku(state):
    border = "------+-------+------"
    rows = [state[i:i+9] for i in range(0, 81, 9)]
    for i, row in enumerate(rows):
        if i % 3 == 0:
            print(border)
        triples = [row[j:j+3] for j in range(0, 9, 3)]
        print(" | ".join(
            " ".join(str(x or "_") for x in one)
            for one in triples
        ))
    print(border)

def init_solution_row(problem):
    """Generate a random solution from a Sudoku problem
    """
    solution = []
    for i in range(0, 81, 9):
        row = problem[i:i+9]
        permu = [n for n in range(1,10) if n not in row]
        random.shuffle(permu)
        solution.extend([n or permu.pop() for n in row])
    return solution

def coord(row, col):
    return row*9 + col

class Sudoku_Row(Annealer):
    def __init__(self, problem):
        self.problem = problem
        state = init_solution_row(problem)
        super().__init__(state)
    def move(self):
        """randomly swap two cells in a random row"""
        row = random.randrange(9)
        coords = range(9*row, 9*(row+1))
        n1, n2 = random.sample([n for n in coords if self.problem[n] == 0], 2)
        self.state[n1], self.state[n2] = self.state[n2], self.state[n1]
    def energy(self):
        """calculate the number of violations: assume all rows are OK"""
        column_score = lambda n: -len(set(self.state[coord(i, n)] for i in range(9)))
        square_score = lambda n, m: -len(set(self.state[coord(3*n+i, 3*m+j)] for i in range(3) for j in range(3)))
        return sum(column_score(n) for n in range(9)) + sum(square_score(n,m) for n in range(3) for m in range(3))

def main():
    sudoku = Sudoku_Row(PROBLEM)
    sudoku.copy_strategy = "slice"
    sudoku.steps = 1000000
    print_sudoku(sudoku.state)
    state, e = sudoku.anneal()
    print("\n")
    print_sudoku(state)
    print(e)

if __name__ == "__main__":
    main()

Instead of trying the traditional unconstrained continuous minimization problem, we go with the unnatural choice of a constrained combinatorial problem. We pick the sudoku problem from https://neos-guide.org/content/sudoku and use the simanneal module. The implementation is easy: just derive a class from Annealer and define the two member functions move() and energy(), for the neighbour search and the objective function to minimize respectively. The SA algorithm starts by initializing the Annealer with an initial solution; calling anneal() returns the optimized state variable and its corresponding energy level.

The above program tries to find a permutation of 1 to 9 in each row and checks whether it arrives at a solution to the sudoku problem. Every step picks a random row and swaps two non-given positions in it. The energy function counts the number of distinct elements in each column and in each square block and takes the negation of the sum; rows need no checking as they are permutations by construction. A perfect solution will score \(-9\times 9\times 2 = -162\). This program does not work well: it does not come up with a correct solution. So my second attempt is to shuffle not by rows but by square blocks:

import random
import numpy as np
from simanneal import Annealer

# from https://neos-guide.org/content/sudoku
_ = 0
PROBLEM = np.array([
    1, _, _,  _, _, 6,  3, _, 8,
    _, _, 2,  3, _, _,  _, 9, _,
    _, _, _,  _, _, _,  7, 1, 6,

    7, _, 8,  9, 4, _,  _, _, 2,
    _, _, 4,  _, _, _,  9, _, _,
    9, _, _,  _, 2, 5,  1, _, 4,

    6, 2, 9,  _, _, _,  _, _, _,
    _, 4, _,  _, _, 7,  6, _, _,
    5, _, 7,  6, _, _,  _, _, 3,
])

def print_sudoku(state):
    border = "------+-------+------"
    rows = [state[i:i+9] for i in range(0, 81, 9)]
    for i, row in enumerate(rows):
        if i % 3 == 0:
            print(border)
        triples = [row[j:j+3] for j in range(0, 9, 3)]
        print(" | ".join(
            " ".join(str(x or "_") for x in one)
            for one in triples
        ))
    print(border)

def coord(row, col):
    return row*9+col

def block_indices(block_num):
    """return linear array indices corresp to the sq block, row major, 0-indexed.
    block:
       0 1 2     (0,0) (0,3) (0,6)
       3 4 5 --> (3,0) (3,3) (3,6)
       6 7 8     (6,0) (6,3) (6,6)
    """
    firstrow = (block_num // 3) * 3
    firstcol = (block_num % 3) * 3
    indices = [coord(firstrow+i, firstcol+j) for i in range(3) for j in range(3)]
    return indices

def initial_solution(problem):
    """Given a sudoku problem, generate an initial solution by randomly filling
    each square block without considering row/column consistency"""
    solution = problem.copy()
    for block in range(9):
        indices = block_indices(block)
        given = problem[indices]   # clues already in this block
        zeros = [i for i in indices if problem[i] == 0]
        to_fill = [n for n in range(1, 10) if n not in given]
        random.shuffle(to_fill)
        for index, value in zip(zeros, to_fill):
            solution[index] = value
    return solution

class Sudoku_Sq(Annealer):
    def __init__(self, problem):
        self.problem = problem
        state = initial_solution(problem)
        super().__init__(state)
    def move(self):
        """randomly swap two cells in a random square"""
        block = random.randrange(9)
        indices = [i for i in block_indices(block) if self.problem[i] == 0]
        m, n = random.sample(indices, 2)
        self.state[m], self.state[n] = self.state[n], self.state[m]
    def energy(self):
        """calculate the number of violations: assume all rows are OK"""
        column_score = lambda n: -len(set(self.state[coord(i, n)] for i in range(9)))
        row_score = lambda n: -len(set(self.state[coord(n, i)] for i in range(9)))
        score = sum(column_score(n)+row_score(n) for n in range(9))
        if score == -162:
            self.user_exit = True # early quit, we found a solution
        return score

def main():
    sudoku = Sudoku_Sq(PROBLEM)
    sudoku.copy_strategy = "method"
    print_sudoku(sudoku.state)
    #sudoku.steps = 1000000
    #auto_schedule = sudoku.auto(minutes=1)
    #print(auto_schedule)
    #sudoku.set_schedule(auto_schedule)
    sudoku.Tmax = 0.5
    sudoku.Tmin = 0.05
    sudoku.steps = 100000
    sudoku.updates = 100
    state, e = sudoku.anneal()
    print("\n")
    print_sudoku(state)
    print("E=%f (expect -162)" % e)

if __name__ == "__main__":
    main()

This version finds a neighbour by picking a square block randomly and swapping two non-given positions in it. Therefore each block is guaranteed to hold 1 to 9, but we have to check each row and column for consistency. We also set self.user_exit when a solution is found so that the search can terminate early. The performance is only slightly better (or maybe not; I did not collect enough statistics to claim this), and neither program ever finds a solution.

However, if we use the auto() function to find the schedule (the commented lines in main() above), the program starts to find a solution, albeit not always. A few other modifications increased the chance of reaching a solution:

Firstly, in the simanneal module, the temperature update is based on this formula:

T = self.Tmax * math.exp(Tfactor * step / self.steps)

where Tfactor is a constant derived from Tmax and Tmin so as to guarantee that the temperature T goes from Tmax down to Tmin over the annealing schedule. It turns out that with the normal settings the temperature drops too fast, which has an adverse effect on the search and traps us in a sub-optimal solution. If we change this to T = 0.99999 * T, it becomes easier to reach a solution. But we can achieve the same by setting Tmax and Tmin closer to each other, which leads to the second modification:

Secondly, we set particular numbers for Tmax and Tmin (and also steps). They were found empirically, such that a solution can be reached in a few seconds instead of minutes. It turns out a low Tmax with a comparable Tmin maintains good effectiveness by keeping the temperature decrease steady. The following chart shows the annealing schedule according to the exponential decay formula in simanneal above, scaled to match at step 0. The blue curve uses the values 0.5 and 0.05 respectively, as in the code above; the red curve uses the default values of 25000 and 2.5 respectively. The absolute value of T does not matter in this chart, but the ratio of Tmax to Tmin does. The value of T, however, plays a role in the acceptance probability function, and thus it is set as above.
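
A short matplotlib sketch that reproduces such a chart, plotting T/Tmax so both curves start at 1 (assuming Tfactor = -log(Tmax/Tmin), consistent with the update formula above):

import math
import matplotlib.pyplot as plt

steps = 100000
xs = list(range(0, steps + 1, 1000))
for Tmax, Tmin in [(0.5, 0.05), (25000.0, 2.5)]:
    Tfactor = -math.log(Tmax / Tmin)
    ys = [math.exp(Tfactor * step / steps) for step in xs]  # T/Tmax, equals 1 at step 0
    plt.plot(xs, ys, label="Tmax=%g, Tmin=%g" % (Tmax, Tmin))
plt.xlabel("step")
plt.ylabel("T / Tmax")
plt.legend()
plt.show()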

How about SA with constant temperature? Make Tmin and Tmax equal. How about simple hill climbing instead of SA? Set both Tmax and Tmin as close to zero as possible so that the probability of accepting a worse neighbour is virtually zero. But a problem of this combinatorial nature is probably not suitable for hill-climbing search.
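
For example, a hypothetical tweak to main() above that degenerates SA into hill climbing this way:

sudoku = Sudoku_Sq(PROBLEM)
sudoku.copy_strategy = "method"
sudoku.Tmax = sudoku.Tmin = 1e-9   # near-zero constant temperature: virtually no uphill moves accepted
sudoku.steps = 100000
state, e = sudoku.anneal()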

References

Lester Ingber, Adaptive simulated annealing (ASA): Lessons learned. Control and Cybernetics, 1995. arXiv:cs/0001018

Franco Busetti, Simulated annealing overview

Another implementation of a Sudoku solver in Python