<h1 id="tic-tac-toe-using-ai-from-the-last-century">Tic-tac-toe using AI from the last century</h1>
<p>Adrian S. Tam, 2019-03-15</p>
<p>I want to watch the computer play a game with itself. So I pick the easiest
game, tic-tac-toe, and see how well the computer can play. Tic-tac-toe is never
an interesting game: any sensible human playing it will force a draw. So if the
computer is smart enough, it should also end in a draw.</p>
<h2 id="skeleton-of-self-play-engine">Skeleton of self-play engine</h2>
<p>We start with the skeleton. A game where the computer plays against itself
is simpler, as I do not need to implement a user interface to obtain human
input. So this will be a very simple loop:</p>
<pre><code class="language-python">def play():
    game = Board()
    player = 'X'
    while game.spaces() and not game.won():
        opponent = 'O' if player == 'X' else 'X'
        game = move(game, player)  # move() is a placeholder: returns the board after the player's move
        print("%s move:" % player)
        print(game)
        player = opponent
    winner = game.won()
    print()
    if not winner:
        print("Tied")
    else:
        print("%s has won" % winner)
</code></pre>
<p>But now we need to create a board representation and some checker to verify
that the game is over. To keep the board position, we can simply use a 2D
array. To check whether we have a winner, we need to check all the ways of
winning at tic-tac-toe. And to determine whether there is a tie, we need to
verify that the board is full. So here is the board class:</p>
<pre><code class="language-python">import copy

class Board:
    """simple tic-tac-toe board"""
    def __init__(self, board=None):
        if board:
            self.board = copy.deepcopy(board)
        else:
            self.board = [[' '] * 3 for _ in range(3)]
    def place(self, row, col, what):
        """produce a new board with row and col set to a symbol. Return None if
        some symbol already set."""
        if self.board[row][col] == ' ':
            newboard = Board(self.board)
            newboard[row][col] = what
            return newboard
    def __getitem__(self, key):
        return self.board[key]
    def __repr__(self):
        separator = "\n---+---+---\n "
        return " " + separator.join([" | ".join(row) for row in self.board])
    def spaces(self):
        """tell how many empty spots on the board"""
        return sum(1 for i in range(3) for j in range(3) if self[i][j] == ' ')
    def won(self):
        """check winner. Return the winner's symbol or None"""
        # check rows
        for row in self.board:
            if row[0] != ' ' and all(c == row[0] for c in row):
                return row[0]
        # check cols
        for n in range(3):
            if self.board[0][n] != ' ' and all(self.board[i][n] == self.board[0][n] for i in range(3)):
                return self.board[0][n]
        # check diag
        if self.board[0][0] != ' ' and all(self.board[n][n] == self.board[0][0] for n in range(3)):
            return self.board[0][0]
        if self.board[0][2] != ' ' and all(self.board[n][2-n] == self.board[0][2] for n in range(3)):
            return self.board[0][2]
</code></pre>
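<p>The winner check is the fiddliest part of the class, and it can be exercised
in isolation before wiring up any input handling. Below is the same row, column,
and diagonal logic restated as a standalone function with a few spot checks (my
restatement for testing, not code from the post):</p>

```python
def won(board):
    """return 'X' or 'O' if that symbol fills a line on a 3x3 list-of-lists board, else None"""
    lines = list(board)                                           # three rows
    lines += [[board[r][c] for r in range(3)] for c in range(3)]  # three columns
    lines += [[board[n][n] for n in range(3)],
              [board[n][2 - n] for n in range(3)]]                # two diagonals
    for line in lines:
        if line[0] != ' ' and all(cell == line[0] for cell in line):
            return line[0]
    return None

assert won([["X", "X", "X"], [" ", " ", " "], [" ", " ", " "]]) == "X"  # top row
assert won([["O", "X", " "], ["X", "O", " "], ["X", " ", "O"]]) == "O"  # diagonal
assert won([[" "] * 3 for _ in range(3)]) is None                       # empty board
```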
<p>We can now verify the board works by playing a human-vs-human game:</p>
<pre><code class="language-python">def play():
    "human play tic-tac-toe"
    game = Board()
    player = 'X'
    # loop until the game is done
    print(game)
    while game.spaces() and not game.won():
        opponent = 'O' if player == 'X' else 'X'
        while True:
            userin = input("Player %s, input coordinate (0-2, 0-2):" % player)
            nums = "".join(c if c.isdigit() else ' ' for c in userin).split()
            if len(nums) != 2:
                continue
            nums = [int(n) for n in nums]
            if not all(0 <= n <= 2 for n in nums):
                continue
            nextstep = game.place(nums[0], nums[1], player)
            if nextstep:
                game = nextstep
                break
        print()
        print("%s move:" % player)
        print(game)
        player = opponent
    winner = game.won()
    print()
    if not winner:
        print("Tied")
    else:
        print("%s has won" % winner)
</code></pre>
<p>and we can see it really works:</p>
<pre><code>   |   |
---+---+---
   |   |
---+---+---
   |   |
Player X, input coordinate (0-2, 0-2):1,1
X move:
   |   |
---+---+---
   | X |
---+---+---
   |   |
Player O, input coordinate (0-2, 0-2):1,1
Player O, input coordinate (0-2, 0-2):0,2
O move:
   |   | O
---+---+---
   | X |
---+---+---
   |   |
Player X, input coordinate (0-2, 0-2):0,3
Player X, input coordinate (0-2, 0-2):1,2
X move:
   |   | O
---+---+---
   | X | X
---+---+---
   |   |
Player O, input coordinate (0-2, 0-2):0,0
O move:
 O |   | O
---+---+---
   | X | X
---+---+---
   |   |
Player X, input coordinate (0-2, 0-2):1,0
X move:
 O |   | O
---+---+---
 X | X | X
---+---+---
   |   |
X has won
</code></pre>
<h2 id="first-step-of-ai-game-tree-search">First step of AI: Game tree search</h2>
<p>The old-school way of doing AI on such a board game is a game tree search.
The board class above is actually prepared for that. Imagine we have a position
and it is a particular player’s turn. All possible next positions can be
generated as follows:</p>
<pre><code class="language-python">next_steps = filter(None, [game.place(r, c, player) for r in range(3) for c in range(3)])
</code></pre>
<p>This contains only the legitimate next positions, i.e., we place only on an
empty box. Doing this recursively for each position generates a game tree with
a depth of 9 (because we have 9 spots to play). Below is an illustration from
Wikipedia:</p>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/d/da/Tic-tac-toe-game-tree.svg" alt="" /></p>
<p>The goal of playing the game is to win. At the <em>leaf nodes</em> of the
tree, we can determine whether a player has won, lost, or it is a draw. So we
can make an evaluation function to score the state of the end game:</p>
<pre><code class="language-python">def evaluate(board):
    """simple evaluator: +10, -10 for someone won, 0 for tie, None for all other"""
    winner = board.won()
    if winner == "X":
        return 10
    elif winner == "O":
        return -10
    if not board.spaces():
        return 0
</code></pre>
<p>Now we move on to the core of searching the game tree. The idea is that
every player plays to her advantage. If one of two moves is sure to lose, the
player must avoid it. This is trivial one level above a leaf node of the game
tree: at that level we can find the worst possible outcome of each option, and
the player is supposed to <em>minimize the worst possible score</em>. Then we
have the <em>minimax algorithm</em>: at each turn, players minimize the maximum
loss, and the loss is computed recursively down to the leaf nodes. So here we
have our code:</p>
<pre><code class="language-python">import random

COUNT = 0
PLAYERS = ["X", "O"]

def minimax(board, player):
    """player to move one step on the board, find the minimax (best of the worst case) score"""
    global COUNT
    COUNT += 1
    opponent = "O" if player == "X" else "X"
    value = evaluate(board)
    if value is not None:
        return value  # exact score of the board, at leaf node
    # possible moves: the worst-case scores of the different options
    candscores = [minimax(b, opponent) for b in [board.place(r, c, player) for r in range(3) for c in range(3)] if b]
    # evaluate the best of the worst-case scores
    if player == "X":
        return max(candscores)
    else:
        return min(candscores)

def play():
    "auto play tic-tac-toe"
    global COUNT
    minimizer = True
    game = Board()
    # loop until the game is done
    while not game.won():
        player = PLAYERS[minimizer]
        opponent = PLAYERS[not minimizer]
        COUNT = 0
        candidates = [(b, minimax(b, opponent)) for b in [game.place(r, c, player) for r in range(3) for c in range(3)] if b]
        if not candidates:
            break
        random.shuffle(candidates)
        # find best move: optimizing the worst-case score
        if player == "X":
            game = max(candidates, key=lambda pair: pair[1])[0]
        else:
            game = min(candidates, key=lambda pair: pair[1])[0]
        # print board and switch
        minimizer = not minimizer
        print()
        print("%s move after %d search steps:" % (player, COUNT))
        print(game)
    winner = game.won()
    print()
    if not winner:
        print("Tied")
    else:
        print("%s has won" % winner)
</code></pre>
<p>We defined the evaluation function in such a way that an “X” win has a
positive score and an “O” win has a negative score. Therefore player “X” tries
to maximize the score while player “O” tries to minimize it. Hence we call them
the maximizer and minimizer respectively, and a node of the game tree where
player “X” is to move a <em>maximizer node</em> and the other kind a
<em>minimizer node</em>.</p>
<p>In the functions above, the maximizer maximizes the potential score among
all possible next steps, and vice versa. The function <code>minimax()</code>
finds such a minimax score for a player. So when it is the maximizer, we
compute the minimax score of the minimizer on each possible next position. The
<code>minimax()</code> function, in turn, computes using the minimizer’s next
positions, recursing in a similar fashion down to the leaf nodes. The game in
this form goes like the following:</p>
<pre><code>O move after 549945 search steps:
   |   |
---+---+---
   |   | O
---+---+---
   |   |
X move after 63904 search steps:
   |   | X
---+---+---
   |   | O
---+---+---
   |   |
O move after 8751 search steps:
   |   | X
---+---+---
   |   | O
---+---+---
 O |   |
X move after 1456 search steps:
 X |   | X
---+---+---
   |   | O
---+---+---
 O |   |
O move after 205 search steps:
 X | O | X
---+---+---
   |   | O
---+---+---
 O |   |
X move after 60 search steps:
 X | O | X
---+---+---
   | X | O
---+---+---
 O |   |
O move after 13 search steps:
 X | O | X
---+---+---
   | X | O
---+---+---
 O |   | O
X move after 4 search steps:
 X | O | X
---+---+---
   | X | O
---+---+---
 O | X | O
O move after 1 search steps:
 X | O | X
---+---+---
 O | X | O
---+---+---
 O | X | O
Tied
</code></pre>
<p>The code above intentionally keeps a counter <code>COUNT</code> to see how
efficient the search is. And we randomize the possible moves at each step to
break ties when multiple next steps have the same minimax score. Indeed, the
game in this form is really slow. One way to see it: a tic-tac-toe board has 9
boxes and each box can be “X”, “O”, or blank, so there can only be
<script type="math/tex">3^9 = 19683</script> possible positions on the board. But we searched 549945
positions on the first move. This is because we searched a lot of duplicated
positions, as the same position can be reached by different orders of moves,
and the nodes of the game tree have a lot of repetitions.</p>
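<p>We can corroborate that count with a quick standalone tally of the full game
tree: each <code>minimax()</code> call corresponds to one non-root node, so the
tree should have 549945 nodes plus the root. A minimal counter (my sketch,
independent of the <code>Board</code> class, using a 9-character string as the
position):</p>

```python
def count_nodes(cells, player):
    """count all nodes of the game tree rooted at `cells`, stopping at wins and full boards"""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6), (1, 4, 7),
             (2, 5, 8), (0, 4, 8), (2, 4, 6)]
    if any(cells[a] != ' ' and cells[a] == cells[b] == cells[c] for a, b, c in lines):
        return 1  # someone has already won: leaf node
    empties = [i for i, c in enumerate(cells) if c == ' ']
    if not empties:
        return 1  # full board: leaf node (a tie)
    nxt = 'O' if player == 'X' else 'X'
    # one node for this position, plus the subtree of every legal move
    return 1 + sum(count_nodes(cells[:i] + player + cells[i+1:], nxt) for i in empties)

print(count_nodes(' ' * 9, 'X'))  # 549946: the 549945 searched positions plus the root
```

The count does not depend on which player moves first, so it matches the transcript above where “O” happened to start.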
<h2 id="alpha-beta-pruning">Alpha-beta pruning</h2>
<p>The game tree of a game as simple as tic-tac-toe can have orders of
magnitude more nodes than there are possible positions in the game. If we work
on a more complicated game, the game tree can easily become intractable.
Therefore, we should avoid searching the whole game tree.</p>
<p>Half a century ago, people invented the <em>alpha-beta pruning</em>
algorithm to avoid the parts of the game tree that are known to be
uninteresting. The idea is not hard to understand. Imagine we are at a
maximizer node with a number of possible next moves. We check them one by one
for their minimax scores to get an idea of what we can do. So on the first next
move, we evaluate the minimax score on behalf of a minimizer. On the second
next move, we expect a higher score than what we got from the previous
evaluation. However, the minimizer below prefers the lower score. So we can let
the minimax function know that whenever the minimizer sees an option scoring
lower than the previous score, we can stop (<em>prune the game tree</em>) at
this minimizer node, since this minimizer node will never be the choice of the
maximizer node one level above. A similar idea applies to searching on a
minimizer node. Applying this recursively, we have the alpha-beta search.</p>
<p>Implementing this idea:</p>
<pre><code class="language-python">def alphabeta(board, player, alpha=-float("inf"), beta=float("inf")):
    """minimax with alpha-beta pruning. It implies that we expect the score to
    be between the lower bound alpha and the upper bound beta to be useful
    """
    global COUNT
    COUNT += 1
    opponent = "O" if player == "X" else "X"
    value = evaluate(board)
    if value is not None:
        return value  # exact score of the board (terminal nodes)
    # minimax search with alpha-beta pruning
    children = filter(None, [board.place(r, c, player) for r in range(3) for c in range(3)])
    if player == "X":  # player is maximizer
        value = -float("inf")
        for child in children:
            value = max(value, alphabeta(child, opponent, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cut-off
    else:  # player is minimizer
        value = float("inf")
        for child in children:
            value = min(value, alphabeta(child, opponent, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break  # alpha cut-off
    return value
</code></pre>
<p>As a convention, we call the lower and upper bounds of the minimax score as
learned so far <script type="math/tex">\alpha</script> and
<script type="math/tex">\beta</script> respectively. They start at negative and
positive infinity and are narrowed down as the alpha-beta search proceeds: we
move up the lower bound at maximizer nodes and move down the upper bound at
minimizer nodes, as this is what minimax is about. Whenever we have
<script type="math/tex">\alpha \ge \beta</script>, we can prune the branch. We
call the pruning at a maximizer node the beta cut-off and that at a minimizer
node the alpha cut-off.</p>
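<p>To see the cut-offs in action without the full board machinery, here is a
toy two-level tree (my illustration, not from the post): a maximizer root over
three minimizer nodes with plain leaf scores. Plain minimax visits all 10
nodes; alpha-beta gets the same answer while skipping two leaves:</p>

```python
from math import inf

TREE = [[3, 5], [2, 9], [0, 7]]  # maximizer root; each sublist is a minimizer node
VISITS = [0]                      # mutable counter shared by both searches

def minimax(node, maximizer):
    VISITS[0] += 1
    if isinstance(node, int):     # leaf: the score itself
        return node
    values = [minimax(child, not maximizer) for child in node]
    return max(values) if maximizer else min(values)

def alphabeta(node, maximizer, alpha=-inf, beta=inf):
    VISITS[0] += 1
    if isinstance(node, int):
        return node
    if maximizer:
        value = -inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cut-off: the minimizer above would never come here
    else:
        value = inf
        for child in node:
            value = min(value, alphabeta(child, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break  # alpha cut-off
    return value

VISITS[0] = 0
assert minimax(TREE, True) == 3 and VISITS[0] == 10   # every node visited
VISITS[0] = 0
assert alphabeta(TREE, True) == 3 and VISITS[0] == 8  # two leaves pruned, same answer
```

After the first minimizer child settles on 3, the probe of <code>[2, 9]</code> stops as soon as the leaf 2 is seen, because the root already has a move worth 3.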
<p>Replacing the previous <code>minimax()</code> function with this one is much
faster, as fewer nodes are searched:</p>
<pre><code>O move after 30709 search steps:
   |   |
---+---+---
   |   | O
---+---+---
   |   |
X move after 9785 search steps:
   |   | X
---+---+---
   |   | O
---+---+---
   |   |
O move after 1589 search steps:
   |   | X
---+---+---
   |   | O
---+---+---
 O |   |
X move after 560 search steps:
 X |   | X
---+---+---
   |   | O
---+---+---
 O |   |
O move after 121 search steps:
 X | O | X
---+---+---
   |   | O
---+---+---
 O |   |
X move after 53 search steps:
 X | O | X
---+---+---
   | X | O
---+---+---
 O |   |
O move after 13 search steps:
 X | O | X
---+---+---
   | X | O
---+---+---
 O |   | O
X move after 4 search steps:
 X | O | X
---+---+---
   | X | O
---+---+---
 O | X | O
O move after 1 search steps:
 X | O | X
---+---+---
 O | X | O
---+---+---
 O | X | O
Tied
</code></pre>
<h2 id="performance-improvement">Performance improvement</h2>
<p>There are a few areas we can improve the program to make it faster.</p>
<p>Firstly, we modify the <code>Board</code> class as below; this will be very
useful later. We do not want to use a 2D array any more. Instead, we use a
bitboard: a bit vector representing the board position. As there are two
players and nine boxes, we can use 18 bits to represent all positions, the 9
MSBs for player “X” and the 9 LSBs for player “O”. It is less convenient when
we want to mark a box, but in return, handling a single integer is much faster
than a 2D array.</p>
<p>Secondly, we use +1 and -1 instead of “X” and “O” in the code, as we are now
using a bitboard. We convert them into symbols only when we need to print. The
benefit is that it is now easier to distinguish the maximizer from the
minimizer, simply by comparing the sign.</p>
<pre><code class="language-python">import itertools
from gmpy import popcount

PLAYERS = [1, -1]  # maximizer == 1
COORDS = [(r, c) for r in range(3) for c in range(3)]

def symbol(code):
    """Return the symbol of player"""
    assert code in PLAYERS
    return "X" if code == 1 else "O"

def grouper(iterable, n, fillvalue=None):
    # https://docs.python.org/3.7/library/itertools.html
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

class Board:
    """bit-vector based tic-tac-toe board"""
    def __init__(self, board=0):
        self.board = board
    def mask(self, row, col, who):
        """Produce the bitmask for row and col
        The 18-bit vector is row-major, with matrix cell (0,0) the MSB. And the
        higher 9-bit is for 1 (X) and lower 9-bit is for -1 (O)
        Args:
            row, col: integers from 0 to 2 inclusive
        """
        offset = 3*(2-row) + (2-col)
        if who == 1:
            offset += 9
        return 1 << offset
    def place(self, row, col, what: int):
        """produce a new board with row and col set to a symbol. Return None if
        some symbol already set.
        Args:
            what: either +1 or -1
        """
        assert what in PLAYERS
        mask = self.mask(row, col, what)
        checkmask = self.mask(row, col, -what)
        if (mask | checkmask) & self.board:
            return None  # something already on this box
        return Board(self.board | mask)
    def __repr__(self):
        def emit():
            omask = 1 << 8
            xmask = omask << 9
            while omask:  # until the mask becomes zero
                yield "O" if self.board & omask else "X" if self.board & xmask else " "
                omask >>= 1
                xmask >>= 1
        separator = "\n---+---+---\n "
        return " " + separator.join(" | ".join(g) for g in grouper(emit(), 3))
    def spaces(self):
        """tell how many empty spots on the board"""
        # alternative if no gmpy: bin(self.board).count("1")
        return 9 - popcount(self.board)
    masks = (0b000000111, 0b000111000, 0b111000000,  # rows
             0b001001001, 0b010010010, 0b100100100,  # cols
             0b100010001, 0b001010100  # diags
            )
    def won(self):
        """check winner. Return the winner's code (+1 or -1) or None"""
        shifted = self.board >> 9
        for mask in self.masks:
            if self.board & mask == mask:
                return -1  # O occupies the low 9 bits
            if shifted & mask == mask:
                return 1   # X occupies the high 9 bits

# evaluate() needs updating too: won() now returns +1/-1 instead of "X"/"O"
def evaluate(board):
    """simple evaluator: +10 or -10 for someone won, 0 for tie, None for all other"""
    winner = board.won()
    if winner:
        return winner * 10
    if not board.spaces():
        return 0
</code></pre>
<p>In the <code>spaces()</code> method above, we use the popcount function from
<a href="https://pypi.org/project/gmpy/">gmpy</a> as it is native and fast.
Otherwise, we can use the function below as an alternative:</p>
<pre><code class="language-python">def popcount(n):
    return bin(n).count("1")
</code></pre>
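<p>One more note on this class: the killer-heuristic alpha-beta and the
negascout code later in this post call <code>board.check(r, c, player)</code>
and a mask-based <code>board.place(mask)</code>, which the class above does not
define. They are presumably small extensions; here is a plausible standalone
sketch on the raw 18-bit integer (my reconstruction, so the exact names and
behaviour are assumptions):</p>

```python
def check(board: int, row: int, col: int, who: int):
    """return the move bitmask for (row, col) if the box is free, else None
    `board` is the 18-bit position integer; `who` is +1 (X) or -1 (O)"""
    offset = 3 * (2 - row) + (2 - col)             # cell (2,2) is bit 0, (0,0) is bit 8
    both = (1 << offset) | (1 << (offset + 9))     # the O bit and the X bit of this box
    if board & both:
        return None                                # box already taken by either side
    return 1 << (offset + 9) if who == 1 else 1 << offset

def place(board: int, mask: int) -> int:
    """return a new position integer with the move bit set"""
    return board | mask

assert check(0, 0, 0, 1) == 1 << 17       # X at top-left is the MSB
assert check(1 << 17, 0, 0, -1) is None   # box already occupied by X
assert place(0, check(0, 2, 2, -1)) == 1  # O at bottom-right is bit 0
```

Returning the mask from the check lets the caller remember the move itself, which is exactly what the killer heuristic below needs.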
<p>Secondly, we can consider memoizing the minimax function. In the AI
literature, this is called a transposition table. It is possible because our
minimax function is deterministic and depends only on the board position and
the player. It would be harder if the function also depended on the depth of
the game tree (which is usually the case in chess) or if the evaluation result
were not deterministic (e.g., it depends on some heuristic or some guesswork is
involved). Simple as it is, this can greatly improve performance even on a full
game tree search:</p>
<pre><code class="language-python">CACHE = {}
COUNT = 0

def simple_minimax(board, player):
    """player to move one step on the board, find the minimax (best of the worst case) score"""
    # check cache for quick return
    if (board.board, player) in CACHE:
        return CACHE[(board.board, player)]
    global COUNT
    COUNT += 1
    opponent = -player
    value = evaluate(board)
    if value is not None:
        return value  # exact score of the board
    # the worst-case scores of the different options
    candscores = [simple_minimax(b, opponent) for b in [board.place(r, c, player) for r, c in COORDS] if b]
    # evaluate the best of the worst-case scores
    if player == 1:
        value = max(candscores)
    else:
        value = min(candscores)
    # save into cache
    CACHE[(board.board, player)] = value
    return value
</code></pre>
<p>Here we see why a bitboard is beneficial: It is much easier to use two integers
as the key to the dictionary <code>CACHE</code>. The performance improvement is significant:</p>
<pre><code>O move after 7381 search steps:
   |   |
---+---+---
   |   | O
---+---+---
   |   |
X move after 0 search steps:
   |   | X
---+---+---
   |   | O
---+---+---
   |   |
O move after 0 search steps:
   |   | X
---+---+---
   |   | O
---+---+---
 O |   |
X move after 0 search steps:
 X |   | X
---+---+---
   |   | O
---+---+---
 O |   |
O move after 0 search steps:
 X | O | X
---+---+---
   |   | O
---+---+---
 O |   |
X move after 0 search steps:
 X | O | X
---+---+---
   | X | O
---+---+---
 O |   |
O move after 0 search steps:
 X | O | X
---+---+---
   | X | O
---+---+---
 O |   | O
X move after 0 search steps:
 X | O | X
---+---+---
   | X | O
---+---+---
 O | X | O
O move after 1 search steps:
 X | O | X
---+---+---
 O | X | O
---+---+---
 O | X | O
Tied
</code></pre>
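<p>As an aside not in the original post: when the cache key is hashable, like
our (integer board, player) pair, Python’s <code>functools.lru_cache</code>
provides the same transposition-table effect without writing the dictionary
plumbing by hand. A minimal demonstration of the idea on a toy recursion:</p>

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def fib(n):
    """naive recursion; lru_cache turns repeated subproblems into lookups"""
    global calls
    calls += 1               # counts only actual evaluations, not cache hits
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30), calls)        # 832040 31: one evaluation per distinct n
```

One caveat for the game search: like the `CACHE` dictionary above, this only works because the score depends on nothing but the arguments; it would be wrong for the alpha-beta function, whose return value also depends on the window.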
<p>Thirdly, there are some standard practices to improve alpha-beta search. Two
of them are <em>heuristic improvement</em> and the <em><a href="https://en.wikipedia.org/wiki/Killer_heuristic">killer heuristic</a></em>.</p>
<p>Heuristic improvement means reordering the children of a node before doing
the alpha-beta search. Remember that alpha-beta search checks one child node at
a time and narrows the bounds iteratively. If the best option comes first,
pruning happens more often and the search is faster.</p>
<p>The killer heuristic has a similar idea: if a certain move caused pruning in
the past, it is believed that the same move will cause pruning again in another
similar position.</p>
<p>The former is a bit of an art. Indeed, a lot of research has been done to
find better evaluation functions for the positions of a particular game. If we
had a universally correct evaluation function that could tell whether one
position is better than another, we would not even need a game tree search; we
could just pick the best next step every time according to this function.
Fortunately, tic-tac-toe is a game simple enough that we have such a function:</p>
<pre><code class="language-python">def heuristic_evaluate(board):
    """heuristic evaluation from <http://www.ntu.edu.sg/home/ehchua/programming/java/javagame_tictactoe_ai.html>"""
    score = 0
    for mask in Board.masks:
        # 3-in-a-row == score 100
        # 2-in-a-row == score 10
        # 1-in-a-row == score 1
        # 0-in-a-row, or mixed entries == score 0 (no chance for either to win)
        # X == positive, O == negative
        oboard = board.board
        xboard = oboard >> 9
        countx = popcount(xboard & mask)
        counto = popcount(oboard & mask)
        if countx == 0:
            score -= int(10**(counto-1))
        elif counto == 0:
            score += int(10**(countx-1))
    return score
</code></pre>
<p>The latter does not need a great mind to craft such an artistic function. We
just need to remember what caused the last cut-off. Research has shown that
remembering the last two cut-offs instead of one performs better (the power of
two random choices?). Thus we can use a <code>deque()</code> to implement the
memory.</p>
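<p>Incidentally, <code>collections.deque</code> takes a <code>maxlen</code>
argument that evicts the oldest entry automatically on append, which could
replace the manual <code>popleft()</code> bookkeeping in the code below. A
quick demonstration of the behavior:</p>

```python
from collections import deque

killers = deque(maxlen=2)  # bounded memory: only the 2 most recent entries survive
for move in ["a", "b", "c", "d"]:
    killers.append(move)   # appending past maxlen silently drops the oldest

print(list(killers))       # ['c', 'd']
```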
<p>These two techniques are implemented in the alpha-beta search below. We can
modify the conditions of the <code>if</code> statements to turn the techniques
on or off:</p>
<pre><code class="language-python">from collections import deque

KILLERS = deque()

def alphabeta(board, player, alpha=-float("inf"), beta=float("inf")):
    """minimax with alpha-beta pruning. It implies that we expect the score to
    be between the lower bound alpha and the upper bound beta to be useful
    """
    if False and "Use cache":
        # make alpha-beta with memory: interferes with killer heuristics
        if (board.board, player) in CACHE:
            return CACHE[(board.board, player)]
    global COUNT
    COUNT += 1
    assert player in PLAYERS
    opponent = -player
    value = evaluate(board)
    if value is not None:
        return value  # exact score of the board (terminal nodes)
    # minimax search with alpha-beta pruning
    masks = filter(None, [board.check(r, c, player) for r, c in COORDS])
    children = [(mask, board.place(mask)) for mask in masks]
    if False and "Heuristic improvement":
        # sort by a heuristic function to hint for earlier cut-off
        children = sorted(children, key=lambda pair: heuristic_evaluate(pair[1]), reverse=True)
    if "Killer heuristic":
        # remember the moves that caused the last few beta cut-offs and check those first
        # <https://en.wikipedia.org/wiki/Killer_heuristic>
        children = sorted(children, key=lambda pair: pair[0] not in KILLERS)
    if player == 1:  # player is maximizer
        value = -float("inf")
        for mask, child in children:
            value = max(value, alphabeta(child, opponent, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                KILLERS.append(mask)
                if len(KILLERS) > 4:
                    KILLERS.popleft()
                break  # beta cut-off
    else:  # player is minimizer
        value = float("inf")
        for _, child in children:
            value = min(value, alphabeta(child, opponent, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break  # alpha cut-off
    # save into cache
    if False and "Use cache":
        CACHE[(board.board, player)] = value
    return value
</code></pre>
<p>For a game as simple as tic-tac-toe, these improvements are, unfortunately,
not significant in time saved. But they have proved effective in larger games
like chess. The reason is that the game tree of tic-tac-toe is shallow enough
that the overhead of the extra work outweighs the benefit. However, they do
visibly reduce the number of nodes to search. Below is the result of using only
the killer heuristic, without memoization or heuristic improvement, as in the
code above. The nodes searched on the first move drop from 30709 to 21667:</p>
<pre><code>O move after 21667 search steps:
   |   |
---+---+---
   |   | O
---+---+---
   |   |
X move after 7169 search steps:
   |   | X
---+---+---
   |   | O
---+---+---
   |   |
O move after 1514 search steps:
   |   | X
---+---+---
   |   | O
---+---+---
 O |   |
X move after 532 search steps:
 X |   | X
---+---+---
   |   | O
---+---+---
 O |   |
O move after 121 search steps:
 X | O | X
---+---+---
   |   | O
---+---+---
 O |   |
X move after 53 search steps:
 X | O | X
---+---+---
   | X | O
---+---+---
 O |   |
O move after 13 search steps:
 X | O | X
---+---+---
   | X | O
---+---+---
 O |   | O
X move after 4 search steps:
 X | O | X
---+---+---
   | X | O
---+---+---
 O | X | O
O move after 1 search steps:
 X | O | X
---+---+---
 O | X | O
---+---+---
 O | X | O
Tied
</code></pre>
<h2 id="principal-variation-search--negascout">Principal variation search / NegaScout</h2>
<p>There is yet another technique to improve on alpha-beta pruning. Notice that
alpha-beta pruning keeps a bound <script type="math/tex">[\alpha, \beta]</script> on the
expected minimax value, and whenever the searched value falls outside this
bound, the branch is pruned. So if we have a very tight bound, we can prune
more often and the game tree to search becomes smaller. This is the idea of
<a href="https://en.wikipedia.org/wiki/Principal_variation_search">principal
variation search</a>, which also goes by other names such as NegaScout or
MTD(f). Strictly speaking, they have subtle differences in implementation but
share the same philosophy.</p>
<p>When we use this technique on a node of the game tree, we first search the
first child node for a value using an ordinary alpha-beta search. Then we check
the subsequent child nodes with a <em>zero window</em>. A zero window causes
the branch either to be pruned quickly or to fail high on a maximizer node (or
fail low on a minimizer node). In the latter case, we know a tighter bound can
be found, and we perform the alpha-beta search again.</p>
<p>This, again, poses some overhead on the game tree search, as we might need
to search a child node twice: once with the zero window and once with a larger
alpha-beta window. The implementation is as follows, but it turns out not to be
worthwhile (either in the number of nodes searched or in the time taken) in a
game tree as shallow as tic-tac-toe’s.</p>
<pre><code class="language-python">def negascout(board, player, alpha=-float("inf"), beta=float("inf")) -> float:
    """minimax with zero-window scouting and alpha-beta pruning. It implies that
    we expect the score to be between the lower bound alpha and the upper bound
    beta to be useful
    """
    global COUNT
    COUNT += 1
    assert player in PLAYERS
    opponent = -player
    value = evaluate(board)
    if value is not None:
        return value  # exact score of the board (terminal nodes)
    # negascout with zero window and alpha-beta pruning
    masks = filter(None, [board.check(r, c, player) for r, c in COORDS])
    children = [(mask, board.place(mask)) for mask in masks]
    # first child: alpha-beta search to find a lower/upper bound on the value
    bound = negascout(children[0][1], opponent, alpha, beta)
    if player == 1:  # player is maximizer, bound is a lower bound
        if bound >= beta:
            return bound  # beta cut-off
        # subsequent children: zero window at the lower bound
        for mask, child in children[1:]:
            t = negascout(child, opponent, bound, bound+1)
            if t > bound:  # failed high, tighter lower bound found
                if t >= beta:
                    bound = t
                else:
                    bound = negascout(child, opponent, t, beta)  # re-search for real value
            if bound >= beta:
                return bound  # beta cut-off
    else:  # player is minimizer, bound is an upper bound
        if bound <= alpha:
            return bound  # alpha cut-off
        # subsequent children: zero window at the upper bound
        for mask, child in children[1:]:
            t = negascout(child, opponent, bound-1, bound)
            if t < bound:  # failed low, tighter upper bound found
                if t <= alpha:
                    bound = t
                else:
                    bound = negascout(child, opponent, alpha, t)  # re-search for real value
            if bound <= alpha:
                return bound  # alpha cut-off
    return bound
</code></pre>
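<p>The zero-window idea can be isolated on a toy tree (my illustration, not
code from the post). The first child gets a full-window alpha-beta search; each
sibling is first asked only “are you better than the current best?” with a
window of width one, and re-searched only on a fail-high:</p>

```python
from math import inf

TREE = [[3, 5], [2, 9], [0, 7]]  # maximizer root over three minimizer nodes

def alphabeta(node, maximizer, alpha=-inf, beta=inf):
    if isinstance(node, int):    # leaf: the score itself
        return node
    if maximizer:
        value = -inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break            # beta cut-off
    else:
        value = inf
        for child in node:
            value = min(value, alphabeta(child, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break            # alpha cut-off
    return value

# first child: full-window search gives the baseline value
best = alphabeta(TREE[0], False)
# siblings: a zero window (best, best+1) only asks "better than best?"
for sibling in TREE[1:]:
    t = alphabeta(sibling, False, best, best + 1)
    if t > best:                 # fail-high: worth a re-search with a real window
        best = alphabeta(sibling, False, t, inf)

print(best)  # 3: neither sibling beat the first child, so no re-search happened
```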
<h2 id="monte-carlo-tree-search">Monte-Carlo tree search</h2>
<p>Above we discussed the alpha-beta search and a few of its variations, all
attempts to narrow down the scope of the search on the game tree.</p>
<p>There is another way to save time, based on a totally different idea.
Suppose we are at a node on a particular player’s turn, and we still want to
minimize our maximum loss. On each child node, we can pretend to play the game
to the end with random moves, repeat this many times, and count how often we
win or lose. Then we pick the next step that gives the lowest losing
percentage. This is a Monte Carlo search on the game tree. The code is
surprisingly simple:</p>
<pre><code class="language-python">import random

def mcts(board, player):
    """monte carlo tree search
    Returns:
        the fraction of the random playouts that the player wins
    """
    N = 500  # number of rounds to search
    count = 0  # count the number of wins
    for _ in range(N):
        step = Board(board.board)
        who = player
        while step.spaces():
            r, c = random.choice(COORDS)
            nextstep = step.place(r, c, who)
            if nextstep is not None:
                who = -who  # next player's turn
                step = nextstep
                if step.won():  # someone won
                    break
        if step.won() == player:
            count += 1
    return count / N

def play():
    "auto play tic-tac-toe"
    minimizer = True
    game = Board()
    # loop until the game is done
    while not game.won():
        player = PLAYERS[minimizer]
        opponent = PLAYERS[not minimizer]
        candidates = [(b, mcts(b, opponent)) for b in [game.place(r, c, player) for r, c in COORDS] if b]
        if not candidates:
            break
        random.shuffle(candidates)
        # find best move: min opponent's score
        game, score = min(candidates, key=lambda pair: pair[1])
        # print board and switch
        minimizer = not minimizer
        print()
        print("%s move on score %f:" % (symbol(player), score))
        print(game)
    winner = game.won()
    print()
    if not winner:
        print("Tied")
    else:
        print("%s has won" % symbol(winner))
</code></pre>
<p>The <code>while</code> loop in the function <code>mcts()</code> stops only
when the game ends. The function counts how many times the player wins among
the <code>N</code> repetitions. When we play with MCTS, we try to minimize the
percentage of games the opponent wins, and we no longer have the distinction
between maximizer and minimizer nodes.</p>
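<p>A sanity check on the rollout idea (my sketch, separate from the
<code>mcts()</code> code above): under uniformly random play by both sides,
tic-tac-toe is reportedly won by the first player about 58% of the time (treat
that exact figure as an assumption), so a rollout estimator should land near
that. A self-contained estimate:</p>

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6), (1, 4, 7),
         (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(cells):
    for a, b, c in LINES:
        if cells[a] != ' ' and cells[a] == cells[b] == cells[c]:
            return cells[a]
    return None

def random_game(rng):
    """both sides play uniformly random legal moves; return 'X', 'O' or None (draw)"""
    cells = [' '] * 9
    order = list(range(9))
    rng.shuffle(order)          # a uniform random move order is a uniform random game
    player = 'X'
    for pos in order:
        cells[pos] = player
        if winner(cells):
            return player
        player = 'O' if player == 'X' else 'X'
    return None

rng = random.Random(42)         # fixed seed so the estimate is reproducible
games = [random_game(rng) for _ in range(20000)]
print(round(games.count('X') / len(games), 3))  # typically around 0.58
```

This is why the transcripts above show such lopsided scores for the side that is about to lose: random playouts already expose a strong bias.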
<p>In a small game tree of tic-tac-toe, MCTS performs well:</p>
<pre><code>O move on score 0.188000:
   |   |
---+---+---
   | O |
---+---+---
   |   |
X move on score 0.628000:
   |   |
---+---+---
   | O |
---+---+---
   |   | X
O move on score 0.154000:
   |   |
---+---+---
   | O | O
---+---+---
   |   | X
X move on score 0.508000:
   |   |
---+---+---
 X | O | O
---+---+---
   |   | X
O move on score 0.000000:
   |   |
---+---+---
 X | O | O
---+---+---
 O |   | X
X move on score 0.338000:
   |   | X
---+---+---
 X | O | O
---+---+---
 O |   | X
O move on score 0.000000:
   |   | X
---+---+---
 X | O | O
---+---+---
 O | O | X
X move on score 0.000000:
   | X | X
---+---+---
 X | O | O
---+---+---
 O | O | X
O move on score 0.000000:
 O | X | X
---+---+---
 X | O | O
---+---+---
 O | O | X
Tied
</code></pre>
<p>Of course, playing the game randomly may not be a good idea. If we know the
probability with which our opponent would play each move, we can weight the
playouts accordingly. This is indeed the idea behind modern game-playing AI,
and finding such a probability vector is the state of the art. But the above is
pretty much all we had from the last century.</p>
<p>Tic-tac-toe has never been an interesting problem for
<a href="https://xkcd.com/1002/">research</a>. Even <a href="https://xkcd.com/832/">xkcd</a> can give
you a solution on how to play the game:</p>
<p><img src="http://imgs.xkcd.com/comics/tic_tac_toe_large.png" alt="" /></p>
<p>All the code above is in the following repository: <a href="https://github.com/righthandabacus/tttai">https://github.com/righthandabacus/tttai</a></p>
<h1 id="dnw16-deepchess">David et al (2016) DeepChess: End-to-end deep neural network for automatic learning in chess</h1>
<p>2019-03-13</p>
<p>The goal of the paper is to derive the evaluation function for chess from
scratch using machine learning techniques. From scratch means not even feeding
the rules of chess to the evaluation function.</p>
<p>The evaluation function for chess usually takes a chess position as input and
produces a score as output. By convention, the score is from white’s perspective,
and it is traditionally a linear combination of selected features of the position.</p>
<p>The paper uses a neural network, and its training method is as follows: the model
receives two positions as input and learns to predict which position
is better (i.e., the output is binary). The data is from CCRL
(www.computerchess.org.uk/ccrl). There are 640K chess games, of which 221695 white
won and 164387 black won. The authors randomly extract 10 positions from each
game, such that the positions are not from the first 5 moves and are not captures. Each
position is converted into 773 bits:</p>
<ul>
<li>bitboard representation: two sides, six piece types, 64 squares = <script type="math/tex">2\times 6\times 64=768</script> bits</li>
<li>additional 5 bits of state: which side to move (white = 1), ability to castle
(black and white, king- and queen- side castling)</li>
</ul>
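<p>As a sketch (not the authors’ code), this encoding can be written as below,
assuming the position is given as a mapping from (side, piece type) to the set
of occupied square indices 0 to 63 (the mapping and names are my own):</p>
<pre><code class="language-python">SIDES = ('white', 'black')
PIECES = ('pawn', 'knight', 'bishop', 'rook', 'queen', 'king')

def encode_position(occupied, white_to_move, castling):
    """Convert a position to the 773-bit vector described above.

    occupied: dict mapping (side, piece) to a set of square indices 0..63
    white_to_move: bool
    castling: 4 bools (white king-side, white queen-side,
                       black king-side, black queen-side)
    """
    bits = []
    for side in SIDES:
        for piece in PIECES:            # one 64-bit board per (side, piece)
            squares = occupied.get((side, piece), set())
            bits.extend(1 if sq in squares else 0 for sq in range(64))
    bits.append(1 if white_to_move else 0)   # side to move
    bits.extend(1 if c else 0 for c in castling)
    assert len(bits) == 2 * 6 * 64 + 5       # = 773
    return bits
</code></pre>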
<p>One random position each, from a game that white won and from a game that black
won, is paired up as training data. With 1M positions on each side, and counting
both orders within a pair, there are <script type="math/tex">2\times 10^{12}</script>
possible training pairs.</p>
<p>The neural network has two stages: The Pos2Vec stage is a deep autoencoder
network used as a nonlinear feature extractor. We expect that it converts a
chess position to a vector of values of high-level features. The Pos2Vec
network has <em>five</em> layers of size 773-600-400-200-100, using rectified
linear units (ReLU) with no regularization.</p>
<p>The second stage is DeepChess, which sits on top of two side-by-side Pos2Vec
networks; its output layer has 2 softmax values to predict which of the two sides
will win. The DeepChess network has 200 inputs (100 each from the two Pos2Vec
networks) and has <em>four</em> layers of size 400-200-100-2, using ReLU with no
regularization, to compare the features of the positions from the two disjoint
Pos2Vec networks and determine which one is better.</p>
<p>The Pos2Vec network is trained for 200 epochs over 2M positions, of which 1M are
white wins and 1M are black wins. The network is trained layer by layer as an
autoencoder: first 773-600-773, then 600-400-600, and so on to complete the five
layers. The learning rate starts from 0.005 and is multiplied by 0.98 at the end of
each epoch.</p>
<p>The DeepChess network is trained with supervision. It uses the previously trained
Pos2Vec network as the initial weights, and after adding the four layers on top of
the two Pos2Vec networks, the whole network is trained again with 1M random input
pairs for 100 epochs. There are 100K positions each from white wins and black wins
to serve as the validation set. Cross entropy is used as the loss. The learning rate
starts from 0.01 and is multiplied by 0.99 after each epoch.</p>
<p>No regularization is needed, as the authors claim, because there are orders of
magnitude more potential training pairs than the ones used, so new training
samples can be drawn in each epoch.</p>
<p>The network is found to produce an accuracy of 98.2% on training data and 98.0% on
validation data.</p>
<p>Figure 1 on page 3 of the paper shows the diagram of the network.</p>
<p>The paper also proposes some possible improvements. The alternative
configuration of 773-100-100-100 for Pos2Vec, and 100-100-2 instead of 4 layers
for DeepChess, is a form of network distillation (using a smaller network). This
sacrifices a bit of accuracy.</p>
<p>To use the neural network, we need a chess engine that does alpha-beta search.
But instead of computing a numerical value for each position, we store whole
positions in <script type="math/tex">\alpha_{pos}</script> and <script type="math/tex">\beta_{pos}</script>. At each new position,
we compare it with the existing <script type="math/tex">\alpha_{pos}</script> and <script type="math/tex">\beta_{pos}</script> to check:</p>
<ul>
<li>if the position is better than <script type="math/tex">\alpha_{pos}</script>, it would become the new <script type="math/tex">\alpha_{pos}</script></li>
<li>if the position is better than <script type="math/tex">\beta_{pos}</script>, the current node is pruned</li>
</ul>
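<p>A minimal sketch of this comparison-based alpha-beta; here <code>children</code>
(move generation) and <code>better</code> (the trained network’s judgement that one
position is preferable to another, from white’s perspective) are hypothetical
stand-ins, and positions are plain values for illustration:</p>
<pre><code class="language-python">def alphabeta(pos, depth, alpha_pos, beta_pos, maximizing, better, children):
    """Alpha-beta search where the bounds are positions, not numbers.
    better(a, b) returns True if position a is preferable to b from
    white's perspective; None stands in for the -inf/+inf bounds."""
    moves = children(pos)
    if depth == 0 or not moves:
        return pos                    # a leaf: the position itself is the "value"
    if maximizing:
        for child in moves:
            result = alphabeta(child, depth - 1, alpha_pos, beta_pos,
                               False, better, children)
            if alpha_pos is None or better(result, alpha_pos):
                alpha_pos = result    # new best position for white
            if beta_pos is not None and not better(beta_pos, alpha_pos):
                break                 # beta cut-off: prune this node
        return alpha_pos
    for child in moves:
        result = alphabeta(child, depth - 1, alpha_pos, beta_pos,
                           True, better, children)
        if beta_pos is None or better(beta_pos, result):
            beta_pos = result         # new best position for black
        if alpha_pos is not None and not better(beta_pos, alpha_pos):
            break                     # alpha cut-off
    return beta_pos
</code></pre>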
<h1 id="further-reading">Further reading</h1>
<ul>
<li>references 3, 4, 5, 6: genetic algorithms for auto evaluation function tuning when features are initialized randomly</li>
<li>reference 10: deep reinforcement learning to play chess</li>
</ul>Adrian S. Tamrighthandabacus@users.github.comGoal of the paper is to derive the evaluation function for chess from scratch using machine learning techniques. From scratch means not even input the chess rule to the evaluation function.Silver et al (2017) Mastering the game of Go without human knowledge2019-03-12T00:00:00-04:002019-03-12T00:00:00-04:00https://www.adrian.idv.hk/sssahghblbclhsdgh17-alphagozero<p>This is the AlphaGo Zero paper that gives the details of how the reinforcement learning is done.</p>
<p>The predecessor AlphaGo Fan was a success in Oct 2015, as was AlphaGo Lee. It was implemented as two deep neural networks, a policy network (that gives out move probabilities) and a value network (that outputs a position evaluation). The policy network was trained initially by <em>supervised learning</em> to predict human expert moves, and then refined by policy gradient reinforcement learning. The value network was trained to predict the winner of the game, by playing the policy network against itself. After the two neural networks are trained, they are combined with Monte Carlo tree search to provide a lookahead search.</p>
<p>AlphaGo Zero skipped the supervised learning part. It was trained solely by self-play reinforcement learning, starting from random play. The result is a single neural network instead of separate policy and value networks. And it uses only the black and white stones on the board as input features. In play, it uses only a simple tree search without Monte Carlo rollouts. The reinforcement learning is performed as follows:</p>
<p>Let <script type="math/tex">f_{\theta}</script> be a deep neural network with parameters <script type="math/tex">\theta</script> and it takes the raw board representation <script type="math/tex">s</script> of the position as input. The neural network outputs move probabilities and a value <script type="math/tex">(\mathbf{p}, v) = f_{\theta}(s)</script>, where:</p>
<ul>
<li><script type="math/tex">\mathbf{p}</script>: vector of move probabilities, where the probability of selecting each move <script type="math/tex">a</script> is <script type="math/tex">p_a = \Pr[a\mid s]</script></li>
<li><script type="math/tex">v</script>: scalar value, estimating the probability of the current player winning from position <script type="math/tex">s</script></li>
</ul>
<p>At each position <script type="math/tex">s</script>, an MCTS guided by the neural network <script type="math/tex">f_{\theta}</script> finds probabilities <script type="math/tex">\pi</script> of playing each move and a value (game winner) <script type="math/tex">z</script>. <script type="math/tex">\pi</script> usually selects much stronger moves than the raw move probabilities <script type="math/tex">\mathbf{p}</script>. The MCTS is a policy improvement operator and <script type="math/tex">z</script> a policy evaluation. Then we update the parameters <script type="math/tex">\theta</script> of the neural network to make <script type="math/tex">(\mathbf{p}, v) = f_{\theta}(s)</script> more closely match <script type="math/tex">(\pi, z)</script></p>
<p>The MCTS always starts from the root state and selects moves to maximize the upper confidence bound <script type="math/tex">Q(s,a)+U(s,a)</script>, where <script type="math/tex">U(s,a)\propto P(s,a)/(1+N(s,a))</script>, until a leaf node <script type="math/tex">s'</script> is reached. Here,</p>
<ul>
<li>edge of the game tree is denoted by <script type="math/tex">(s,a)</script>, with <script type="math/tex">s</script> the board state and <script type="math/tex">a</script> the action</li>
<li><script type="math/tex">P(s,a)</script> = prior probability</li>
<li><script type="math/tex">N(s,a)</script> = visit count</li>
<li><script type="math/tex">Q(s,a)</script> = action value</li>
</ul>
<p>At <script type="math/tex">s'</script>, we evaluate the network to obtain the prior probabilities and the value, <script type="math/tex">(P(s',\cdot), V(s')) = f_{\theta}(s')</script>. We then update each edge <script type="math/tex">(s,a)</script> traversed to increment its visit count <script type="math/tex">N(s,a)</script> and to update its action value</p>
<script type="math/tex; mode=display">Q(s,a) = \frac{1}{N(s,a)}\sum_{s'\mid s,a\to s'} V(s')</script>
<p>where the summation is over all simulations that started from position <script type="math/tex">s</script>, took move <script type="math/tex">a</script>, and eventually reached <script type="math/tex">s'</script>.</p>
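<p>A minimal sketch of the selection rule and the visit-count and value update
(my own <code>Edge</code> class and <code>c_puct</code> constant, not DeepMind’s code):</p>
<pre><code class="language-python">import math

class Edge:
    """Statistics for a (state, action) edge of the search tree."""
    def __init__(self, prior):
        self.P = prior    # prior probability from the network
        self.N = 0        # visit count
        self.W = 0.0      # total value of simulations through this edge
    @property
    def Q(self):          # mean action value
        return self.W / self.N if self.N else 0.0

def select_action(edges, c_puct=1.0):
    """Pick the action maximizing Q(s,a) + U(s,a), with U
    proportional to P(s,a) / (1 + N(s,a))."""
    total_visits = sum(e.N for e in edges.values())
    def ucb(edge):
        u = c_puct * edge.P * math.sqrt(total_visits) / (1 + edge.N)
        return edge.Q + u
    return max(edges, key=lambda a: ucb(edges[a]))

def backup(path, leaf_value):
    """Increment N and accumulate V(s') along the traversed path,
    so that Q(s,a) is the running mean of leaf evaluations."""
    for edge in path:
        edge.N += 1
        edge.W += leaf_value
</code></pre>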
<p>The neural network is trained as follows:</p>
<p>The initial weights <script type="math/tex">\theta_0</script> are random. In each subsequent iteration <script type="math/tex">i\ge 1</script>, games of self-play, where each move is identified by subscript <script type="math/tex">t</script>, are generated. At time <script type="math/tex">t</script>, MCTS outputs <script type="math/tex">\mathbf{\pi}_t = \alpha_{\theta_{i-1}}(s_t)</script> using neural network <script type="math/tex">f_{\theta_{i-1}}</script>, and a move is played by sampling the search probabilities <script type="math/tex">\mathbf{\pi}_t</script>. The game terminates when it exceeds a maximum length, when both players pass, or when a player resigns because the search value falls below a resignation threshold. Therefore, at each time step, we collect data <script type="math/tex">(s_t, \mathbf{\pi}_t, z_t)</script>, where <script type="math/tex">z_t \in \{-1, +1\}</script> is the game winner from the perspective of the current player at step <script type="math/tex">t</script>. Then a new <script type="math/tex">\theta_i</script> is trained from <script type="math/tex">(s,\mathbf{\pi},z)</script> sampled uniformly among all time steps of iteration <script type="math/tex">i-1</script>.</p>
<p>The neural network <script type="math/tex">(p,v) = f_{\theta_i}(s)</script> is adjusted to minimize the error between the predicted value <script type="math/tex">v</script> and the self-play winner <script type="math/tex">z</script>, and to maximize the similarity between the vector of move probabilities <script type="math/tex">p</script> and the search probabilities <script type="math/tex">\pi</script>. It is done using gradient descent on the loss function</p>
<script type="math/tex; mode=display">\ell = (z-v)^2 - \pi^T \log p + c \lVert\theta\rVert^2</script>
<p>for some regularization parameter <script type="math/tex">c</script> to prevent overfitting. The loss function above is the sum of a mean-squared error on the value, a cross-entropy loss on the policy, and an L2 penalty on the weights.</p>
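<p>As a sketch, the loss for one training example can be evaluated numerically as
below (plain Python for illustration; in the real system this is a differentiable
expression inside the training framework):</p>
<pre><code class="language-python">import math

def alphazero_loss(z, v, pi, p, theta, c=1e-4):
    """ell = (z - v)^2 - pi . log(p) + c * ||theta||^2, i.e.
    value MSE + policy cross-entropy + L2 weight regularization."""
    mse = (z - v) ** 2
    cross_entropy = -sum(pi_a * math.log(p_a) for pi_a, p_a in zip(pi, p))
    l2 = c * sum(w * w for w in theta)
    return mse + cross_entropy + l2
</code></pre>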
<p>Training of AlphaGo Zero: 4.9M games of self-play generated, using 1600 simulations for each MCTS, which corresponds to 0.4s thinking time per move.</p>Adrian S. Tamrighthandabacus@users.github.comThis is the AlphaGo Zero paper that gave out detail on how the reinforcement learning is done.Hypercycloid, hypertrochoid, hypocycloid, and hypotrochoid2019-03-06T20:02:47-05:002019-03-06T20:02:47-05:00https://www.adrian.idv.hk/hypocycloid<p>Hypercycloid, hypocycloid, and more general version, the hypertrochoid and
hypotrochoid, are curves of the locus of a point on a circle rolling on a
bigger circle. Like many other locus problems, it is convenient to tackle it
from parametric equations.</p>
<p>We go with the hypercycloid (aka epicycloid) first. Consider the image below
from <a href="https://en.wikipedia.org/wiki/Epicycloid">Wikipedia</a>: we have a bigger
circle of radius <script type="math/tex">R</script> with its centre fixed at the origin. The smaller, rolling
circle of radius <script type="math/tex">r</script> rolls on the outside of the bigger circle such that there
is always a single point of contact between the two circles. The locus of
interest is drawn by the point <script type="math/tex">P</script> on the smaller circle while it rolls.</p>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/6/61/Epizykloide_herleitung.svg" alt="" /></p>
<p>Observe that when the smaller circle is rolling, its centre always follows a
circle of radius <script type="math/tex">R+r</script> centred at the origin. Assume at some moment the
smaller circle has rolled to the position where its centre is at angle
<script type="math/tex">\theta</script> as illustrated; the length of arc it has rolled is then <script type="math/tex">R\theta</script>. This
length is the same measured on the big or the small circle. Assume point <script type="math/tex">P</script> was
the point of contact of the two circles when <script type="math/tex">\theta=0</script>. At an
unspecified <script type="math/tex">\theta</script>, the point <script type="math/tex">P</script> is at the angle <script type="math/tex">\alpha =
R\theta/r</script> relative to the current point of contact of the two circles,
or at the angle <script type="math/tex">\alpha+\theta</script> relative to the <script type="math/tex">x</script> axis (such angle is
measured at the third quadrant).</p>
<p>We then have the coordinates of the centre of the smaller circle:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
x &= (R+r)\cos\theta \\
y &= (R+r)\sin\theta
\end{align} %]]></script>
<p>and the coordinates of the point <script type="math/tex">P</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
x &= (R+r)\cos\theta - r\cos(\frac{R+r}{r}\theta) \\
y &= (R+r)\sin\theta - r\sin(\frac{R+r}{r}\theta)
\end{align} %]]></script>
<p>and more generally, if point <script type="math/tex">P</script> is on a circle of radius <script type="math/tex">\rho</script> concentric
with the smaller circle, then the parametric formula of the locus of the
hypertrochoid (aka epitrochoid) is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
x &= (R+r)\cos\theta - \rho\cos(\frac{R+r}{r}\theta) \\
y &= (R+r)\sin\theta - \rho\sin(\frac{R+r}{r}\theta)
\end{align} %]]></script>
<p>The derivation is similar if the smaller circle rolls on the inside of the
bigger circle, except that the angle of point <script type="math/tex">P</script> relative to the <script type="math/tex">x</script> axis
when the centre of the smaller circle is at angle <script type="math/tex">\theta</script> is
<script type="math/tex">\alpha-\theta</script> (measured at the first quadrant), as now the point is on the
clockwise side rather than the counterclockwise side as the smaller circle
rolls. So similarly, the parametric equation of the hypocycloid is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
x &= (R-r)\cos\theta + r\cos(\frac{R-r}{r}\theta) \\
y &= (R-r)\sin\theta - r\sin(\frac{R-r}{r}\theta)
\end{align} %]]></script>
<p>and the more general version, hypotrochoid, is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
x &= (R-r)\cos\theta + \rho\cos(\frac{R-r}{r}\theta + \phi) \\
y &= (R-r)\sin\theta - \rho\sin(\frac{R-r}{r}\theta + \phi)
\end{align} %]]></script>
<p>In the above, we added an angle <script type="math/tex">\phi</script> to <script type="math/tex">\alpha</script> to allow a version
rotated about the origin. The shape, however, is just the same.</p>
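<p>The parametric equations above translate directly into code. A minimal sketch,
where <code>rho == r</code> reduces the trochoids to the corresponding cycloids:</p>
<pre><code class="language-python">import math

def epitrochoid(R, r, rho, theta):
    """Point P at parameter theta for a circle of radius r rolling
    outside a circle of radius R, with P at distance rho from the
    rolling circle's centre (rho == r gives the epicycloid)."""
    k = (R + r) / r
    x = (R + r) * math.cos(theta) - rho * math.cos(k * theta)
    y = (R + r) * math.sin(theta) - rho * math.sin(k * theta)
    return x, y

def hypotrochoid(R, r, rho, theta, phi=0.0):
    """Same, but rolling inside the bigger circle (rho == r gives
    the hypocycloid); phi rotates the curve about the origin."""
    k = (R - r) / r
    x = (R - r) * math.cos(theta) + rho * math.cos(k * theta + phi)
    y = (R - r) * math.sin(theta) - rho * math.sin(k * theta + phi)
    return x, y
</code></pre>
<p>At <code>theta = 0</code>, both curves start at the point of contact <code>(R, 0)</code>.</p>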
<p>Now some code. I like the animated GIF on the Wikipedia page that shows how the
locus is created as the parameter <script type="math/tex">\theta</script> goes from 0 up to some large angle.
Generating such an animation is indeed not hard, as we have already derived the
coordinates and metrics of everything we need to show. I will use Python, as its
Pillow library is handy for creating such pictures. And in addition to GIF, I can
also generate animated images in Google’s WebP format. Here is the code (Python
3.6+ required due to the type hint syntax):</p>
<script src="https://gist.github.com/righthandabacus/97dff2233b37230b7c27d5a0001586bf.js"></script>
<p>and this is the command to generate a hypercycloid:</p>
<pre><code>python3 hypchoid.py -q 180 hyper.webp
</code></pre>
<p><img src="/img/hyper.webp" alt="" /></p>
<p>and this is for a hypocycloid:</p>
<pre><code>python3 hypchoid.py -p 50 -o hypo.webp
</code></pre>
<p><img src="/img/hypo.webp" alt="" /></p>Adrian S. Tamrighthandabacus@users.github.comHypercycloid, hypocycloid, and more general version, the hypertrochoid and hypotrochoid, are curves of the locus of a point on a circle rolling on a bigger circle. Like many other locus problems, it is convenient to tackle it from parametric equations.世界一速く結果を出す人は、なぜ、メールを使わないのか2019-02-28T00:00:00-05:002019-02-28T00:00:00-05:00https://www.adrian.idv.hk/googlework<h2 id="第１章世界より速く動くための仕事術">Chapter 1: Work techniques for moving faster than the world</h2>
<ul>
<li>Finish the job in one shot: Do not procrastinate if you can do something, and do as much as you can</li>
<li>Use Google Docs to allow work to be done wherever we are, whenever we need to</li>
<li>Set deadline for tasks: We can’t focus and schedule without a deadline
<ul>
<li>ASAP is a red flag, it will never be done</li>
</ul>
</li>
<li>Strategy for tasks can’t finish immediately:
<ul>
<li>Sort out items “understood” vs “need to figure out” and focus on former</li>
<li>Ask for clarification</li>
<li>Put it down for a rest</li>
</ul>
</li>
<li>Instead of email, use Google Doc for real-time collaboration
<ul>
<li>Email = asynchronous, slow</li>
<li>Avoids the hassles of versioning and renaming files</li>
</ul>
</li>
</ul>
<h2 id="第２章ロジカルシンキングなんてしている暇はない">Chapter 2: No time for logical thinking!</h2>
<ul>
<li>Logic and analysis are to explain, insight and intuition are for creativity</li>
<li>Tools for creativity:
<ul>
<li>Group brainstorm</li>
<li>Clue cards</li>
<li>Combining random, unrelated items</li>
</ul>
</li>
</ul>
<h2 id="第３章忙しくても１０倍の結果を出すために">Chapter 3: Producing 10x results even when busy</h2>
<ul>
<li>Leadership: Brave to leave the comfort zone</li>
<li>Lose the battle to win the war</li>
<li>Understand risk: Not to find reason to give up, but for success</li>
<li>Objective: Do not base on last year’s achievement + 10%
<ul>
<li>This will have more work but never go out of the box</li>
<li>Work on “high impact” issues</li>
<li>Think like an owner (high level). To eliminate your own job</li>
</ul>
</li>
<li>Rules for 10x result
<ul>
<li>Be a forerunner
<ul>
<li>Predict opportunity and threats</li>
<li>Work out a trend and pattern</li>
<li>Predict progress at 5 week / 5 months / 5 years later</li>
</ul>
</li>
<li>Think from a third person’s perspective</li>
<li>Have my own understanding, and know my comfort zone</li>
<li>Speak up and tell people the truth</li>
<li>Pick up responsibilities</li>
<li>Active to participate</li>
<li>Listen to my intuition</li>
<li>Break the norm</li>
<li>Fail fast</li>
<li>Be curious</li>
<li>Think in a different angle
<ul>
<li>Look at the whole</li>
<li>Look at details</li>
<li>Reverse side</li>
<li>Retrospective from future</li>
<li>From customer’s perspective</li>
<li>From competitor’s perspective</li>
</ul>
</li>
</ul>
</li>
</ul>
<h2 id="第４章仕事の加速度を上げる人間関係のつくり方">Chapter 4: Building relationships that accelerate your work</h2>
<ul>
<li>Prototype thinking: Communicate with a prototype, not verbal
<ul>
<li>Use pseudo-code, sketches</li>
</ul>
</li>
<li>Manage your manager
<ul>
<li>Ask for weekly meeting / guidance</li>
</ul>
</li>
<li>People need psychological safety</li>
</ul>
<h2 id="第５章必要なことを高速で学ぶ方法">Chapter 5: How to learn what you need at high speed</h2>
<ul>
<li>How to ask: Say your hypothesis and ask for comment</li>
<li>NLP pyramid of neurological level
<ul>
<li>Identity (top)</li>
<li>Beliefs and values</li>
<li>Attitude</li>
<li>Capabilities</li>
<li>Behaviour</li>
<li>Environment (bottom)</li>
</ul>
</li>
<li>Change in upper level drives changes in lower levels</li>
<li>To change behaviour: Change beliefs, attitude, capabilities</li>
<li>People network: Weak ties are more useful than strong ties
<ul>
<li>Weak ties = Not interact very often</li>
</ul>
</li>
<li>Growth mindset</li>
</ul>
<h2 id="第６章グーグルの疲れない働き方">Chapter 6: Google’s way of working without getting tired</h2>
<ul>
<li>Schedule time for monotasking and multitasking</li>
<li>Emails that we can’t reply immediately:
<ul>
<li>Need more work (to confirm, etc)</li>
<li>Make me feel angry (hot cognition): Can’t react reasonably</li>
</ul>
</li>
<li>Deal with emotion, not to suppress emotion</li>
<li>Gary Hamel: Creative economy = Work 3.0
<ul>
<li>Work 2.0 = Knowledge economy</li>
<li>Work 1.0 = Production economy</li>
</ul>
</li>
</ul>
<h2 id="終章自分の仕事を壊せる人が次の時代をつくる">Final chapter: Those who can destroy their own job will create the next era</h2>
<ul>
<li>To keep you a job: To eliminate it yourself</li>
<li>Best way of predicting the future is to make the future</li>
<li>Intrapreneur</li>
</ul>
<p>Chinese translation:</p>
<pre><code class="language-coffeescript">@book{
title = {Google神速工作術: 全世界最聰明的Google菁英如何創造10倍的工作成果?},
author = {彼優特．菲利克斯．吉瓦奇 Piotr Feliks Grzywacz},
editor = {劉愛夌},
publisher = {平安文化有限公司},
year = 2018,
month = February,
isbn = 9789869607704,
pages = 256,
}
</code></pre>Adrian S. Tamrighthandabacus@users.github.comChapter 1: Work techniques for moving faster than the worldDeCandia et al (2007) Dynamo2019-02-14T23:10:00-05:002019-02-14T23:10:00-05:00https://www.adrian.idv.hk/dhjklpsvv07-dynamo<p>Objectives of dynamo</p>
<ul>
<li>dealing with failures in an infrastructure comprised of millions of components</li>
<li>data store that only need primary-key access
<ul>
<li>use key value store, as RDBMS will limit scale and availability</li>
</ul>
</li>
</ul>
<p>Techniques used</p>
<ul>
<li>data partitioned and replicated using consistent hashing</li>
<li>object versioning to maintain consistency</li>
<li>quorum-like technique and synchronization protocol to maintain consistency among replicas</li>
</ul>
<p><strong>§2.1 System Assumption & Requirement</strong></p>
<ul>
<li>Query: Key → Blob (blob size < 1 MB)</li>
<li>Give up ACID because of poor availability
<ul>
<li>no isolation as in ACID, allow only single key update</li>
</ul>
</li>
<li>Use 99.9th percentile of latency as efficiency metric and SLA definition</li>
<li>Assume to use in internal, non-hostile environment</li>
</ul>
<p><strong>§2.3 Design considerations</strong></p>
<p>Synchronous replica coordination = Strong consistency = Trade-off availability under failure scenario</p>
<ul>
<li>dynamo maintain only eventual consistency</li>
</ul>
<p>Update conflict resolution at write vs at read</p>
<ul>
<li>traditional: at write time, but failing write availability if cannot reach all or most of the replica</li>
<li>dynamo: conflict resolution at read</li>
</ul>
<p>Conflict resolution scheme</p>
<ul>
<li>at data store: we can only have last write wins</li>
<li>at application: allows app-specific merge algorithm</li>
</ul>
<p>Cluster design:</p>
<ul>
<li>incremental scale out</li>
<li>symmetry: all nodes have same responsibilities</li>
<li>decentralization</li>
</ul>
<p><strong>§3 Related work</strong></p>
<ul>
<li>P2P network with enough routing information at each peer can employ O(1) routing</li>
<li>Design to be robust against network partition</li>
<li>Do not focus on data integrity and security, but target for always writable</li>
</ul>
<p><strong>§4.1 System interface</strong></p>
<p><code>get(key)</code>: returns a single object, or multiple objects with conflicting versions</p>
<p><code>put(key, object)</code>: writes replicas to disk, with metadata such as versions</p>
<p><strong>§4.2 Partitioning</strong></p>
<p>Partition of data over many storage nodes: use consistent hashing</p>
<ul>
<li>ring of hash values, and “virtual nodes” placed on the ring</li>
<li>virtual nodes assigned a random value in the ring</li>
<li>coordinator node responsible for the range between it and its predecessor</li>
<li>each range is replicated to next <script type="math/tex">N-1</script> nodes in the ring in clockwise direction</li>
<li>each instance of dynamo host has multiple virtual nodes</li>
</ul>
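<p>The partitioning scheme above can be sketched with a sorted list of
virtual-node hashes (a toy illustration using MD5, not Dynamo’s actual code):</p>
<pre><code class="language-python">import bisect
import hashlib

def _hash(key):
    """Map a string to a point on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring with virtual nodes."""
    def __init__(self, nodes, vnodes=8):
        # each physical node gets several random-looking points on the ring
        self._ring = sorted(
            (_hash("%s#%d" % (node, i)), node)
            for node in nodes for i in range(vnodes))
        self._points = [h for h, _ in self._ring]

    def preference_list(self, key, n=3):
        """First n distinct physical nodes clockwise from the key: the
        first is the coordinator, the rest hold the replicas."""
        start = bisect.bisect(self._points, _hash(key))
        result = []
        for i in range(len(self._ring)):
            node = self._ring[(start + i) % len(self._ring)][1]
            if node not in result:
                result.append(node)
            if len(result) == n:
                break
        return result
</code></pre>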
<p>Physical node = multiple virtual nodes for uniform load distribution even if the hash value is not uniform</p>
<p>Always <script type="math/tex">N</script> replicas for each data object: for high availability</p>
<p><strong>§4.3 Replication</strong></p>
<p>Every key has <script type="math/tex">N</script> replicas in the system. Coordinator node replicates a key
to <script type="math/tex">N-1</script> clockwise successor nodes in the ring</p>
<p><strong>§4.4 Data versioning</strong></p>
<ul>
<li>for eventual consistency</li>
<li>example: shopping cart is data, add to cart or remove from cart are put requests that create new version of the data</li>
<li>divergent versions are reconciled later
<ul>
<li>syntactic reconciliation: system knows one version is older than another</li>
<li>semantic reconciliation: only client knows how to collapse multiple branches of data evolution into one</li>
</ul>
</li>
</ul>
<p><strong>§4.5 Get and put operations</strong></p>
<p>Quorum system:</p>
<ul>
<li>get or put requests are sent to the top <script type="math/tex">N</script> reachable nodes in the preference
list, and the coordinator waits for <script type="math/tex">W</script> or <script type="math/tex">R</script> responses before returning to the client</li>
<li>successful read need <script type="math/tex">R</script> nodes’ response</li>
<li>successful write need <script type="math/tex">W</script> nodes’ response</li>
<li><script type="math/tex">R+W > N</script>, typically <script type="math/tex">R=W=2</script>, <script type="math/tex">N=3</script></li>
</ul>
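<p>The condition above guarantees that every read quorum intersects every write
quorum, so a read always contacts at least one replica holding the latest write.
A brute-force check of this intersection property (an illustration, not Dynamo
code):</p>
<pre><code class="language-python">import itertools

def read_sees_latest(n=3, r=2, w=2):
    """Check the quorum intersection property: for every choice of W
    replicas acknowledging a write and every choice of R replicas
    answering a read, the read set contains an up-to-date replica."""
    replicas = range(n)
    for write_set in itertools.combinations(replicas, w):
        for read_set in itertools.combinations(replicas, r):
            if not set(write_set).intersection(read_set):
                return False   # a read could miss the latest write
    return True
</code></pre>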
<p><strong>§4.6 Hinted hand-off</strong></p>
<p>Always maintain <script type="math/tex">N</script> replicas to <script type="math/tex">N</script> healthy nodes from preference list of a key</p>
<p>Example: the key of an object should go to node A, with replicas on B and C. But if
node A is unreachable, a replica will be sent to D instead. All replicas (esp. the
one on D) carry a hint in their metadata that A should be the coordinator. So when
A is restored, D will forward the object back to A.</p>
<p><strong>§4.7 Replica synchronization</strong></p>
<p>Merkle tree: Hash tree such that</p>
<ul>
<li>leaves are hashes of values of individual keys</li>
<li>parent nodes = hashes of their children</li>
</ul>
<p>Each Dynamo node maintains a separate Merkle tree for each key range, and uses tree
traversal to determine if there is any difference and to perform synchronization</p>
<p><strong>§4.8.1 Ring membership</strong></p>
<p><em>Explicit</em> mechanism to add or remove a node from a dynamo ring.</p>
<p>Persistent store is used to hold membership. Gossip-based protocol to propagate
membership changes and maintain an eventual consistent view of membership.</p>
<ul>
<li>each node randomly connects to another node in the ring and they synchronize their membership change histories</li>
<li>partitioning and placement also through gossip protocols</li>
</ul>
<p><strong>§4.8.2 External discovery</strong></p>
<ul>
<li>some nodes are “seed nodes”, which every node joining a ring knows about and syncs membership with</li>
<li>seed nodes are discovered by external mechanism such as configuration service</li>
</ul>
<p><strong>§4.8.3 Failure detection</strong></p>
<ul>
<li>failure detection to avoid attempts to communicate with unreachable peers
<ul>
<li>local notation of failure: no response to messages sent</li>
<li>decentralized failure detection: reported over gossip protocol</li>
</ul>
</li>
</ul>
<p>Only temporary failures should be detected and learned from the gossip protocol.
Permanent node addition/removal should be done using the explicit mechanism to avoid
confusion.</p>
<p><strong>§4.9 Add/remove storage nodes</strong></p>
<p>New node into system:</p>
<ul>
<li>assigned with a few new tokens randomly scattered on the ring</li>
<li>key range changes, some nodes have to transfer the range to other nodes</li>
<li>transfer offer and confirmation messages between nodes
<ul>
<li>confirmation avoids duplicated transfers</li>
</ul>
</li>
</ul>
<p><strong>§5 Implementation</strong></p>
<p>Each dynamo node has the following function:</p>
<ul>
<li>request coordination (handles get and put)</li>
<li>membership & failure detection</li>
<li>local persistence engine (non-volatile storage of key-value data)</li>
</ul>
<p>Local persistence: use MySQL, Berkeley DB, etc.</p>
<p>Request coordination: Read write requests</p>
<ul>
<li>event-driven messaging substrate used</li>
<li>communication = Java NIO channels</li>
<li>example of read
<ol>
<li>send read request to nodes</li>
<li>wait for min number of required responses</li>
<li>if too few responses within time bound, fail the request</li>
<li>gather all responses and determine which one to return</li>
<li>syntactic reconciliation if needed and generate write request</li>
</ol>
</li>
<li>coordinator do “read repair” if stale versions are found</li>
<li>optimize for read-after-write: the write coordinator is chosen to be the node that
replied fastest to the previous read</li>
</ul>
<p><strong>§6 Performance</strong></p>
<p>Write latency is longer than read latency because of disk access for persistent
storage. The 99.9th percentile latency is around 200ms.</p>
<p><strong>§6.2 Uniform load distribution</strong></p>
<p>Different partitioning scheme of key ranges in ring evaluated (see fig 3 of the paper):</p>
<p>Strategy 1: <script type="math/tex">T</script> random tokens per node, partition by token value</p>
<ul>
<li>when nodes joins/leaves, ranges changes</li>
<li>space at each node to maintain membership is linear to the number of nodes in system</li>
<li>node join = steal key range from other nodes
<ul>
<li>nodes to scan persistent store to hand off data</li>
<li>slow bootstrapping, recalculation of Merkle tree on many nodes</li>
</ul>
</li>
</ul>
<p>Strategy 2: <script type="math/tex">T</script> random tokens per node, equal sized partition</p>
<ul>
<li>hash space divided to <script type="math/tex">Q</script> equal sized partitions</li>
<li>Each node assigned <script type="math/tex">T</script> tokens, <script type="math/tex">S</script> nodes in the system</li>
<li><script type="math/tex">Q \gg N</script> and <script type="math/tex">Q \gg ST</script>, <script type="math/tex">Q</script> is a fixed constant</li>
</ul>
<p>Strategy 3: <script type="math/tex">Q/S</script> tokens per node, equal sized partition</p>
<ul>
<li>each node has <script type="math/tex">Q/S</script> tokens</li>
<li>node leaves the system = redistribute its tokens to remaining nodes</li>
<li>node joins the system = steal tokens from other nodes to maintain <script type="math/tex">Q/S</script> tokens per node</li>
</ul>
<p>Strategy 3 has best load balancing efficiency</p>
<ul>
<li>fast to bootstrap</li>
<li>easy to archive</li>
</ul>
<p>Because nodes now handle the <script type="math/tex">Q</script> partitions separately, a node can assume and
renounce responsibility for a partition as simply as adding and removing a file. No need
to recalculate the Merkle tree.</p>
<p><strong>§6.3 Divergent versions</strong></p>
<p>Divergent versions due to (1) node failure and (2) concurrent write</p>
<p><strong>§6.4 Coordination</strong></p>
<p>Request coordination component: client vs server</p>
<p>Server: client send request to any node on the ring</p>
<ul>
<li>send by load balancer</li>
<li>any dynamo node can be a coordinator for read and any node on the key’s
preference list can be coordinator for write</li>
</ul>
<p>Client: At application library to implement the request coordination logic</p>
<ul>
<li>client needs to sync membership state with a random dynamo node periodically</li>
<li>better performance in latency</li>
<li>efficiency depends on how fresh the membership information is at the client</li>
</ul>
<h1 id="references">References</h1>
<p><a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archTOC.html">Understanding the architecture of
Cassandra</a>:
Some authors of dynamo built Cassandra and they share a few architectural strategies</p>Adrian S. Tamrighthandabacus@users.github.comObjectives of dynamo大規模Web服務開發技術2019-02-09T00:00:00-05:002019-02-09T00:00:00-05:00https://www.adrian.idv.hk/webservice<h2 id="第1課-大規模服務和小規模服務">Lesson 1: Large-scale services vs small-scale services</h2>
<p>Scale up: improve the performance of the hardware</p>
<p>Scale out: increase the number of servers</p>
<p>Ensure scalability with commodity hardware</p>
<p>Use a load balancer</p>
<h2 id="第2課-持續增長的服務和大規模化的障礙">Lesson 2: Continuously growing services and obstacles to scaling</h2>
<p>Use LVS + keepalived, an open-source combination of load balancer and health monitoring</p>
<p>Gradually roll out virtualized operating systems</p>
<h2 id="第3課-服務開發現場">Lesson 3: The service development scene</h2>
<p>Write test cases as much as possible during implementation - TDD</p>
<p>Language selection strategy: use only one language within the same layer</p>
<h2 id="第5課-大規模數據處理的難點">Lesson 5: Difficulties of large-scale data processing</h2>
<p>Large scale = in-memory computation is impossible</p>
<p>Finding the bottleneck:</p>
<ul>
<li><code>top</code> or <code>sar</code> for processes</li>
<li><code>ps</code> for CPU time</li>
<li>profile with <code>strace</code> or <code>oprofile</code></li>
</ul>
<h2 id="第6課-可擴展性的要點">Lesson 6: Key points of scalability</h2>
<p>Hardware price is not proportional to performance</p>
<p>CPU scaling can rely on a load balancer</p>
<p>I/O scaling is very difficult; it can rely on a large-scale database</p>
<h2 id="第8課-操作系統的緩存機制">Lesson 8: The OS caching mechanism</h2>
<p>LRU cache: Linux identifies a file by its inode number, and caches with the file’s inode number and offset as the key</p>
<p>The OS internally uses a radix tree for cache lookup</p>
<ul>
<li>Radix tree is similar to a Patricia tree</li>
</ul>
<h2 id="第10課-利用局部性的分布式">Lesson 10: Distribution that exploits locality</h2>
<p>Partitioning: split one database across multiple servers</p>
<p>The corresponding cache memory can then be used elsewhere; overall, the amount of in-memory data increases</p>
<p>Split into islands: Google/Yahoo spiders are assigned to one island, human users to another</p>
<ul>
<li>A spider will visit pages from long, long ago; humans will not</li>
<li>Because of that breadth of access, caching is hard to make effective</li>
</ul>
<h2 id="第13課-mysql的橫向擴展和partitioning">Lesson 13: MySQL scale-out and partitioning</h2>
<p>Redundancy = 4 servers</p>
<ul>
<li>With only 3 servers, i.e., 1 master + 2 slaves, when one slave fails, only 1
master + 1 slave keep running. To swap in a new server, data must be copied from
one of them, and the remaining slave has to be taken offline during the copy,
losing all the redundancy that is left</li>
<li>If the master handles only writes and the slaves only reads, 2 servers must be running under any circumstance</li>
</ul>
<p>Partitioning: if adding RAM can cope with the load, do not partition</p>
<h2 id="第17課-可變字節碼和速度的感覺">Lesson 17: Variable byte codes and the sense of speed</h2>
<p>Variable byte code: normally an integer = 4 bytes, but not every integer needs 4
bytes. Suppose we store an array of integers, e.g. [5, 130]</p>
<pre><code>5 = 00000000 00000000 00000000 00000101
130 = 00000000 00000000 00000000 10000010
</code></pre>
<p>VBC: the 7 LSBs carry the value and the MSB is a flag; only the last byte of a number has its MSB asserted</p>
<pre><code>5 = 10000101
130 = 00000001 10000010
[5, 130] = 10000101 00000001 10000010
</code></pre>
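<p>The encoding above can be sketched in Python; a minimal encoder/decoder pair (a toy, not any production codec):</p>

```python
def vbc_encode(numbers):
    """Encode non-negative integers with variable byte code: 7 payload
    bits per byte, MSB set only on the final byte of each number."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]          # lowest 7 bits; flagged below
        n >>= 7
        while n:
            chunk.append(n & 0x7F)  # higher 7-bit groups, MSB clear
            n >>= 7
        chunk[0] |= 0x80            # flag the terminating byte
        out.extend(reversed(chunk))
    return bytes(out)

def vbc_decode(data):
    """Decode: accumulate 7 bits per byte until a byte with MSB set."""
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:
            numbers.append(n)
            n = 0
    return numbers
```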
<p>Other ways to compress the storage: store the differences, e.g., [5, +125]</p>
<ul>
<li>the differences follow a geometric distribution</li>
</ul>
<p>Other methods: gamma code, delta code, Golomb code, interpolative encoding</p>
<h2 id="第20課-hatena-diary的關鍵字鏈接">Lesson 20: Keyword links in Hatena Diary</h2>
<p>Aho-Corasick algorithm: one large string (the document), and a set of keywords to search for in the document</p>
<ul>
<li>Keywords form a trie</li>
<li>Trie: add failure links to other nodes of the trie, e.g. <code>[ABCE, BCB]</code>
A -> B -> C -> E
B -> C -> B
Failure link: <code>C</code> on the first chain links to the second <code>B</code> on the second chain, so when after
<code>C</code> we fail to match <code>E</code>, we allow it to match <code>B</code></li>
<li>The trie becomes a state machine. We walk the document over the state machine
and enumerate all “accept” nodes ever traversed.</li>
</ul>
<h2 id="第22課-創建hatena關鍵字鏈接">Lesson 22: Building Hatena keyword links</h2>
<p>Goal:</p>
<ol>
<li>Given a keyword set</li>
<li>Take an article as input</li>
<li>Find the keywords in the article and return their offsets</li>
<li>Replace each keyword span with <code><a href=".."> .. </a></code></li>
</ol>
<p>Solution:</p>
<ol>
<li>Aho-Corasick algorithm: build a trie from the keyword set (the Aho-Corasick automaton)</li>
<li>Match from the root of the trie; on a failed match, fall back along the failure link</li>
</ol>
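<p>The two steps above can be sketched compactly (a minimal version of the algorithm, with failure links computed by BFS; function names are my own):</p>

```python
from collections import deque

def build_automaton(keywords):
    """Build an Aho-Corasick automaton: a trie of the keywords,
    plus failure links computed by BFS from the root."""
    goto, fail, out = [{}], [0], [[]]   # node 0 is the root
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:   # extend the trie
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)         # this node "accepts" the word
    queue = deque(goto[0].values())     # depth-1 nodes fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]             # follow failure links upward
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]  # inherit accepted words
    return goto, fail, out

def search(text, automaton):
    """Walk the document over the state machine; yield (offset, keyword)."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            yield i - len(word) + 1, word
```

<p>With the offsets and keyword lengths from <code>search()</code>, wrapping each span in a link tag is a simple string replacement.</p>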
<h2 id="第25課-搜索系統的架構">Lesson 25: Architecture of a search system</h2>
<p>Full-text search: grep</p>
<ul>
<li>Document size m, keyword size n, complexity = O(mn)</li>
<li>Knuth-Morris-Pratt algorithm: O(m+n)
<ul>
<li>the KMP algorithm is the Aho-Corasick algorithm with only one keyword</li>
</ul>
</li>
<li>Boyer-Moore algorithm: O(mn) in the worst case, O(m/n) in the best case</li>
</ul>
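<p>For comparison, a minimal KMP implementation (the failure function here is the single-keyword analogue of the Aho-Corasick failure link):</p>

```python
def kmp_search(text, pattern):
    """Return start offsets of all occurrences of pattern in text.
    fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also its suffix; the scan runs in O(m + n)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    k, hits = 0, []
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):       # full match ending at position i
            hits.append(i - k + 1)
            k = fail[k - 1]         # keep scanning for overlaps
    return hits
```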
<p>Suffix-based types</p>
<ul>
<li>Trie, suffix array, suffix tree</li>
<li>Store the entire document in memory in a searchable form; no need to keep a copy of the document</li>
</ul>
<p>Inverted index</p>
<ul>
<li>Relates a “term” to documents</li>
<li>The inverted index is built separately, outside the documents</li>
<li>A copy of the documents needs to be kept</li>
</ul>
<h2 id="第25課-搜索引擎的內部結構">Lesson 25: Inside a search engine</h2>
<p>Dictionary + postings</p>
<ul>
<li>Dictionary: the keyword set</li>
<li>Postings: the term-document relationship
<ul>
<li>can instantly find the documents that contain a term</li>
</ul>
</li>
</ul>
<p>Creating the dictionary</p>
<ul>
<li>How to choose the terms?</li>
<li>A predefined dictionary, or morphological analysis, or n-gram segmentation
<ul>
<li>use Wikipedia titles as the dictionary</li>
<li>morphological analysis: segment words by part of speech</li>
</ul>
</li>
</ul>
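<p>The dictionary-plus-postings structure can be sketched as follows (a toy model with whitespace tokenization, standing in for morphological analysis or n-gram segmentation):</p>

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> sorted list of document IDs
    (the postings list). docs maps document ID to its text."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_and(index, *terms):
    """Conjunctive query: intersect the postings lists of all terms."""
    result = None
    for term in terms:
        ids = set(index.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])
```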
<p>Evaluating search engine quality:</p>
<ul>
<li>Recall: the fraction of relevant documents that appear in the search results; coverage, quantity</li>
<li>Precision: the fraction of search results that are relevant; accuracy, quality</li>
</ul>
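<p>The two measures are just set ratios (assuming non-empty result and relevant sets):</p>

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved| (quality);
    recall = |retrieved ∩ relevant| / |relevant| (coverage)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hit = len(retrieved & relevant)
    return hit / len(retrieved), hit / len(relevant)
```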
<p>References:</p>
<ul>
<li>Maxime Crochemore, Christophe Hancart, Thierry Lecroq, “Algorithms on Strings”, Cambridge University Press, 2007.</li>
<li>Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, “Introduction to Information Retrieval”, Cambridge University Press, 2008.</li>
<li>Ian H. Witten, Alistair Moffat, Timothy C. Bell, “Managing Gigabytes”, Morgan Kaufmann, 1999.</li>
<li>Bruce Croft, Donald Metzler, Trevor Strohman, “Search Engines: Information Retrieval in Practice”, Addison Wesley, 2009.</li>
</ul>
<h2 id="第28課-答案範例和思路">Lesson 28: Sample answers and ways of thinking</h2>
<p>In 2010:</p>
<ul>
<li>Twitter handles 1.2M tweets per second</li>
<li>Scale out by splitting the data with partitioning</li>
<li>Use MySQL + memcached and partitioning</li>
<li>Partition by tweet publication time, not by user id</li>
</ul>
<h2 id="第29課-企業軟件vs-web服務">Lesson 29: Enterprise software vs web services</h2>
<p>Web services:</p>
<ol>
<li>Low cost, high efficiency
<ul>
<li>do not pursue 100% reliability</li>
</ul>
</li>
<li>Design for scalability and responsiveness
<ul>
<li>only need to stay well responsive 99% of the time; put the technical focus on scalability</li>
</ul>
</li>
<li>Value development speed
<ul>
<li>enterprise software has to be planned half a year to a year ahead, then developed and tested; web infrastructure must be able to respond flexibly</li>
</ul>
</li>
</ol>
<h2 id="第30課-雲vs自行構建基礎設施">Lesson 30: Cloud vs self-built infrastructure</h2>
<p>Scale of Hatena’s services</p>
<ul>
<li>1.5M users, 19M unique users per month</li>
<li>Several billion requests per month, not counting images</li>
<li>Peak time traffic 850Mbps</li>
<li>More than 600 physical servers in 22 racks, 1300 virtualized servers</li>
</ul>
<p>System architecture:</p>
<ul>
<li>App:Database ratio = 1:2</li>
<li>Servers are split by request attributes into user-facing and crawler-facing</li>
<li>Architecture: Reverse proxy (3), Squid (2), memcached (5), App server (5 user, 6
spider), DB (2 content, 14 entry, 2 html, 6 keyword), load balancer (2). Other
services: Search (2), categorize (1), related doc (2), hadoop (2), worker (2)</li>
</ul>
<h2 id="第31課-層和可擴展性">Lesson 31: Layers and scalability</h2>
<p>1M page views per month can be handled by a single server</p>
<h2 id="第33課-保証冗餘性">Lesson 33: Ensuring redundancy</h2>
<p>Strategy: use the load balancer for failover and failback</p>
<ul>
<li>Failover: take a failed server offline automatically</li>
<li>Failback: bring a recovered server back online</li>
<li>The load balancer runs automatic health checks to decide whether failover/failback is needed</li>
</ul>
<p>Database servers</p>
<ul>
<li>Redundancy of the master is more difficult</li>
<li>Multi-master = bidirectional replication, which has latency and brief (millisecond-scale) data inconsistency</li>
<li>MySQL multi-master: the servers monitor each other with the Virtual Router Redundancy Protocol
<ul>
<li>active/standby architecture</li>
<li>only one active; all writes go to the active</li>
<li>once the active goes down, the standby is promoted to active and becomes the new master</li>
<li>after repair, failback: the repaired server becomes the new standby or becomes active again</li>
</ul>
</li>
</ul>
<p>Hatena in August 2009: tens of GB of storage consumed per day, 2TB per month</p>
<p>Increasing the storage of a single server = the number of saved files explodes = I/O performance becomes the bottleneck</p>
<h2 id="第34課-系統穩定化">Lesson 34: System stabilization</h2>
<p>Use the redundancy measures above to make the system more stable.</p>
<p>Avoid using up the CPU, or there will not be enough processing capacity during a failure. In practice use only about 70%</p>
<p>Slashdot effect, Digg effect: add cache servers such as Squid, and serve cached content to users who are not logged in</p>
<p>Example: Nintendo DSi</p>
<ul>
<li>Adding a star plays the coin sound of Super Mario, so kids would add 100 or 1000 stars</li>
<li>Before Ugomemo went online, there were tens of millions of stars in total; afterwards it grew to hundreds of millions. A database of a few GB became tens of GB</li>
</ul>
<h2 id="第35課-系統穩定對策">Lesson 35: System stability measures</h2>
<p>To maintain a proper margin, operate at 70% of the limit</p>
<p>Automatic DoS detection: mod_dosdetector</p>
<ul>
<li>If too many requests come from the same IP address within one minute, return 403</li>
</ul>
<p>Automatically restart the system when resources are over-used</p>
<ul>
<li>Memory leak protection</li>
</ul>
<p>Automatically kill time-consuming queries</p>
<ul>
<li>Kill SQL running too long</li>
<li>A stopgap measure when the code cannot be fixed yet</li>
</ul>
<h2 id="第36課-虛擬化技術">Lesson 36: Virtualization technology</h2>
<p>Hatena uses Xen (CentOS 5.2, Xen 3.0.3)</p>
<ul>
<li>Local disk LVM partition</li>
<li>monit watches the load average; once it exceeds a threshold, restart Apache or reboot the operating system</li>
</ul>
<p>Virtualization overhead:</p>
<ul>
<li>CPU speed: 2-3%</li>
<li>Memory throughput: 1%</li>
<li>Network throughput: 50%</li>
<li>I/O performance: 5%</li>
</ul>
<h2 id="第37課-硬件和提高效率">Lesson 37: Hardware and improving efficiency</h2>
<p>32GB RAM vs 8GB RAM + SSD</p>
<ul>
<li>With 32GB RAM all data fits in memory; no I/O reads occur, only I/O writes</li>
<li>With 8GB RAM there are a lot of I/O reads, but thanks to the SSD the I/O wait is not noticeable</li>
<li>SSD performance approaches in-memory performance without server-grade hardware</li>
<li>Both reach a throughput of 500 SQL SELECTs per second</li>
</ul>
<p>SSD lifetime</p>
<ul>
<li>Check S.M.A.R.T. attribute E9 (media wearout indicator), which counts down from 100 to 0</li>
<li>View it with the smartctl command, attribute ID 233 (hex E9)</li>
</ul>
<h2 id="特別篇第1課-作業隊例系統theschwartzgearman">Special Lesson 1: Job queue systems TheSchwartz and Gearman</h2>
<p>Use a task queue to run jobs that can tolerate delay</p>
<h2 id="特別篇第3課-緩存系統squidvarnish">Special Lesson 3: Cache systems Squid and Varnish</h2>
<p>Two squids running jointly: use the Internet Cache Protocol (ICP, an internet draft)</p>
<p>Two-tier structure:</p>
<ul>
<li>The upper-tier squid receives requests but keeps no cache; it forwards the request to the lower-tier squid</li>
<li>Use CARP (Cache Array Routing Protocol)</li>
<li>Forward to the appropriate squid server, with the URL as the key</li>
</ul>
</ul>PACELC theorem2019-02-07T19:19:07-05:00https://www.adrian.idv.hk/pacelc<p>A few years ago I learned about <a href="/2012-11-21-captheorem">ACID, BASE, and the CAP theorem</a>.
Recently I learned about an extension, the PACELC theorem. So some recap:</p>
<h2 id="acid-and-base">ACID and BASE</h2>
<p><a href="https://en.wikipedia.org/wiki/ACID_(computer_science)">ACID</a> is about database
transaction operation. It means:</p>
<ul>
<li>atomicity: a transaction either fully succeeds or is entirely rolled back</li>
<li>consistency: the database state should never be invalidated. All necessary
constraints, triggers, and cascades should be applied</li>
<li>isolation: parallel execution of transactions behaves as if they executed
sequentially. An incomplete transaction may even be invisible to other
transactions.</li>
<li>durability: completed transactions persist, committed to non-volatile memory,
so we can even survive a power failure</li>
</ul>
<p>The common confusion is about the term <em>consistency</em>, which here means the
database state does not contradict its own constraints. The
other use of the word means identical data in a distributed system, or
convergence to the same state between different hosts.</p>
<p><a href="https://en.wikipedia.org/wiki/Eventual_consistency">BASE</a> transaction is at
the other extreme, which uses the other meaning of consistency:</p>
<ul>
<li>basically available: system responsive even without guarantee on data consistency</li>
<li>soft-state: the database state may change over time even without user updates</li>
<li>eventual consistency: the soft state of the database will converge to a stable
state once data has propagated to all hosts</li>
</ul>
<p>ACID is the standard mode of operation for an RDBMS to guarantee data accuracy.
Usually mutual exclusion is used, which hurts I/O throughput. BASE, however, is common for
NoSQL database clusters, which optimize for availability and best-effort response
time while sacrificing correctness.</p>
<h2 id="cap-and-pacelc">CAP and PACELC</h2>
<p>Connecting the ACID and BASE paradigms is the CAP theorem. It concerns the following
attributes of a distributed data store:</p>
<ul>
<li>consistency: read reflects the most recent write (A and D in ACID, same
meaning of consistency in BASE)</li>
<li>availability: respond to every request, without the above consistency guarantee</li>
<li>partition tolerance: the system continues to operate despite network failure,
e.g., partially disconnected, packet drops, severe delay</li>
</ul>
<p>We cannot avoid partitions in a distributed system, so the CAP theorem means a
distributed system should choose between consistency and availability. ACID
databases chose consistency (refuse to respond if they cannot check with peers) while
BASE databases chose availability (respond with local data without ensuring it
is the latest among its peers).</p>
<p>The <a href="https://en.wikipedia.org/wiki/PACELC_theorem">PACELC theorem</a> gives further
detail on what happens when there is no partitioning (i.e., the network is healthy).
The acronym means that if we suffer from network partitioning (P), we have to choose
between availability (A) and consistency (C), else (E) we have to choose between
latency (L) and consistency (C). The PAC is the same as the CAP theorem and the ELC
is the extension.</p>
<p>The whole thesis assumes we maintain high availability by replication. So
when there is a failure, the CAP theorem prevails. But if not, we still have to
consider the tradeoff between consistency and latency of a replicated system.
Now we can classify some database systems:</p>
<ul>
<li>MySQL cluster: PC+EC, such an RDBMS always prioritizes consistency of data</li>
<li>Amazon DynamoDB: PA+EL, it aims for fast response time by trading off
consistency of data</li>
</ul>
<h2 id="reference">Reference</h2>
<p>Daniel Abadi, <a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf">Consistency Tradeoffs in Modern Distributed Database System
Design</a>. IEEE
Computer, pp.37–42, February 2012.</p>Evolution of HTTP2019-02-06T16:55:00-05:00https://www.adrian.idv.hk/http<p>In the beginning, there was HTTP. Then we started to version it. First we have
HTTP/0.9, then HTTP/1.0 (RFC1945), then HTTP/1.1 (RFC2068 in 1997), and
recently we have HTTP/2 (RFC7540 in 2015) and HTTP/3 (<a href="https://tools.ietf.org/html/draft-ietf-quic-http-18">Internet
draft</a> in 2018). So it
should be a time to review how the protocol evolved in a high level.</p>
<p>The predominant HTTP/1.1 protocol<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> uses one TCP connection for each request, and
we can reuse the connection after a request has finished. Usually a browser
will limit itself to opening at most 6 connections at a time. Optionally, we can
run HTTP above SSL. SSL is versioned as 1.0 (the Netscape version), 2.0 (1995),
and later 3.0 (1996), which was then renamed and evolved into TLS 1.0 (RFC2246 in
1999). These are all deprecated and now we have TLS 1.1 (RFC4346 in 2006), TLS
1.2 (RFC5246 in 2008), and TLS 1.3 (RFC8446 in 2018). TLS 1.2 gradually became
mandatory after the European GDPR, and TLS 1.3 is on its way to becoming the norm.</p>
<h3 id="spdy-and-http2">SPDY and HTTP/2</h3>
<p>That is the state of things for now. But years back, Google was experimenting with SPDY: a
revision of HTTP that breaks an HTTP stream into frames and interleaves the
frames of different streams under the same TCP connection. This idea was
standardized into HTTP/2. Its features are:</p>
<ul>
<li>add multiplexing and pipelining over HTTP/1.1
<ul>
<li>concurrent requests in a single connection</li>
<li>save overhead on TCP handshake, SSL</li>
<li>Standard adopted from Google SPDY spec</li>
<li>RFC7540 (HTTP/2) and RFC7541 (HPACK), May 2015</li>
</ul>
</li>
<li>“binary framing layer” introduced, not backward compatible with HTTP/1.x
<ul>
<li>a layer above TLS layer and below HTTP/2.0 application layer</li>
<li>use TLS 1.2+</li>
<li>HTTP/1.x protocol is newline delimited plaintext, HTTP/2 is smaller
messages and frames, each is encoded in a binary format</li>
</ul>
</li>
<li>server push: multiple response per single request
<ul>
<li>for example, request index.html, response with related js and css</li>
<li>send with <code>PUSH_PROMISE</code> frame to signal client ahead of what to expect so
client will not send redundant requests</li>
</ul>
</li>
<li>header compression: HPACK
<ul>
<li>implicitly derive request headers from other frames (of other requests)</li>
</ul>
</li>
</ul>
<p>HTTP/2 introduced a new concept of a stream to represent a flow of bytes over a
connection. Each stream comprises multiple messages. The old HTTP request or
response is one message. Each message is a logical request or response. A
message is assembled from frames. Frames are the smallest unit of
communication. Each contains a frame header and carries a specific type of
data, such as a message payload or HTTP headers. Frames are interleaved in a
stream and reassembled to recover the message.</p>
<p>With this, we can multiplex multiple messages in one TCP connection and we can
use a single connection to serve concurrent requests. This is called connection
coalescing and desharding. It makes the TCP connections more long-lived.
However, as there is only a single TCP connection, it may not perform any better than
HTTP/1.x if the packet loss rate is high.</p>
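<p>The stream/message/frame idea can be illustrated with a toy model in Python (a simplification; the real binary framing layer with its frame headers and stream states is not reproduced here):</p>

```python
from itertools import zip_longest

def to_frames(stream_id, message, size=4):
    """Chop one message into (stream_id, payload) frames."""
    return [(stream_id, message[i:i + size])
            for i in range(0, len(message), size)]

def interleave(*frame_lists):
    """Round-robin the frames of several streams onto one connection."""
    wire = []
    for group in zip_longest(*frame_lists):
        wire.extend(f for f in group if f is not None)
    return wire

def reassemble(wire):
    """Receiver groups frames back into per-stream messages."""
    messages = {}
    for stream_id, payload in wire:
        messages[stream_id] = messages.get(stream_id, b"") + payload
    return messages
```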
<p>The HTTP/2 specification does not mandate TLS, but in reality, browsers only
implemented it with TLS.</p>
<h3 id="http3">HTTP/3</h3>
<p>The next step of Google’s invention after SPDY is QUIC. It uses UDP instead of
TCP for the transport. It was named HTTP over QUIC or HTTP/2 over QUIC,
but is now undergoing standardization to be called HTTP/3 in the internet draft.</p>
<p>QUIC <sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> is a UDP-based multiplexed and secure transport. The IETF version requires TLS
1.3 (RFC8446) as the foundation for the crypto and security layer (compared to TLS
1.2, it requires fewer handshake round trips) but uses only “TLS messages”, not
“TLS records”, which the standard OpenSSL API does not support without a patch
(as of now). The mandatory requirement of TLS encryption is a means to combat
<em>protocol ossification</em> <sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>: middle boxes cannot see much of the
protocol passing through and are thus forced to be agnostic about the protocol
detail.</p>
<p>The current implementation of QUIC may not work well. Experiments by Google and
Facebook found that it needs twice the CPU for the same traffic load. This may be because
Linux is not yet optimized for UDP in high-speed transfers and no hardware offloading is
available for TLS over UDP.</p>
<p>HTTP over QUIC therefore uses UDP instead of TCP. It is worth pointing out
that Google has its own version of QUIC which is not interoperable with IETF’s
HTTP/3. Similar to HTTP/2, we have the concepts of streams and frames, but they
are on top of a datagram protocol, so substantial change is needed. Most notably,
QPACK is introduced to replace HPACK, as the latter depends on in-order delivery of
streams inside a TCP connection, which is not guaranteed
over UDP.</p>
<h3 id="quic-and-http3-details">QUIC and HTTP/3 details</h3>
<p>QUIC:</p>
<ul>
<li>offers both 0-RTT and 1-RTT handshakes
<ul>
<li>0-RTT handshake only works if there has been a previous connection
established and a secret from that connection has been cached</li>
</ul>
</li>
<li>the concept originates from TCP Fast Open (RFC7413, Dec 2014): an application can
pass data to the server in the first TCP SYN packet
<ul>
<li>need OS support and network not to interfere with TCP Fast Open</li>
</ul>
</li>
<li>QUIC guarantees in-order delivery within a stream but not between streams
<ul>
<li>a lost packet on one stream triggers recovery operations, but other streams may proceed as usual</li>
</ul>
</li>
</ul>
<p>How QUIC works:</p>
<ul>
<li>connection: single conversation between two QUIC endpoints
<ul>
<li>connection establishment = version negotiation + cryptographic handshake</li>
<li>connection IDs: selected by one endpoint for its peer to use
<ul>
<li>allow change in addressing of lower layer (IP/UDP), i.e. migrate between
IP address and network interfaces (e.g., wifi to cellular)</li>
</ul>
</li>
<li>port numbers: UDP has 16-bit port number field</li>
</ul>
</li>
<li>streams: unidirectional or bidirectional
<ul>
<li>streams between two endpoints may run concurrently, interleaved with other
streams (inherently due to UDP transport), and can be cancelled</li>
<li>each stream is individually flow controlled
<ul>
<li>allow endpoint to limit memory commitment or apply back pressure</li>
</ul>
</li>
<li>stream is an ordered byte stream abstraction</li>
<li>stream IDs: 62-bit integer, with 2 LSBs used to identify the type of stream
<ul>
<li>LSB = initiator, client-initiated = 0, server-initiated = 1</li>
</ul>
</li>
<li>2nd LSB = direction, unidirectional = 1, bidirectional = 0</li>
<li>end point may limit the number of concurrent streams to each of its peer
<ul>
<li>by announcing max stream IDs</li>
</ul>
</li>
<li>multiple streams are not necessarily delivered in the original order
<ul>
<li>multiplexed, with no prioritization</li>
</ul>
</li>
<li>prioritization is the decision of application that uses QUIC, at HTTP layer</li>
</ul>
</li>
<li>0-RTT: allow client send data immediately without waiting for a handshake to complete
<ul>
<li>reuse parameters of the same server from cache</li>
</ul>
</li>
</ul>
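<p>The two-bit stream ID encoding above can be expressed as a small helper (the function name is mine, not from the draft):</p>

```python
def stream_type(stream_id):
    """Decode the two least-significant bits of a QUIC stream ID:
    bit 0 = initiator (0 client, 1 server),
    bit 1 = direction (0 bidirectional, 1 unidirectional)."""
    initiator = "server" if stream_id & 0x1 else "client"
    direction = "unidirectional" if stream_id & 0x2 else "bidirectional"
    return initiator, direction
```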
<p>Operation of HTTP/3:</p>
<ol>
<li>initial connection: done over TCP with possibly parallel attempt via QUIC</li>
<li>then negotiate HTTP/2 in the first handshake</li>
<li>after connection has been set up, the server can tell client the preference
for HTTP/3</li>
<li>
<p>advertise with <code>Alt-Svc</code> (alternate service, RFC7838) header, example (HTTP/3
on UDP port 50781):</p>
<pre><code> Alt-Svc: h3=":50781"
</code></pre>
</li>
</ol>
<p>There are nine different types of HTTP/3 Frames as of Dec 18, 2018. Some of
them are:</p>
<ul>
<li>HEADERS - for compressed HTTP headers
<ul>
<li>compressed using QPACK algorithm, using two additional unidirectional
QUIC streams to carry dynamic table information in either direction</li>
</ul>
</li>
<li>DATA - for binary data contents</li>
<li>GOAWAY - signal for shutdown of this connection</li>
<li>PRIORITY - set priority and dependency on a stream
<ul>
<li>weight value 1 to 256; resources are allocated to streams proportionally
based on their weight</li>
<li>a dependent stream should only be allocated resources if either all of
the streams that it depends on are closed or it is not possible to make
progress</li>
</ul>
</li>
<li>PUSH_PROMISE - server show what request would look like for any server-push</li>
<li>CANCEL_PUSH - client cancel of server-push</li>
</ul>
<p>For example, HTTP request is a bidirectional stream which HEADERS frame
followed by a series of DATA frames plus a possibility of final HEADERS frame
for trailers. HTTP response is returned on the same bidirectional stream,
similarly, is HEADERS + DATA + optional trailing HEADERS frame.</p>
<p>Using UDP rather than TCP brings additional security considerations. One issue is
amplification attacks, whose mitigation relies on clients and servers to
implement. For example, a suggestion is to enforce an initial packet of at
least 1200 bytes, and the server must not send more than 3 times the size of the
request before receiving a client response.</p>
<h1 id="references">References</h1>
<ul>
<li><a href="https://http3-explained.haxx.se/en/">HTTP/3 explained</a> (or <a href="https://github.com/bagder/http3-explained">github source</a>)</li>
<li><a href="https://http2-explained.haxx.se/content/en/">HTTP/2 explained</a></li>
<li><a href="https://github.com/quicwg/base-drafts/wiki/Implementations">Index of QUIC implementations</a></li>
<li><a href="https://tools.ietf.org/html/draft-ietf-quic-recovery-18">Internet draft on QUIC</a></li>
</ul>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>RFC2068 in 1997, RFC2616 in 1999, RFC7230 in 2014 <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>QUIC originally was an acronym Quick UDP Internet Connections <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>ossification means forming of bones, or a protocol cannot be changed due to networking equipment assumed and requires it to operate in an old version behaviour. Any change will be considered bad or illegal by the networking equipment. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Solving Sudoku with Simulated Annealing2019-01-30T19:12:32-05:00https://www.adrian.idv.hk/simanneal<p>Recently I encountered a problem of mixed integer programming with linear
constraints but a non-linear convex objective function. There does not seem to be
any nice solver package available in Python. Even when I looked for
anything with a C/C++ API, there are not many open source solutions, AFAIK.
Therefore my second attempt was to solve the problem heuristically. It turns out
the problem structure may allow a simple hill climbing (i.e., greedy search)
algorithm. However, while the objective function is convex, I cannot prove the
solution domain is contiguous in all cases, hence I cannot guarantee there will
not be any local minima. So playing safe and using simulated annealing can be a good
move.</p>
<p>Simulated annealing is a metaheuristic algorithm which means it depends on a
handful of other parameters to work. But a simple skeleton algorithm is as follows:</p>
<pre><code class="language-python">def simulated_annealing(s0, k_max):
    s = s0
    for k in range(k_max):
        T = temperature(k/k_max)
        s_new = neighbour(s)
        if P(E(s), E(s_new), T) >= random.random():
            s = s_new
    return s
</code></pre>
<p>It starts searching from an initial solution <code>s0</code>, runs for <code>k_max</code> steps, and
depends on an annealing schedule <code>temperature(r)</code>. It keeps picking a
randomly chosen <code>neighbour(s)</code> and accepts that neighbour by a probability
function <code>P()</code>, which in turn depends on the energy level <code>E()</code> of solutions.
The acceptance probability function (e.g., in Metropolis-Hastings algorithm) is
exponential: <script type="math/tex">P(e, e', T) = \exp(-\frac{e'-e}{T})</script>, where <script type="math/tex">e,e'</script> are the
energy levels of <code>s</code> and its neighbour respectively. The function <code>P()</code> may
rise above 1 if the neighbour of <code>s</code> has lower energy (i.e., is better), and in
such a case the neighbour is accepted deterministically.</p>
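<p>The Metropolis acceptance criterion can be sketched directly (function names are my own):</p>

```python
import math
import random

def accept_probability(e_old, e_new, temperature):
    """Metropolis criterion: always accept an improvement; accept a
    worse neighbour with probability exp(-(e_new - e_old) / T)."""
    if e_new < e_old:
        return 1.0
    return math.exp(-(e_new - e_old) / temperature)

def accept(e_old, e_new, temperature):
    """Randomized decision used inside the annealing loop."""
    return accept_probability(e_old, e_new, temperature) >= random.random()
```

<p>At a high temperature almost every move is accepted; as the temperature drops, uphill moves become increasingly rare.</p>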
<h2 id="variations-of-algorithm">Variations of algorithm</h2>
<p>Because simulated annealing is a metaheuristic algorithm, there are many
variations available. A paper, <a href="https://arxiv.org/abs/cs/0001018">Adaptive simulated annealing (ASA): Lessons
learned</a>, gave a very concise summary.</p>
<p>The original simulated annealing is modeled after the Boltzmann equation, and is thus
named Boltzmann Annealing (BA). It has a solution modelled as a <script type="math/tex">d</script>-vector
<script type="math/tex">x=(x^1, \cdots, x^d)</script> with the annealing schedule <script type="math/tex">T(k)=T_0/\log k</script> for
<script type="math/tex">k</script> denoting the step count. It searches the solution space by the probability density</p>
<script type="math/tex; mode=display">g_T(\Delta x) = (2\pi T)^{-d/2}\exp\left(-\frac{\Delta x^2}{2T}\right)</script>
<p>where <script type="math/tex">\Delta x = x_{k+1} - x_k</script> which usually <script type="math/tex">x_k</script> is the current solution and
<script type="math/tex">x_{k+1}</script> is the neighbour. This is modelling the class of physical systems that
called the Gaussian-Markovian systems. And the acceptance probability function is</p>
<script type="math/tex; mode=display">h_T(\Delta E) = \frac{1}{1+\exp(\Delta E/T)} \approx \exp(-\Delta E/T)</script>
<p>where <script type="math/tex">\Delta E = E_{k+1}-E_k</script> and this is also the form of probability
function mentioned above. Boltzmann Annealing is characterized by the
annealing schedule <script type="math/tex">T(k)</script>. It can be proved that with this schedule, any point in
the solution space is sampled infinitely often in annealing time, i.e.,
<script type="math/tex">\lim_{k\to\infty}\sum_k g_k = \infty</script>, or the sampling is ergodic.</p>
<p>If we switch to use the exponential schedule</p>
<script type="math/tex; mode=display">T(k) = cT(k-1) = T_0 c^k \approx T_0 e^{(c-1)k}</script>
<p>(the last approximation holds for <script type="math/tex">c</script> close to 1), this makes the annealing faster but violates the assumption of ergodic
sampling in BA. But people found this useful and named it <em>simulated
quenching</em>. Such a schedule is also called the exponential cooling scheme (ECS).</p>
<p>Another variation is to use the Cauchy distribution for solution probability
density:</p>
<script type="math/tex; mode=display">g_T(\Delta x) = \frac{T}{(\Delta x^2 + T^2)^{(d+1)/2}}</script>
<p>which has a fatter tail than the Gaussian distribution in BA. Therefore, it
is easier to get around local minima while searching for the global minimum.
Moreover, this density attains statistically the same ergodicity as BA if the
annealing schedule is <script type="math/tex">T(k) = T_0/k</script>, due to</p>
<script type="math/tex; mode=display">\lim_{k\to\infty}\sum_k g_k = \frac{T_0}{\Delta x^{d+1}}\sum_k\frac{1}{k} = \infty</script>
<p>Therefore, it is faster while similarly effective. This one is called <em>fast
annealing</em> (FA).</p>
<p>Adaptive Simulated Annealing (ASA) is a further extension that allows each
component <script type="math/tex">x^i</script> of the solution vector <script type="math/tex">x</script> to take a different domain, with
different density function, and undergo a different annealing schedule. This
expands the number of parameters to the algorithm to many folds.</p>
<h2 id="constrained-optimization">Constrained optimization</h2>
<p>The literature on simulated annealing usually does not explicitly concern
constrained optimization. One way to apply SA to constrained optimization is to
implement the logic in the neighbour function so that each neighbour used is
always within the constraints. Another way is to implement the constraints in the
objective function as penalties, e.g.</p>
<script type="math/tex; mode=display">E^{\ast}(x) = E(x) + \frac{1}{T}\sum_i w_i c_i(x)</script>
<p>for each <script type="math/tex">c_i(x)</script> the magnitude of violation of constraint <script type="math/tex">i</script> and <script type="math/tex">w_i</script>
the corresponding weight of the penalty. This approach may not be a good one,
especially if the valid solutions are sparse.</p>
<h2 id="parameter-estimation">Parameter estimation</h2>
<p>The SA algorithm is mostly controlled by the temperature and its annealing schedule.</p>
<p>The widely held belief is that the initial temperature should be set to such
a level that there is at least an 80% probability of accepting an increase in the
objective function. And during the last 40% of the annealing time, the algorithm
should search in the proximity of the optimal solution. One way to find the initial
temperature <script type="math/tex">T_0</script> is to estimate (or observe) the average objective increase
<script type="math/tex">\Delta E^+</script> and then set <script type="math/tex">T_0 = -\frac{\Delta E^+}{\log p_0}</script> for the
initial probability of acceptance (e.g., <script type="math/tex">p_0 = 0.8</script>).</p>
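<p>That estimate is a one-liner; a sketch, assuming we have sampled some uphill moves from a few random walks over the solution space:</p>

```python
import math

def initial_temperature(uphill_moves, p0=0.8):
    """Pick T0 so that the average uphill move is accepted with
    probability p0: T0 = -mean(dE+) / log(p0)."""
    mean_increase = sum(uphill_moves) / len(uphill_moves)
    return -mean_increase / math.log(p0)
```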
<p>The annealing schedule can also be fine-tuned so that more than
one neighbour is searched at every temperature step <script type="math/tex">k</script>. Indeed, for a fixed
temperature, the search and transition between neighbours forms a Markov chain.
We can designate a maximum length <script type="math/tex">L_k</script> for the Markov chain at step <script type="math/tex">k</script> such
that we either check a total of <script type="math/tex">L_k</script> solutions or accept
<script type="math/tex">N_{min} \le L_k</script> solutions, then the temperature is decreased by one
step.</p>
<h2 id="sudoku">Sudoku</h2>
<p>I found a Python package <a href="https://github.com/perrygeo/simanneal">simanneal</a> for
simulated annealing. The SA algorithm is simple enough that we can code our own, so
this package may not be an absolute necessity. However, it is simple enough to
use. Let’s start with the code:</p>
<pre><code class="language-python">#!/usr/bin/env python

import random

from simanneal import Annealer

# from https://neos-guide.org/content/sudoku
_ = 0
PROBLEM = [
    1, _, _, _, _, 6, 3, _, 8,
    _, _, 2, 3, _, _, _, 9, _,
    _, _, _, _, _, _, 7, 1, 6,
    7, _, 8, 9, 4, _, _, _, 2,
    _, _, 4, _, _, _, 9, _, _,
    9, _, _, _, 2, 5, 1, _, 4,
    6, 2, 9, _, _, _, _, _, _,
    _, 4, _, _, _, 7, 6, _, _,
    5, _, 7, 6, _, _, _, _, 3,
]

def print_sudoku(state):
    border = "------+------+------"
    rows = [state[i:i+9] for i in range(0, 81, 9)]
    for i, row in enumerate(rows):
        if i % 3 == 0:
            print(border)
        three = [row[i:i+3] for i in range(0, 9, 3)]
        print(" |".join(
            " ".join(str(x or "_") for x in one)
            for one in three
        ))
    print(border)

def init_solution_row(problem):
    """Generate a random solution from a Sudoku problem"""
    solution = []
    for i in range(0, 81, 9):
        row = problem[i:i+9]
        permu = [n for n in range(1, 10) if n not in row]
        random.shuffle(permu)
        solution.extend([n or permu.pop() for n in row])
    return solution

def coord(row, col):
    return row*9 + col

class Sudoku_Row(Annealer):
    def __init__(self, problem):
        self.problem = problem
        state = init_solution_row(problem)
        super().__init__(state)
    def move(self):
        """randomly swap two cells in a random row"""
        row = random.randrange(9)
        coords = range(9*row, 9*(row+1))
        n1, n2 = random.sample([n for n in coords if self.problem[n] == 0], 2)
        self.state[n1], self.state[n2] = self.state[n2], self.state[n1]
    def energy(self):
        """calculate the number of violations: assume all rows are OK"""
        column_score = lambda n: -len(set(self.state[coord(i, n)] for i in range(9)))
        square_score = lambda n, m: -len(set(self.state[coord(3*n+i, 3*m+j)] for i in range(3) for j in range(3)))
        return sum(column_score(n) for n in range(9)) + sum(square_score(n, m) for n in range(3) for m in range(3))

def main():
    sudoku = Sudoku_Row(PROBLEM)
    sudoku.copy_strategy = "slice"
    sudoku.steps = 1000000
    print_sudoku(sudoku.state)
    state, e = sudoku.anneal()
    print("\n")
    print_sudoku(state)
    print(e)

if __name__ == "__main__":
    main()
</code></pre>
<p>Instead of trying the traditional unconstrained continuous minimization
problem, we go with an unnatural choice: a constrained combinatorial problem.
We pick the sudoku problem from <a href="https://neos-guide.org/content/sudoku">https://neos-guide.org/content/sudoku</a> and
use the <code>simanneal</code> module. The implementation is easy: just derive a class
from <code>Annealer</code> and define two member functions, <code>move()</code> and <code>energy()</code>, for
the neighbour search and the objective function to minimize respectively. The
SA algorithm starts by initializing the <code>Annealer</code> with an initial solution;
calling <code>anneal()</code> then returns the optimal state variable and its
corresponding energy level.</p>
<p>The above program tries to find a permutation of 1 to 9 in each row and
checks whether it arrives at a solution to the sudoku problem. Each step picks
a row and swaps two positions that are not fixed by the puzzle. The energy
function counts the number of distinct elements in each column and square
block and takes the negation of the sum, so a perfect solution scores
<script type="math/tex">-9\times 9\times 2 = -162</script>.
This program does not work well: it never comes up with a correct solution.
So my second attempt shuffles not by row but by square block:</p>
<pre><code class="language-python">import random

import numpy as np
from simanneal import Annealer

# from https://neos-guide.org/content/sudoku
_ = 0
PROBLEM = np.array([
    1, _, _, _, _, 6, 3, _, 8,
    _, _, 2, 3, _, _, _, 9, _,
    _, _, _, _, _, _, 7, 1, 6,
    7, _, 8, 9, 4, _, _, _, 2,
    _, _, 4, _, _, _, 9, _, _,
    9, _, _, _, 2, 5, 1, _, 4,
    6, 2, 9, _, _, _, _, _, _,
    _, 4, _, _, _, 7, 6, _, _,
    5, _, 7, 6, _, _, _, _, 3,
])

def print_sudoku(state):
    border = "------+-------+------"
    rows = [state[i:i+9] for i in range(0, 81, 9)]
    for i, row in enumerate(rows):
        if i % 3 == 0:
            print(border)
        three = [row[i:i+3] for i in range(0, 9, 3)]
        print(" | ".join(
            " ".join(str(x or "_") for x in one)
            for one in three
        ))
    print(border)

def coord(row, col):
    return row*9 + col

def block_indices(block_num):
    """return linear array indices corresponding to a square block,
    row major, 0-indexed:
        block          top-left corner
        0 1 2          (0,0) (0,3) (0,6)
        3 4 5   --&gt;    (3,0) (3,3) (3,6)
        6 7 8          (6,0) (6,3) (6,6)
    """
    firstrow = (block_num // 3) * 3
    firstcol = (block_num % 3) * 3
    return [coord(firstrow+i, firstcol+j) for i in range(3) for j in range(3)]

def initial_solution(problem):
    """given a sudoku problem, generate an initial solution by randomly
    filling each square block without considering row/col consistency"""
    solution = problem.copy()
    for block in range(9):
        indices = block_indices(block)
        given = problem[indices]
        zeros = [i for i in indices if problem[i] == 0]
        to_fill = [n for n in range(1, 10) if n not in given]
        random.shuffle(to_fill)
        for index, value in zip(zeros, to_fill):
            solution[index] = value
    return solution

class Sudoku_Sq(Annealer):
    def __init__(self, problem):
        self.problem = problem
        state = initial_solution(problem)
        super().__init__(state)
    def move(self):
        """randomly swap two non-fixed cells in a random square block"""
        block = random.randrange(9)
        indices = [i for i in block_indices(block) if self.problem[i] == 0]
        m, n = random.sample(indices, 2)
        self.state[m], self.state[n] = self.state[n], self.state[m]
    def energy(self):
        """calculate the number of violations, assuming all square blocks are OK"""
        column_score = lambda n: -len(set(self.state[coord(i, n)] for i in range(9)))
        row_score = lambda n: -len(set(self.state[coord(n, i)] for i in range(9)))
        score = sum(column_score(n) + row_score(n) for n in range(9))
        if score == -162:
            self.user_exit = True  # early quit, we found a solution
        return score

def main():
    sudoku = Sudoku_Sq(PROBLEM)
    sudoku.copy_strategy = "method"
    print_sudoku(sudoku.state)
    #sudoku.steps = 1000000
    #auto_schedule = sudoku.auto(minutes=1)
    #print(auto_schedule)
    #sudoku.set_schedule(auto_schedule)
    sudoku.Tmax = 0.5
    sudoku.Tmin = 0.05
    sudoku.steps = 100000
    sudoku.updates = 100
    state, e = sudoku.anneal()
    print("\n")
    print_sudoku(state)
    print("E=%f (expect -162)" % e)

if __name__ == "__main__":
    main()
</code></pre>
<p><a href="/img/simanneal.py">This version</a> finds a neighbour by picking a square
block at random and swapping two positions within it. Each block is therefore
guaranteed to contain the digits 1 to 9, but we have to check every row and
column for consistency. We also set <code>self.user_exit</code> when a solution is found
so that we can terminate early. The performance is only slightly better (or
maybe not; I did not collect enough statistics to tell). And neither program
ever finds a solution.</p>
<p>However, if we use the <code>auto()</code> function to find the schedule (the
commented-out lines in <code>main()</code> above), the program starts to find a solution,
albeit not always. A few other modifications increased the chance of reaching
a solution:</p>
<p>Firstly, in the <code>simanneal</code> module, the temperature update is based on this formula:</p>
<pre><code>T = self.Tmax * math.exp(Tfactor * step / self.steps)
</code></pre>
<p>where <code>Tfactor</code> is a constant derived from <code>Tmax</code> and <code>Tmin</code> that
guarantees the temperature <code>T</code> goes from <code>Tmax</code> to <code>Tmin</code> over the annealing
schedule. It turns out that with the usual settings the temperature drops too
fast, which hurts the search by trapping it in a sub-optimal solution. If we
change the update to <code>T = 0.99999 * T</code>, a solution is reached more easily. But
we can achieve the same effect by bringing <code>Tmax</code> and <code>Tmin</code> closer to each
other, which leads to the second modification:</p>
<p>Secondly, we set particular values for <code>Tmax</code> and <code>Tmin</code> (and <code>steps</code>).
They were found empirically such that a solution can be reached in a few
seconds instead of minutes. It turns out that a low <code>Tmax</code> with a comparable
<code>Tmin</code> keeps the search effective, because the temperature decreases steadily.
The following chart shows the annealing schedule according to the exponential
decay formula in <code>simanneal</code> above, scaled so both curves match at step 0. The
blue curve uses the values 0.5 and 0.05 respectively, as in the code above.
The red curve uses the default values of 2500 and 2.5. The absolute value of
<code>T</code> does not matter in this chart, but the ratio of <code>Tmax</code> to <code>Tmin</code> does.
The value of <code>T</code> does, however, enter the acceptance probability function,
and it is chosen with that in mind.</p>
<p><img src="/img/simanneal.png" alt="" /></p>
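<p>To see numerically why only the ratio of <code>Tmax</code> to <code>Tmin</code> matters once the
curves are scaled to a common starting point, we can evaluate the decay
formula directly. The sketch below re-implements the formula quoted above
rather than importing anything from <code>simanneal</code>:</p>
<pre><code class="language-python">import math

def schedule(frac, Tmax, Tmin):
    """Temperature at fraction frac of the run, following the formula above
    with Tfactor = -ln(Tmax/Tmin), so T runs from Tmax down to Tmin."""
    return Tmax * math.exp(-math.log(Tmax / Tmin) * frac)

for frac in (0.0, 0.5, 1.0):
    blue = schedule(frac, 0.5, 0.05) / 0.5  # Tmax/Tmin ratio of 10
    red = schedule(frac, 2500, 2.5) / 2500  # Tmax/Tmin ratio of 1000
    print("%.1f blue=%.4f red=%.4f" % (frac, blue, red))
</code></pre>
<p>Halfway through the run the scaled blue curve is still at about 0.32 while
the red one has already dropped to about 0.03, matching the two curves in the
chart.</p>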
<p>How about SA at a constant temperature? Make <code>Tmin</code> and <code>Tmax</code> equal. How
about simple hill climbing instead of SA? Set <code>Tmax</code> as close to zero as
possible so that the acceptance probability of any uphill move is virtually
zero. But a problem of this combinatorial nature is probably not well suited
to hill-climbing search.</p>
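<p>Both extremes fall out of the Metropolis acceptance rule. The helper below
is illustrative only (it is not part of <code>simanneal</code>): with <code>Tmax</code> equal to
<code>Tmin</code> the probability of taking an uphill move stays constant, and as the
temperature approaches zero that probability vanishes, degenerating into hill
climbing:</p>
<pre><code class="language-python">import math

def accept_prob(dE, T):
    """Metropolis rule: accept downhill moves always, uphill moves with
    probability exp(-dE/T)."""
    if dE > 0:
        return math.exp(-dE / T) if T > 0 else 0.0
    return 1.0

for T in (0.5, 0.05, 1e-9):  # warm, cool, and effectively hill climbing
    print("T=%g P(accept dE=1)=%g" % (T, accept_prob(1, T)))
</code></pre>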
<h2 id="reference">Reference</h2>
<p>Lester Ingber, <a href="https://arxiv.org/abs/cs/0001018">Adaptive simulated annealing (ASA): Lessons
learned</a>. Control and Cybernetics, 1995.
arXiv:cs/0001018</p>
<p>Franco Busetti, <a href="http://www.aiinfinance.com/saweb.pdf">Simulated annealing overview</a></p>
<p><a href="https://github.com/erichowens/SudokuSolver">Another implementation of a Sudoku solver in Python</a></p>Adrian S. Tamrighthandabacus@users.github.comRecently I encountered a problem of mixed integer programming with linear constraints but non-linear convex objective function. There do not seem to have any nice solver package available in Python. Even if I tried to look for anything with a C/C++ API, there are not many open source solutions, AFAIK. Therefore my second attempt is to solve the problem heuristically. It turns out the problem structure may allow a simple hill climbing (i.e., greedy search algorithm). However, while the objective function is convex, I cannot prove the solution domain is contiguous in all cases, hence I cannot guarantee there will not be any local minima. So play safe and use simulated annealing can be a good move.