<h1>∫ntegrabℓε ∂ifferentiαℓs</h1>
<p>Unorganised memo, notes, code, data, and writings on random topics, by Adrian S. Tam.</p>
<h1>Multivariate Outlier Detection (2022-01-14)</h1>
<p>For an observation \(x_i\) of a multidimensional variable and the set of
observations \(X\), the Mahalanobis distance tells how far \(x_i\) is from the
center of the data, taking the shape of the dataset into account: a point far
from the center but with many samples nearby in Euclidean space is fine, but a
point that is alone in its neighborhood is an outlier. The Mahalanobis distance
is defined as</p>
\[\text{MD}_i = \sqrt{(x_i-\bar{x})^\top V^{-1} (x_i - \bar{x})}\]
<p>where \(V\) is the sample covariance matrix and \(\bar{x}\) is the sample mean.
Euclidean distance would consider only \(\bar{x}\); it is the covariance
matrix \(V\) that tells how much variation there is in each dimension. The
Mahalanobis distance is scale-invariant. It roughly tells how many standard
deviations the point \(x_i\) is from the center \(\bar{x}\). For
\(p\)-dimensional observations, we can set the cutoff for outliers to
\(\text{MD}_i^2 > \chi_{p,1-\alpha/2}^2\), where \(\chi_{p,v}^2\) is the inverse
CDF at value \(v\) of the \(\chi^2\) distribution with \(p\) degrees of freedom.</p>
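<p>To make this concrete, here is a minimal numpy/scipy sketch (synthetic data, not from the post): compute the squared distances and compare against the \(\chi^2\) cutoff.</p>

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))          # 200 observations, p = 3 dimensions
X[0] = [8.0, 8.0, 8.0]                 # plant an obvious outlier

mean = X.mean(axis=0)
V = np.cov(X, rowvar=False)            # sample covariance matrix
Vinv = np.linalg.inv(V)
diff = X - mean
# squared Mahalanobis distance for every row: (x_i - xbar)' V^{-1} (x_i - xbar)
md2 = np.einsum("ij,jk,ik->i", diff, Vinv, diff)

p = X.shape[1]
cutoff = chi2.ppf(0.975, df=p)         # chi-square inverse CDF at 1 - alpha/2, alpha = 0.05
outliers = np.where(md2 > cutoff)[0]
```

<p>Note that the mean and covariance here still include the outlier, which is exactly why the masking effect discussed next can arise.</p>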
<p>The Mahalanobis distance suffers from the masking effect: multiple
outliers together may fail to produce large distances. To solve this, we need the
sample mean and sample covariance matrix to be robust to the presence of
outliers. Rousseeuw and van Zomeren (1990) proposed the minimum volume
ellipsoid (MVE), which replaces \(\bar{x}\) and \(V\) with a vector \(T\) and a
positive semidefinite matrix \(C\) and defines the <em>robust distance</em>,</p>
\[\text{RD}_i = \sqrt{(x_i-T)^\top C^{-1} (x_i - T)}\]
<p>The vector \(T\) and matrix \(C\) are found such that</p>
\[\begin{aligned}
\min\quad & \det C \\
\text{subject to } &
\big\lvert \big\{i : (x_i - T)^\top C^{-1} (x_i - T) \le a^2 \big\}\big\rvert \ge \frac{n+p+1}{2}
\end{aligned}\]
<p>where \(a^2\) is a constant, chosen to be \(\chi_{p,0.5}^2\) if the majority
of the data is assumed to come from a normal distribution. The observations are
\(p\)-dimensional and the dataset has \(n\) of them. The MVE has a breakdown
point of approximately 50%, since the optimization above constrains at least
half of the data points to have a short robust distance.</p>
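<p>scikit-learn does not ship MVE, but its <code>MinCovDet</code> estimator implements the closely related minimum covariance determinant (MCD), which also yields a robust center \(T\) and scatter \(C\) with a ~50% breakdown point. A sketch with a planted outlier cluster (using MCD as a stand-in for MVE is my substitution, not the post's):</p>

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:10] += 6.0                          # a cluster of 10 outliers, prone to masking

mcd = MinCovDet(random_state=0).fit(X)  # robust location T and scatter C
rd2 = mcd.mahalanobis(X)                # squared robust distances (x-T)' C^{-1} (x-T)

cutoff = chi2.ppf(0.975, df=X.shape[1])
flagged = np.where(rd2 > cutoff)[0]     # all 10 planted outliers show up here
```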
<p>Solving the above optimization problem can be computationally expensive. Hence
Gao et al. (2005) proposed the Max-Eigen difference (MED). The algorithm is as follows:</p>
<ol>
<li>For dataset \(X\), find the sample mean \(\bar{x}\) and covariance matrix
\(V\), then define \(\lambda_1\ge \cdots \ge \lambda_p\) as the eigenvalues of
\(V\) and their corresponding eigenvectors are \(v_1,\cdots,v_p\)</li>
<li>Define \(X^{(i)}\) as the dataset with observation \(x_i\) removed. Find the
covariance matrix \(V^{(i)}\) of \(X^{(i)}\), its eigenvalues
\(\lambda_1^{(i)}\ge \cdots \ge\lambda_p^{(i)}\), and eigenvectors \(v_1^{(i)},\cdots,v_p^{(i)}\)</li>
<li>Define distance
\[ d_i = \lVert \lambda_1^{(i)} v_1^{(i)} - \lambda_1 v_1\rVert \Big(1-\prod_{j=1}^p \mathbb{I}[ y_{ij}^2 \lt \lambda_j]\Big) \]
where \(y_{ij} = (x_i - \bar{x})^\top v_j\) and \(\mathbb{I}[\cdot]\) is
the indicator function. Then the MED is \(d_i\) normalized,</li>
</ol>
\[\text{MED}_{i} = \frac{d_i}{ \sum_{j=1}^n d_j}\]
<p>This works because outliers strongly influence the first eigenvalue and
eigenvector.</p>
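<p>The steps above can be sketched in numpy as follows. This is my own rough illustration of the MED score, not the authors' code; in particular, the sign alignment of the leading eigenvectors is my assumption, since eigenvectors are only defined up to sign.</p>

```python
import numpy as np

def med_scores(X):
    """Max-Eigen difference outlier scores, a sketch of Gao et al. (2005)."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    V = np.cov(X, rowvar=False)
    lam, vec = np.linalg.eigh(V)            # eigh returns ascending order
    lam, vec = lam[::-1], vec[:, ::-1]      # sort so lambda_1 >= ... >= lambda_p
    d = np.empty(n)
    for i in range(n):
        Xi = np.delete(X, i, axis=0)        # leave-one-out dataset X^(i)
        lam_i, vec_i = np.linalg.eigh(np.cov(Xi, rowvar=False))
        lam_i, vec_i = lam_i[::-1], vec_i[:, ::-1]
        v1, v1i = vec[:, 0], vec_i[:, 0]
        if v1 @ v1i < 0:                    # align signs before differencing
            v1i = -v1i
        y = (X[i] - xbar) @ vec             # projections y_ij onto eigenvectors
        inside = np.all(y**2 < lam)         # indicator product: 1 if inside on every axis
        d[i] = np.linalg.norm(lam_i[0] * v1i - lam[0] * v1) * (1 - inside)
    return d / d.sum() if d.sum() > 0 else d

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[5] = [10.0, 10.0, 10.0]                   # the outlier gets the largest MED score
scores = med_scores(X)
```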
<p>These are not the only ways to find outliers. Kannan and Manoj (2015) provide a
survey of some other methods as well. For example, Cook’s distance
considers a linear regression setting \(y_i \sim x_i\) and asks how much one
point can influence the least squares estimate. This distance can be written in
different ways:</p>
\[\begin{aligned}
D_i &= \frac{1}{p\,\text{MSE}} \sum_{j=1}^n (\hat{y}_j - \hat{y}_{j(i)})^2 \\
D_i &= \frac{e_i^2}{p\,\text{MSE}}\left(\frac{h_{ii}}{(1-h_{ii})^2}\right) \\
D_i &= \frac{(\hat{\beta}-\hat{\beta}^{-i})^\top (X^\top X) (\hat\beta - \hat\beta^{-i})}{(1+p)s^2}
\end{aligned}\]
<p>where the model is \(y = \beta^\top x\) and \(\hat\beta\) is the least squares
estimate of \(\beta\). The \(\hat\beta^{-i}\) is the same estimate with
observation \(x_i\) removed. Similarly \(\hat{y}_j = \hat\beta^\top x_j\) and
\(\hat{y}_{j(i)} = (\hat\beta^{-i})^\top x_j\), and \(e_i = y_i - \hat{y}_i\) is
the residual. The quantity \(h_{ii}\) is the
\(i\)-th diagonal element of the hat matrix</p>
\[H = X(X^\top X)^{-1} X^\top\]
<p>In fact, \(h_{ii}\) is the leverage score, which is high if \(x_i\) is an
outlier. \(h_{ii}\in[0,1]\), and we can use the cutoff \(h_{ii}>3p/n\) to flag
an outlier.</p>
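<p>A minimal numpy sketch (synthetic regression data, my own example) that computes the hat matrix, the leverages, and Cook's distance via the second form above:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3                           # p counts the intercept column here
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(scale=0.5, size=n)
y[7] += 10.0                           # contaminate one response

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix H = X (X'X)^{-1} X'
h = np.diag(H)                         # leverage scores h_ii; trace(H) = p
e = y - H @ y                          # residuals e = (I - H) y
mse = (e @ e) / (n - p)
cooks = e**2 / (p * mse) * h / (1 - h)**2   # second form of Cook's distance
```

<p>The contaminated observation dominates: <code>cooks.argmax()</code> picks it out.</p>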
<p>DFFITS is another measure similar to Cook’s distance. Consider the same linear
regression setting, it tells how much the regression function changes if
observation \(x_i\) is removed. The metric is</p>
\[\text{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\sigma_{(i)}\sqrt{h_{ii}}}\]
<p>This metric should be less than 1 for small samples and less than
\(2\sqrt{p/n}\) for large samples. Otherwise the point should be checked as a
potential outlier.</p>
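<p>DFFITS can be computed without refitting \(n\) regressions, using the standard identity for the deleted variance \(\sigma_{(i)}^2\) (this identity is textbook regression algebra, not from the surveys cited here); a numpy sketch on synthetic data:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 2                           # intercept + one slope
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(scale=0.3, size=n)
y[4] += 5.0                            # contaminate one response

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = (e @ e) / (n - p)
# deleted variance: sigma_(i)^2 = ((n-p) s^2 - e_i^2/(1-h_ii)) / (n-p-1)
s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
dffits = e * np.sqrt(h) / (np.sqrt(s2_del) * (1 - h))
threshold = 2 * np.sqrt(p / n)         # large-sample cutoff from the text
flagged = np.where(np.abs(dffits) > threshold)[0]
```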
<h3 id="univariate-case">Univariate case</h3>
<p>The case of identifying outliers in univariate samples is much easier as there
is only one dimension to consider. Manoj and Kannan (2013) provide another survey.</p>
<p>The simplest one is the quantile method, which defines outliers as those with
values below \(Q_1 - 1.5 \text{IQR}\) and those above \(Q_3 + 1.5\text{IQR}\).</p>
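<p>A short numpy sketch (the sample values are made up):</p>

```python
import numpy as np

x = np.array([2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 9.9])
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
# keep only the points outside the 1.5*IQR fences
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
```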
<p>Grubbs’ test is for the hypothesis (\(H_1\)) that the dataset has at least one
outlier against the null hypothesis (\(H_0\)) that there are no outliers. The
statistic depends on the sample mean \(\bar{x}\) and sample standard deviation
\(s\):</p>
\[G = \frac{\max_i \vert x_i - \bar{x}\vert}{s}\]
<p>Bernard Rosner (1983) took this one step further to define the <em>generalized
extreme studentized deviate (ESD) test</em>. It compares the null
hypothesis of no outliers against the hypothesis of up to \(r\) outliers. Define
a statistic similar to Grubbs’:</p>
\[R_i = \frac{\max_j \vert x_j - \bar{x}\vert}{s}\]
<p>Find the observation \(x_j\) that attains the maximum and remove it from the
dataset, recomputing \(\bar{x}\) and \(s\) each time. Repeat this process \(r\)
times to remove the top \(r\) observations. The corresponding statistics are
denoted \(R_1,\cdots,R_r\). The test then defines the critical values:</p>
\[\lambda_i = \frac{(n-i)\,t_{p,n-i-1}}{\sqrt{(n-i-1+t^2_{p,n-i-1})(n-i+1)}}\]
<p>where \(i=1,2,\cdots,r\), \(t_{p,v}\) is the inverse CDF at value \(p\) of
the \(t\) distribution with \(v\) degrees of freedom, and we set
\(p = 1-\frac{\alpha}{2(n-i+1)}\). The \(i\)-th removed observation is an
outlier if \(R_i > \lambda_i\).</p>
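<p>A sketch of the whole procedure (my own implementation of the formulas above, not Rosner's code):</p>

```python
import numpy as np
from scipy.stats import t as tdist

def gesd(x, r, alpha=0.05):
    """Generalized ESD test: return indices of detected outliers (a sketch)."""
    x = np.array(x, dtype=float)
    n = len(x)
    idx = np.arange(n)
    removed, R, lam = [], [], []
    for i in range(1, r + 1):
        j = np.argmax(np.abs(x - x.mean()))           # most extreme remaining point
        R.append(np.abs(x[j] - x.mean()) / x.std(ddof=1))
        removed.append(idx[j])
        x, idx = np.delete(x, j), np.delete(idx, j)   # remove it and repeat
        p = 1 - alpha / (2 * (n - i + 1))
        t = tdist.ppf(p, n - i - 1)
        lam.append((n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1)))
    # the number of outliers is the largest i with R_i > lambda_i
    k = max([i + 1 for i in range(r) if R[i] > lam[i]], default=0)
    return removed[:k]

rng = np.random.default_rng(7)
x = rng.normal(size=100)
x[3], x[50] = 10.0, -9.0               # plant two outliers
detected = gesd(x, r=5)
```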
<p>For an easier test, Dixon (1950) proposed the following: arrange the
observations in order \(x_1\le x_2 \le \cdots \le x_n\) and, depending on the
sample size \(n\), define</p>
\[\begin{aligned}
R_{10} &= \frac{x_n - x_{n-1}}{x_n - x_1} & \text{for }&3\le n\le 7 \\
R_{11} &= \frac{x_n - x_{n-1}}{x_n-x_2} && 8\le n\le 10\\
R_{21} &= \frac{x_n-x_{n-2}}{x_n-x_2} && 11\le n\le 13 \\
R_{22} &= \frac{x_n-x_{n-2}}{x_n-x_3} && 14\le n\le 30
\end{aligned}\]
<p>This value will exceed some critical value if \(x_n\) is an outlier (too
large). Applying the same formulas with the reversed ordering
\(x_1\ge x_2 \ge \cdots \ge x_n\) tests whether \(x_n\) is an outlier at the
lower end (too small).</p>
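<p>The dispatch on sample size can be written directly. The critical values come from Dixon's published tables; the 0.507 mentioned in the comment below is the commonly quoted value for \(n=7\) at \(\alpha=0.05\), not something derived here.</p>

```python
import numpy as np

def dixon_ratio(x):
    """Dixon's ratio for the largest observation, chosen by sample size (a sketch)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    if 3 <= n <= 7:        # R_10
        return (x[-1] - x[-2]) / (x[-1] - x[0])
    if 8 <= n <= 10:       # R_11
        return (x[-1] - x[-2]) / (x[-1] - x[1])
    if 11 <= n <= 13:      # R_21
        return (x[-1] - x[-3]) / (x[-1] - x[1])
    if 14 <= n <= 30:      # R_22
        return (x[-1] - x[-3]) / (x[-1] - x[2])
    raise ValueError("n out of range for this table")

# n=7: the tabulated critical value at alpha=0.05 is about 0.507
q = dixon_ratio([2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 9.0])
```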
<p>Finally, the Hampel method uses the median instead of the mean. Denote the
median of the dataset as \(M_x\) and the deviation from the median as
\(r_i = x_i - M_x\); the median of the absolute deviations is then denoted
\(M_{\vert r\vert}\). The observation \(x_i\) is an outlier if \(\vert r_i\vert \ge 4.5
M_{\vert r\vert}\).</p>
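<p>A numpy sketch of the Hampel identifier (the sample values are made up):</p>

```python
import numpy as np

def hampel_outliers(x, k=4.5):
    """Flag points whose |deviation from median| >= k * MAD."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)                  # M_x
    mad = np.median(np.abs(x - med))    # median absolute deviation M_|r|
    return np.where(np.abs(x - med) >= k * mad)[0]

out = hampel_outliers([1.0, 1.2, 0.9, 1.1, 1.0, 8.0])
```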
<h3 id="references">References</h3>
<ul>
<li>Peter J. Rousseeuw and Bert C. van Zomeren (1990) Unmasking Multivariate Outliers and Leverage Points. Journal of the American Statistical Association, 85(411):633-639</li>
<li>Shaogen Gao, Guoying Li, and Dongqian Wang (2005) A New Approach for Detecting Multivariate Outliers. Communications in Statistics — Theory and Methods, 34:1857-1865</li>
<li>K. Senthamarai Kannan and K. Manoj (2015) Outlier Detection in Multivariate Data. Applied Mathematical Sciences, 9(47):2317-2324</li>
<li>K. Senthamarai Kannan and K. Manoj (2013) Comparison of methods for detecting outliers. International Journal of Scientific & Engineering Research, 4(9):709-714</li>
<li>F. E. Grubbs (1969) Procedures for detecting outlying observations in samples. Technometrics, 11(1):1-21</li>
<li>Bernard Rosner (1983) Percentage Points for a Generalized ESD many-outlier procedure. Technometrics 25(2):165-172</li>
<li>W. J. Dixon (1950) Analysis of extreme values. The Annals of Mathematical Statistics, 21(4):488-506</li>
</ul>
<h1>Building tensorflow 2.7 in Debian (2021-12-14)</h1>
<p>If you just want to use tensorflow, nothing is easier than running <code>pip
install tensorflow</code>. If for any reason you need to recompile it
from source (on Linux), this is what to do.</p>
<h2 id="dependencies">Dependencies</h2>
<p>You will depend on some libraries and tools. Most importantly, you need to add the CUDA sources from nVidia.
Debian does not get as many CUDA packages as Ubuntu, but that is fine:
you can still install the packages for Ubuntu 20.04 on Debian 11. Hence just add
the Ubuntu repository to the apt system.</p>
<p>The packages to install are</p>
<pre><code>apt install bazel bazel-3.7.2
apt install libnvinfer-dev libnvinfer-plugin-dev libnccl-dev
</code></pre>
<h2 id="source">Source</h2>
<p>To get tensorflow source, you can simply</p>
<pre><code>git clone https://github.com/tensorflow/tensorflow
</code></pre>
<p>But remember to check out the tag for the version to compile. For example,</p>
<pre><code>git checkout v2.7.0
</code></pre>
<h2 id="build-process">Build process</h2>
<p>First, we need to run</p>
<pre><code>./configure
</code></pre>
<p>and choose to use CUDA and TensorRT, but not clang, as clang seems unable to
compile it successfully on Debian. Also remember to symlink <code>python</code> to
<code>python3</code>, as the build script requires it.</p>
<p>Then the actual compilation is from the following command:</p>
<pre><code>TMP=/tmp bazel build --config=mkl --config=cuda --config=opt --verbose_explanations --verbose_failures --jobs=6 //tensorflow/tools/pip_package:build_pip_package
</code></pre>
<p>On my system with 16GB of memory, <code>--jobs=6</code> is needed, or the compilation will run out of memory from too many jobs running concurrently.</p>
<p>The compilation will take hours to complete. Afterwards, we can verify the executable is built with</p>
<pre><code>ls ./bazel-bin/tensorflow/tools/pip_package/build_pip_package
</code></pre>
<p>and then we can run the following to build the wheel package (add
<code>--nightly_flag</code> if not on a tagged version):</p>
<pre><code>./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
</code></pre>
<p>The wheel package is stored at <code>/tmp/tensorflow_pkg</code> and we can install it with <code>pip install</code>.</p>
<p>After this is done, we can clean up the files generated by the build process
under <code>~/.cache</code>, which take around 20GB of space.</p>
<h2 id="references">References</h2>
<p>It is useful to refer to the following official documentation on how to
build Tensorflow:</p>
<ul>
<li>https://www.tensorflow.org/install/gpu#linux_setup</li>
<li>https://www.tensorflow.org/install/source</li>
</ul>
<h1>Tensorflow.js quick start (2021-12-11)</h1>
<p>Tensorflow.js is a way to run a tensorflow model in JavaScript, or simply in your
browser. It is huge, but not as huge as the Python tensorflow itself. To
use it, first load the 1.2MB js file from the CDN anywhere in the HTML:</p>
<pre><code class="language-html"><script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@3.12.0/dist/tf.min.js" integrity="sha256-Yl5oUVtHQ3wqFAPCSZmKxzSb/uZt+xzdT9mDPwwNYbk=" crossorigin="anonymous"></script>
</code></pre>
<p>and then a global JavaScript object <code>tf</code> is loaded. Next we need to run the following in JavaScript:</p>
<pre><code class="language-javascript">tf.loadLayersModel("modelpath/model.json").then(function(model) {
window.model = model;
});
</code></pre>
<p>where <code>modelpath/model.json</code> is a path relative to the current HTML. It is
generated by a converter that comes with Tensorflow.js. The key here is the
JavaScript promise method <code>then()</code>, which assigns the model to a
property of the current window. Calling this property
<code>model</code> is just a convention; we can name it something else, especially
if there are multiple models to load.</p>
<p>The way it should be invoked is</p>
<pre><code class="language-javascript">window.model.predict([tf.tensor(x).reshape([n1,n2,n3])]).array().then(
function(output) {
....
}
)
</code></pre>
<p>The input should be converted into a tensor by <code>tf.tensor()</code> function, and
often it should also be reshaped to an appropriate dimension for the model. The
<code>model.predict()</code> function will take time to run, hence a promise function
should be created as well to process the output.</p>
<p>So how should we create the model in the first place? It is natural to
develop the model in Python, as that is more convenient for
experimentation and refinement. As an example, we can train LeNet5 for
MNIST handwritten digit recognition:</p>
<pre><code class="language-python">import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, AveragePooling2D, Dropout, Flatten
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
# Load MNIST data
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Reshape data to shape of (n_sample, height, width, n_channel)
X_train = np.expand_dims(X_train, axis=3).astype('float32')
X_test = np.expand_dims(X_test, axis=3).astype('float32')
# One-hot encode the output
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
# LeNet5 model
model = Sequential([
Conv2D(6, (5,5), input_shape=(28,28,1), padding="same", activation="tanh"),
AveragePooling2D((2,2), strides=2),
Conv2D(16, (5,5), activation="tanh"),
AveragePooling2D((2,2), strides=2),
Conv2D(120, (5,5), activation="tanh"),
Flatten(),
Dense(84, activation="tanh"),
Dense(10, activation="softmax")
])
# Training
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
earlystopping = EarlyStopping(monitor="val_loss", patience=4, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, batch_size=32, callbacks=[earlystopping])
model.save("lenet5.h5")
</code></pre>
<p>This code will train and save the LeNet5 model in HDF5 format. For tensorflow.js, we need to install some tools:</p>
<pre><code>pip install tensorflowjs
</code></pre>
<p>This will install the Python tools for tensorflow.js and then we can run this to do the conversion:</p>
<pre><code>tensorflowjs_converter --input_format keras_saved_model lenet5.h5 lenetjsmodel
</code></pre>
<p>The format must be <code>keras_saved_model</code> if we have the Keras model saved using the <code>save()</code> function. The last argument is the directory name for the tensorflow.js model. This command will produce the following files</p>
<pre><code>lenetjsmodel/group1-shard1of1.bin
lenetjsmodel/model.json
</code></pre>
<p>and the json file is what you provide as the argument to <code>tf.loadLayersModel()</code>.</p>
<p>As an example, this is what you would do to implement this on a web page, which
uses HTML5 canvas for the handwritten digit:</p>
<pre><code class="language-html"><!doctype html>
<html lang="en">
<head>
<title>MNIST Recognition</title>
<style>
#container {
border: 3px solid #fff;
padding: 10px;
width: 655px;
margin: 0 auto; /* center */
}
#canvas, #result {
width: 300px;
height: 300px;
margin: auto;
border: 3px solid #7f7f7f;
float: left;
padding: 10px;
font-size: 120px;
text-align: center;
vertical-align: middle;
}
#reset {
padding: 10px;
text-align: center;
}
#button {
clear: both;
text-align: center;
}
h1 {
margin: 10px;
text-align: center;
}
</style>
</head>
<body>
<script src="https://code.jquery.com/jquery-3.6.0.min.js" integrity="sha256-/xUj+3OJU5yExlq6GSYGSHk7tPXikynS7ogEvDej/m4=" crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@3.12.0/dist/tf.min.js" integrity="sha256-Yl5oUVtHQ3wqFAPCSZmKxzSb/uZt+xzdT9mDPwwNYbk=" crossorigin="anonymous"></script>
<h1>MNIST tfjs test</h1>
<div>
<div id="container">
<canvas id="canvas"></canvas>
<div id="result"></div>
</div>
</div>
<div id="button">
<button id="reset">Reset</button>
</div>
<div id="debug">
<div>
Input:
<span id="lastinput"></span>
</div>
<div>
Result:
<span id="lastresult"></span>
</div>
</div>
<script>
// Load tensorflow model
tf.loadLayersModel("lenetjsmodel/model.json").then(function(model) {
window.model = model;
});
var predict = function(input) {
if (window.model) {
window.model.predict([
tf.tensor(input).reshape([1,28,28,1])
]).array().then(function(scores) {
scores = scores[0]; // convert 2D output into 1D
$("#lastresult").html(scores.map(function(x){return Number(x.toFixed(3))}).toString());
var predicted = scores.indexOf(Math.max(...scores));
$("#result").html(predicted);
});
} else {
        // didn't have the model loaded yet? try again 30ms later
setTimeout(function(){
predict(input);
}, 30);
};
};
// Trigger drawing on canvas
var canvas = document.getElementById("canvas");
var computedStyle = getComputedStyle(document.getElementById("canvas"));
canvas.width = parseInt(computedStyle.getPropertyValue("width"));
canvas.height = parseInt(computedStyle.getPropertyValue("height"));
var context = canvas.getContext("2d"); // to remember drawing
context.strokeStyle = "#FF0000"; // draw in bright red
context.lineWidth = 20; // Will downsize to 28x28, so must be thick enough
var mouse = {x:0, y:0}; // to remember the coordinate w.r.t. canvas
var onPaint = function() {
// event handler for mousemove in canvas
context.lineTo(mouse.x, mouse.y);
context.stroke();
};
$("#reset").click(function() {
// on button click, clear the canvas and result
$("#lastresult").html("");
$("#result").html("");
context.clearRect(0, 0, canvas.width, canvas.height);
});
// HTML5 Canvas mouse event
canvas.addEventListener("mousedown", function(e) {
// mousedown, begin path at mouse position
context.moveTo(mouse.x, mouse.y);
context.beginPath();
canvas.addEventListener("mousemove", onPaint, false);
}, false);
canvas.addEventListener("mousemove", function(e) {
// mousemove remember position w.r.t. canvas
mouse.x = e.pageX - this.offsetLeft;
mouse.y = e.pageY - this.offsetTop;
}, false);
canvas.addEventListener("mouseup", function(e) {
// Stop canvas from further update, then read drawing into image
canvas.removeEventListener("mousemove", onPaint, false);
var img = new Image(); // on load, this will be the canvas in same WxH
img.onload = function() {
// Draw this to 28x28 at top left corner of canvas so we can extract it back
context.drawImage(img, 0, 0, 28, 28);
// Extract data: Each pixel becomes a RGBA value, hence 4 bytes each
var data = context.getImageData(0, 0, 28, 28).data;
var input = [];
for (var i=0; i<data.length; i += 4) {
// scan each pixel, extract first byte (R component)
input.push(data[i]);
};
var debug = [];
for (var i=0; i<input.length; i+=28) {
debug.push(input.slice(i, i+28).toString());
};
$("#lastinput").html(debug.join("<br/>"));
predict(input);
};
img.src = canvas.toDataURL("image/png"); // convert canvas to img and trigger onload()
}, false);
</script>
</body>
</html>
</code></pre>
<h1>Abe &amp; Nakayama (2018) Deep Learning for Forecasting Stock Returns in the Cross-Section (2021-11-30)</h1>
<p>A paper studying cross-section returns, i.e., the returns of multiple securities at
the same point in time. The models are trivial, but the paper's
approach to the problem is good to learn from.</p>
<p>The evaluation in this paper is in two parts. The first considers the
constituents of the MSCI Japan index in Jan 2017, of which there are 319 stocks
covering 85% of the free float-adjusted market cap. A total of 25 factors are
collected, including accounting ratios and market metrics (PB, PE,
earnings/price, sales/price, ROE, ROA, current ratio, equity ratio, asset
growth, EPS, MV, beta, volatility, past year return, trading turnover, etc.). To
make things realistic, some of these features are lagged by 4 months. The goal
is to build a regression model that uses data from \(T,T-3,T-6,T-9,T-12\) to
predict the return at \(T+1\). So there are \(25\times 5=125\) input
features to predict a single-dimensional output.</p>
<p>The training and prediction use 120+1 months of data. The authors collected
25 years of data and use a sliding window of 121 months: a model is trained
on the first 120 months and predicts the final month. There are 180 windows
in the evaluation. MSE and the Spearman correlation coefficient are used as
metrics to compare models. The use of MSE is natural for a regression problem.
The Spearman correlation reflects that a long-short portfolio strategy
rebalances by swapping underperformers for overperformers; hence the ranking
is more important than the actual level of return.</p>
<p>Models considered are fully-connected neural networks (various depths and
configurations), SVR, and random forest. Traditional 3-layer networks are
tried with 70 to 120 hidden units. Also tried are 5-layer and
8-layer networks with hidden neurons in pyramid order. For the deep networks, 50%
dropout is applied. As with the 3-layer networks, the paper also tried 244 to 399
neurons in the hidden layer with 50% dropout. The networks use tanh
activation. SVR uses the RBF kernel with all combinations of \(C=0.1,1,10\),
\(\gamma=10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\), and
\(\epsilon=10^{-2},10^{-1}\). Random forest is tried with 5 to 35 features
and depth 3 to 20. Both SVR and RF use sklearn, while the neural networks
use Tensorflow.</p>
<p>The result prefers an 8-layer deep network with 120-120-70-70-20-20 neurons
over other neural network configurations, based on correlation and MSE. Note
that the correlation is only a bit less than 0.06, which is not particularly
strong. SVR prefers \(C=0.1,\gamma=0.01,\epsilon=0.1\), and RF prefers 25
features with depth 7.</p>
<p>The paper also evaluated the classification problem (market direction). All
models produce an accuracy of around 52 to 54 percent. This is not high,
but it is confirmed (using a t-test?) to be significantly above the null
hypothesis of 50%. The paper also verified that an ensemble of the best NN, SVR,
and RF models outperforms any of them individually.</p>
<p>The second part of the evaluation applies the long-short portfolio
strategy. The portfolio is measured using the Sharpe ratio (return/risk). It
uses net-zero investment: buying the top tertile (3) or quintile (5) and
selling the bottom tertile or quintile. It turns out the different machine
learning models achieve similar return and risk levels, but the neural network does
slightly better.</p>
<h1>Solutions to LaTeX out of memory (2021-11-28)</h1>
<p>LaTeX, as a decades-old system, should not use much memory. But sometimes we
will see it run out of memory. There are various solutions to this; here are
the ones I tried.</p>
<p>My recent experience of making LaTeX run out of memory came from <code>pgfplots</code>:
with a complicated function and a lot of data points, it eats up all the
memory inside the LaTeX stack. Note that this doesn’t mean your system failed to
malloc; rather, it means the TeX stack has overflown.</p>
<p>In the old days, I remember, LaTeX allowed allocating more memory at runtime. If we
run the latex command with extra options:</p>
<pre><code>latex -extra_mem_top=10000000 -extra_mem_bot=10000000 file.tex
</code></pre>
<p>it should run with a bigger stack (hence more elements allowed per page).
However, as I can confirm from the TeX memory usage statistics printed upon the
out-of-memory error, these options do not have any effect; they
are simply ignored, at least in TeXLive 2021 on macOS.</p>
<p>The other way to get more memory is specific to MiKTeX. This is the command
line provided by Chapter 6 of the pgfplots manual:</p>
<pre><code class="language-shell">pdflatex
--stack-size=n --save-size=n
--main-memory=n --extra-mem-top=n --extra-mem-bot=n
--pool-size=n --max-strings=n
</code></pre>
<p>For TeXLive, we have to modify the <code>texmf.cnf</code> file for configurations. The
exact path can be found by <code>kpsewhich texmf.cnf</code> and this is what we can add to
the file:</p>
<pre><code class="language-tex">% newly created file ~/texmf/mytexcnf/texmf.cnf:
% If you want to change some of these sizes only for a certain TeX
% variant, the usual dot notation works, e.g.,
% main_memory.hugetex = 20000000
main_memory = 230000000 % words of inimemory available; also applies to inimf&mp
extra_mem_top = 10000000 % extra high memory for chars, tokens, etc.
extra_mem_bot = 10000000 % extra low memory for boxes, glue, breakpoints, etc.
save_size = 150000 % for saving values outside current group
stack_size = 150000 % simultaneous input sources
% Max number of characters in all strings, including all error messages,
% help texts, font names, control sequences. These values apply to TeX and MP.
pool_size = 1250000
% Minimum pool space after TeX/MP's own strings; must be at least
% 25000 less than pool_size, but doesn't need to be nearly that large.
string_vacancies = 90000
% Maximum number of strings.
max_strings = 100000
% min pool space left after loading .fmt
pool_free = 47500
% Extra space for the hash table of control sequences (which allows 10K
% names as distributed).
hash_extra = 200000
</code></pre>
<p>We should run <code>sudo texhash</code> or <code>sudo fmtutil-sys --all</code> after the <code>texmf.cnf</code>
is updated.</p>
<p>A final method is to use luatex instead of pdflatex, as luatex expands
memory automatically. This can be done simply by replacing the executable
<code>pdflatex</code> with <code>lualatex</code>. It may run slower sometimes, but it allocates
memory automatically without the need to specify sizes up front.</p>
<p>Specific to pgfplots, we can also <em>externalize</em> the plots to save memory. It
works when the page itself is competing with pgfplots for
memory. It will not work if the pgfplots figure itself is large enough to blow up the
memory usage. To use externalization, add this to the preamble:</p>
<pre><code class="language-latex">\usepgfplotslibrary{external}
\tikzexternalize[prefix=path/to/tempfiles/,shell escape=-enable-write18]
\tikzset{external/system call= {pdflatex -save-size=80000
-pool-size=10000000
-extra-mem-top=50000000
-extra-mem-bot=10000000
-main-memory=90000000
\tikzexternalcheckshellescape
-halt-on-error
-interaction=batchmode
-jobname "\image" "\texsource"}}
</code></pre>
<p>where <code>prefix=</code> at <code>tikzexternalize</code> is a path to prepend to the figure files
that externalization generates. Externalization works because it asks
pgfplots to render each figure into a separate PDF first and include them in the final
output. Hence a separate stack and LaTeX process is launched for each figure.</p>
<p>A final note: it is extremely easy to blow up memory with pgfplots when making a
3D parametric plot. The reason is that for a sample count of \(N\), it samples
both variables <code>x</code> and <code>y</code>, hence a total of \(N^2\) sample points are
generated for each plot. Even if you did not use <code>y</code> as a parameter, \(N^2\)
sample points are still generated. Therefore it is crucial to specify the <code>y domain</code>
when <code>y</code> is not used. Below is an example:</p>
<pre><code class="language-tex">\begin{tikzpicture}
\begin{axis}[
xmin=-6,xmax=6,ymin=-6,ymax=6,zmin=-1,zmax=1,
view={20}{60},
grid,
samples=300
]
\addplot3[domain=-180:180, y domain=0:0, blue] (
{(4+sin(20*x))*cos(x)},
{(4+sin(20*x))*sin(x)},
{cos(20*x)}
);
\end{axis}
\end{tikzpicture}
</code></pre>
<h1>Taylor &amp; Letham (2018) Forecasting at Scale (2021-11-08)</h1>
<p>This is the paper for <a href="https://facebook.github.io/prophet/">Facebook Prophet</a>.
It considers a time series \(y(t)\) as a composition of trend, seasonality, and
holidays under a generalized additive model (GAM):</p>
\[y(t) = g(t) + s(t) + h(t) + \epsilon_t\]
<p>where the trend \(g(t)\) is non-periodic, the seasonality \(s(t)\) is periodic,
and the holiday term \(h(t)\) is the effect of holidays, which occur irregularly. The
error term \(\epsilon_t\) is assumed Gaussian. Fitting new components in a GAM
can be done using L-BFGS. Prophet is a curve-fitting model, in contrast to
ARIMA, which is a generative one. Hence data need not be regularly spaced,
and we do not need interpolation for missing data.</p>
<p>Prophet allows the trend \(g(t)\) to be nonlinear and saturating. The basic form is</p>
\[g(t) = \frac{C}{1+\exp(-k(t-m))}\]
<p>where the growth rate is \(k\), the time offset is \(m\), and the growth ceiling is \(C\) (i.e., the
capacity). The extension is a piecewise logistic growth model</p>
\[g(t) = \frac{C(t)}{1+\exp(-(k+\mathbf{a}(t)^T\mathbf{\delta})(t-(m+\mathbf{a}(t)^T\mathbf{\gamma})))}\]
<p>where \(C(t)\) models the non-constant capacity, and \(\mathbf{a}(t)\)
models the change points at which the growth rate is updated. In detail, it is a vector with</p>
\[a_j(t) = \begin{cases}1 & \text{if }t\ge s_j\\ 0 & \text{otherwise}\end{cases}\]
<p>and at time \(s_j\) the growth rate change by \(\delta_j\), hence at any time, the growth rate is given by</p>
\[k + \sum_{j: t> s_j} \delta_j\]
<p>and the time offset should be changed accordingly, which</p>
\[\gamma_j = \big(s_j - m - \sum_{i<j} \gamma_i\big)\big(1-\frac{k+\sum_{i<j}\delta_i}{k+\sum_{i\le j}\delta_i}\big)\]
<p>For a linear trend, it can be simplified into (with \(\gamma_j = -s_j\delta_j\)):</p>
\[g(t) = (k+\mathbf{a}(t)^T\mathbf{\delta})t + (m+\mathbf{a}(t)^T\mathbf{\gamma})\]
<p>The model of rate change has an implication in forecasting. The paper suggested
a prior of \(\delta_j \sim \text{Laplace}(0,\tau)\) and the parameter \(\tau\)
is fitted with data. Then the change and change point will be forecasted as a
simulated stochastic process.</p>
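<p>The piecewise linear trend can be sketched in a few lines of numpy. This is a
minimal illustration, not Prophet's actual implementation; the changepoints
\(s_j\) and rate adjustments \(\delta_j\) are assumed given:</p>

```python
import numpy as np

def piecewise_linear_trend(t, k, m, s, delta):
    """Evaluate g(t) = (k + a(t)·delta) t + (m + a(t)·gamma).

    t: array of times; k: base growth rate; m: offset;
    s: changepoint times s_j; delta: rate adjustments delta_j.
    """
    A = (t[:, None] >= s[None, :]).astype(float)  # indicator matrix a_j(t)
    gamma = -s * delta                            # keeps g continuous at each s_j
    return (k + A @ delta) * t + (m + A @ gamma)

t = np.linspace(0, 10, 101)
g = piecewise_linear_trend(t, k=1.0, m=0.0,
                           s=np.array([3.0, 7.0]),
                           delta=np.array([0.5, -1.0]))
```

Because \(\gamma_j=-s_j\delta_j\), the trend is continuous: at \(t=3\) the slope
changes from 1 to 1.5 without a jump in the value of \(g(t)\).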
<p>The seasonality is approximated using a Fourier series model:</p>
\[s(t) = \sum_{n=1}^N \big( a_n \cos \frac{2\pi nt}{P} + b_n \sin \frac{2\pi nt}{P}\big)\]
<p>In the paper, it is proposed to fit the Fourier model using a seasonality vector:</p>
\[\begin{aligned}
X(t) &= \big[\cos\frac{2\pi(1)t}{365.25}, \cdots, \sin\frac{2\pi(10)t}{365.25}\big] \\
\mathbf{\beta} &= [a_1, b_1, \cdots, a_{10}, b_{10}] \\
s(t) &= X(t)^T \mathbf{\beta}
\end{aligned}\]
<p>The paper claimed that \(N=10\) as above performs well for yearly seasonality
while \(N=3\) is good for weekly. This design choice can be confirmed using
AIC, for example.</p>
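<p>The seasonality vector \(X(t)\) is just a design matrix of cosines and sines.
Below is a minimal sketch; fitting \(\beta\) by ordinary least squares is a
simple stand-in for the paper's Bayesian fit with a
\(\beta\sim\text{Normal}(0,\sigma^2)\) prior:</p>

```python
import numpy as np

def fourier_features(t, period=365.25, N=10):
    """Build X(t): columns cos(2*pi*n*t/P) and sin(2*pi*n*t/P) for n = 1..N."""
    n = np.arange(1, N + 1)
    x = 2 * np.pi * np.outer(t, n) / period   # shape (len(t), N)
    return np.hstack([np.cos(x), np.sin(x)])  # shape (len(t), 2N)

# Fit beta on synthetic data with a pure yearly cycle: s(t) = X(t) beta
t = np.arange(730.0)
y = 3 * np.sin(2 * np.pi * t / 365.25)
X = fourier_features(t)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
s = X @ beta  # recovered seasonal component, matches y up to numerical error
```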
<p>Holiday terms are simple, just a binary function that injects impulse
\(\kappa\) to \(y(t)\) whenever \(t\) is a holiday (as defined by a custom
list). The impulse \(\kappa\sim\text{Normal}(0,\sigma^2)\).</p>
<p>With the model defined, Prophet runs L-BFGS to fit the parameters.</p>Adrian S. Tamrighthandabacus@users.github.comThis is the paper for Facebook Prophet. It considers time series \(y(t)\) as a composition of trend, seasonality, and holidays under generalized additive model (GAM):Glorot & Bengio (2010) Understanding the difficulty of training deep feedforward neural networks2021-10-15T00:00:00-04:002021-10-15T00:00:00-04:00https://www.adrian.idv.hk/gb10-gradient<p>This is the paper that explains what caused the gradient vanishing or exploding
problem in training neural networks. The approach was to experiment with some
fabricated image datasets as well as ImageNet datasets for multi-class
classifications. Then some theoretical derivation is provided to support the
argument.</p>
<p>The setting is a neural network with 1-5 hidden layers, with 1000 units per
layer and softmax logistic regression for output, i.e.,
\(\textrm{softmax}(Wh+b)\). The cost function in training is the average
negative log-likelihood over a minibatch, i.e.,</p>
\[- (y_i \log p_i + (1-y_i)\log (1-p_i)) = -\log \Pr[y\mid x]\]
<p>and the minibatch size is 10. The paper explored different types of activation
function in the hidden layers: sigmoid \((1+\exp(-x))^{-1}\), hyperbolic tangent
\(\tanh(x)\), and softsign \(x/(1+\lvert x\rvert)\). The initialization of
weight follows the uniform distribution</p>
\[w_{ij} \sim U[-\tfrac{1}{\sqrt{n}}, \tfrac{1}{\sqrt{n}}]\]
<p>for \(n\) the number of columns of \(W\), i.e., number of unit in the previous
layer.</p>
<p>Experimental findings: With sigmoid activation, the activation values at the last
hidden layer are quickly pushed to zero, while the other layers keep mean
activations above 0.5 for a long time. Hence the neurons have difficulty
escaping from the saturation regime. If the weights are initialized by some
pretraining, this saturation behavior does not appear.</p>
<p>The paper explained the problem with the sigmoid function as follows:
The softmax output depends on \(b+Wh\), in which \(b\) is updated faster than
\(Wh\). Hence the error gradient will push \(h\) towards zero. A zero \(h\)
means the output of the previous layer was in the saturated regime (which is
specific to sigmoid but not the case for tanh). And in that regime, the slope of
the sigmoid function is flat, making it difficult to escape.</p>
<p>The paper also claimed that the quadratic cost function (i.e., \(\sum_i
(y_i-d_i)^2\) for outputs \(y_i\) and desired states \(d_i\)) contributes to the
vanishing gradient problem, because the log-likelihood cost has a steeper slope
near saturation.</p>
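<p>A one-unit illustration of this claim (my own sketch, not from the paper):
for a sigmoid output unit, the quadratic cost's gradient with respect to the
pre-activation carries a factor \(\sigma'(z)\) and nearly vanishes in the
saturated regime, while the log-likelihood gradient stays proportional to the
error:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_quadratic(z, y):
    # d/dz of (p - y)^2 with p = sigmoid(z): the factor p(1-p) = sigmoid'(z)
    # vanishes when the unit saturates
    p = sigmoid(z)
    return 2 * (p - y) * p * (1 - p)

def grad_log_likelihood(z, y):
    # d/dz of -[y log p + (1-y) log(1-p)]: simplifies to p - y, no sigmoid' factor
    return sigmoid(z) - y

z = 4.0  # a saturated unit predicting p ~ 0.98 while the target is y = 0
print(grad_quadratic(z, 0.0))       # ~ 0.035: tiny despite the large error
print(grad_log_likelihood(z, 0.0))  # ~ 0.98: still proportional to the error
```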
<p>Bradley (2009) found that the variance of the gradient gets smaller as it is back-propagated from the output layer toward the input layer.</p>
<p>Assume a dense network with a symmetric activation function that has unit derivative at 0, i.e.,</p>
\[z^{(i)} = f(W^{(i)}z^{(i-1)}+b^{(i)}) = f(s^{(i)})\]
<p>and with input as \(z^{(0)}=x\). With the cost \(C\), we have</p>
\[\begin{aligned}
\frac{\partial C}{\partial s^{(i)}_k} &= \frac{\partial C}{\partial s^{(i+1)}}\cdot\frac{\partial s^{(i+1)}}{\partial s^{(i)}_k}
= f'(s^{(i)}_k)W^{(i+1)}_{k\,\bullet}\cdot\frac{\partial C}{\partial s^{(i+1)}}
\\
\frac{\partial C}{\partial w_{jk}^{(i)}} &= z_j^{(i)}\frac{\partial C}{\partial s_k^{(i)}}
\end{aligned}\]
<p>We also denote the size of layer \(i\) as \(n_i\).</p>
<p>Since for independent random variables \(X\) and \(Y\), we have</p>
\[Var(XY)=Var(X)Var(Y)+Var(X)\bar{Y}^2+Var(Y)\bar{X}^2,\]
<p>or for \(X_1,X_2,\cdots,X_k\) we have</p>
\[Var(X_1X_2\cdots X_k)=\prod_i (Var(X_i)+\bar{X}_i^2) - \prod_i \bar{X}_i^2\]
<p>(ref: <a href="https://stats.stackexchange.com/questions/52646/">https://stats.stackexchange.com/questions/52646/</a>). Hence if \(\bar{W}=0\) and \(\bar{z}^{i-1}=0\), and assume \(f'(s^{(i)}_k)\approx 1\).</p>
\[Var(z^{(i)}) \approx Var(W^{(i)}z^{(i-1)}) \approx n_iVar(W^{(i)})Var(z^{(i-1)})\]
<p>Applying to all \(i\) layers,</p>
\[Var(z^{(i)}) \approx Var(x)\prod_{j=0}^{i-1} n_j Var(W^{(j)}).\]
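<p>The forward relation \(Var(z^{(i)}) \approx Var(x)\prod_j n_j Var(W^{(j)})\)
is easy to check numerically. Here is a sketch for a simplified linear network
(identity activation and zero bias, consistent with the \(f'\approx 1\)
assumption above), choosing \(n\,Var(W)=1\) so the variance should be preserved
through the layers:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512          # width of every layer
depth = 5
var_w = 1.0 / n  # n * Var(W) = 1, so variance should neither grow nor shrink

x = rng.standard_normal((n, 4096))  # many samples of the n-dimensional input
z = x
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(var_w), size=(n, n))
    z = W @ z  # identity activation, f'(s) = 1
print(np.var(x), np.var(z))  # both stay near 1.0
```

Setting `var_w` to, say, `0.5 / n` instead makes `np.var(z)` shrink
geometrically with depth, which is the vanishing behavior the paper describes.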
<p>Similarly, for the variance of the gradient (where \(d\) is the number of layers in the network),</p>
\[\begin{aligned}
Var(\frac{\partial C}{\partial s^{(i)}}) &\approx Var(\frac{\partial C}{\partial s^{(d)}})\prod_{j=i}^d n_{j+1}Var(W^{(j)}) \\
Var(\frac{\partial C}{\partial w^{(i)}}) &\approx Var(x)Var(\frac{\partial C}{\partial s^{(d)}})\prod_{j=0}^{i-1}n_j Var(W^{(j)})\prod_{j=i}^{d-1}n_{j+1}Var(W^{(j)})
\end{aligned}\]
<p>The goal is to make</p>
\[\begin{aligned}
Var(z^{(i)}) &=Var(z^{(j)}) \\
Var(\frac{\partial C}{\partial s^{(i)}}) &= Var(\frac{\partial C}{\partial s^{(j)}})
\end{aligned}\]
<p>for all layers \(i\) and \(j\) (hence the variance is neither exploding nor vanishing). The solution is</p>
\[n_i Var(W^{(i)}) = n_{i+1} Var(W^{(i)}) = 1\]
<p>An exact solution exists only if all layers are of the same size \(n_i=n\), so the paper proposes a compromise:</p>
\[Var(W^{(i)})=\frac{2}{n_i+n_{i+1}}\]
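<p>This compromise is what deep learning libraries now call Glorot (or Xavier)
initialization. A minimal numpy sketch: since \(U[-a,a]\) has variance
\(a^2/3\), the limit \(a=\sqrt{6/(n_i+n_{i+1})}\) gives the prescribed
variance:</p>

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng=None):
    """Sample a weight matrix with Var(W) = 2/(n_in + n_out).

    For U[-a, a], Var = a^2/3, so a = sqrt(6/(n_in + n_out)).
    """
    rng = rng or np.random.default_rng()
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_out, n_in))

W = glorot_uniform(300, 100, np.random.default_rng(0))
print(W.var(), 2 / (300 + 100))  # empirical variance vs target, both ~ 0.005
```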
<p>A uniform distribution \(U[a,b]\) has variance \(\frac{1}{12}(b-a)^2\). Therefore with the uniform initializer \(U[-1/\sqrt{n},1/\sqrt{n}]\), the variance is \(1/(3n)\), hence \(nVar(W)=\frac13\). This causes the variance of \(z^{(i)}\) to diminish toward the output, and the variance of the gradient \(\partial C/\partial s^{(i)}\) to diminish toward the input. The paper’s proposal is to initialize with</p>
\[W^{(i)} \sim U[-\frac{\sqrt{6}}{\sqrt{n_i+n_{i+1}}}, \frac{\sqrt{6}}{\sqrt{n_i+n_{i+1}}}]\]
<p>which the variance will be \(2/(n_i+n_{i+1})\).</p>Adrian S. Tamrighthandabacus@users.github.comThis is the paper that explains what caused the gradient vanishing or exploding problem in training neural networks. The approach was to experiment with some fabricated image datasets as well as ImageNet datasets for multi-class classifications. Then some theoretical derivation is provided to support the argument.Thinking with type2021-10-03T00:00:00-04:002021-10-03T00:00:00-04:00https://www.adrian.idv.hk/type<p>A nice book for leisure reading, and helps making better decisions on choice of
fonts and page layout. It has a website too: <a href="http://thinkingwithtype.com">http://thinkingwithtype.com</a></p>
<p>Notes below:</p>
<p>Letter</p>
<ul>
<li>Uppercase and lowercase refer to the physical letterpress cases that hold capital letters and minuscule letters</li>
<li>Typeface is the design, font is the delivery mechanism</li>
<li>Gutenberg’s typeface: Blackletter. Then printers of later generations invented their own typefaces, e.g. Jenson, Garamond, Bembo, Palatino, Caslon, Baskerville, Didot
<ul>
<li>The typefaces are named after their printers</li>
<li>First the humanist typefaces, then the transitional and modern</li>
</ul>
</li>
<li>Font sizes are identified by height, measured from the cap line to the descender line. However, the perceived size is the x-height. Align the x-height when mixing fonts</li>
</ul>
<p><img src="https://rhollick.files.wordpress.com/2017/01/letterform.gif" alt="" /></p>
<ul>
<li>Font sizes are measured in points, or pica (12 points = 1 pica)</li>
<li>Font and linespacing: 8/9 Helvetica = 8 pts font and 9 pts line spacing</li>
<li>Numerals that take up uniform widths of space = lining numerals; they were invented in the 1900s for modern business</li>
<li>Quotation marks make spaces at edge of text. This can be solved by <em>hanging punctuations</em> which move them to the margin</li>
</ul>
<p>Text</p>
<ul>
<li>Kerning = space between letters; type designers provide kerning tables to tell how much space goes between different letter combinations
<ul>
<li>metric kerning: Use kerning tables provided by the type designer</li>
<li>optical kerning: computationally adjusted kerning</li>
</ul>
</li>
<li>Tracking = letterspacing; usually applied on headlines to increase spacing for emphasis
<ul>
<li>Negative tracking usually not comfortable; positive tracking on lowercase may look awkward</li>
</ul>
</li>
<li>Leading = line spacing = distance between baseline of one line to another
<ul>
<li>default usually 120% of the type size</li>
</ul>
</li>
<li>Flush right also called ragged left</li>
<li>Paragraph indent is common since 17th century. Size is usually an <em>em space</em> a.k.a. <em>quad</em>, which is approximately a cap height
<ul>
<li>Space after paragraph: Skipping half line should be good enough without too much open space</li>
<li>Paragraphs separated with ∥ in ancient text</li>
</ul>
</li>
</ul>
<p>Grid</p>
<ul>
<li>For control</li>
<li>Golden section page design: Text rectangle in golden section, positioned on a page (e.g., letter paper) with uneven margins (e.g., bottom and right are wider than top and left)</li>
<li>Multicolumn grid design:
<img src="http://thinkingwithtype.com/images/Thinking_with_Type_Grid_5b.gif" alt="" /></li>
<li>Modular grid:
<img src="http://thinkingwithtype.com/images/Thinking_with_Type_Grid_8.gif" alt="" /></li>
</ul>Adrian S. Tamrighthandabacus@users.github.comA nice book for leisure reading, and helps making better decisions on choice of fonts and page layout. It has a website too: http://thinkingwithtype.comvan der Maaten & Hinton (2008) Visualizing data using t-SNE2021-09-17T20:14:51-04:002021-09-17T20:14:51-04:00https://www.adrian.idv.hk/mh08-tsne<p>t-SNE is often used as a better alternative than PCA in terms of visualization.
This paper is the one that proposed it. It is an extension of SNE (stochastic
neighbor embedding), which the first few pages of the paper outline:</p>
<h2 id="sne">SNE</h2>
<p>Assume we have points \(x_1, \cdots, x_N\) in high dimensional coordinates. The
<em>similarity</em> of points \(x_i\) and \(x_j\) is defined as</p>
\[p_{j\mid i} = \frac{\exp(-\Vert x_i - x_j\Vert^2 / 2\sigma_i^2)}{\sum_{k\ne i}\exp(-\Vert x_i - x_k\Vert^2 / 2\sigma_i^2)}\]
<p>where the \(\sigma_i^2\) is the scale parameter to the Gaussian density
function, and it is specific to \(x_i\). The similarity of a point to itself is
defined to be zero, i.e., \(p_{i\mid i}=0\).</p>
<p>If we have a low-dimensional mapping \(y_i\) for each point \(x_i\), we can
compute the similarity \(q_{j\mid i}\) in the same way (but the paper suggests
using a constant \(\sigma_i^2=\frac12\) to remove the denominators in the
exponential functions, to simplify the case in low dimension).</p>
<p>The idea of SNE is that, if the mapping from \(x_i\) to \(y_i\) is perfect,
then \(p_{j\mid i}=q_{j\mid i}\) for all \(i,j\), so we can use the
Kullback-Leibler divergence as the measure of mismatch. We should find \(y_i\)
to minimize the sum of all K-L divergences, i.e., use the following as the cost function:</p>
\[C = \sum_i KL(P_i\Vert Q_i) = \sum_i\sum_j p_{j\mid i}\log\frac{p_{j\mid i}}{q_{j\mid i}}\]
<p>We seek the minimizer \(y_i\) using gradient descent, where</p>
\[\frac{\partial C}{\partial y_i} = 2\sum_j (p_{j\mid i} - q_{j\mid i} + p_{i\mid j} - q_{i\mid j})(y_i - y_j)\]
<p>The variance in \(p_{j\mid i}\) is set such that points \(x_i\) in denser
regions use a smaller \(\sigma_i^2\), as the Shannon entropy \(H\) increases with
\(\sigma_i^2\):</p>
\[H(P_i) = -\sum_j p_{j\mid i}\log_2 p_{j\mid i}\]
<p>Hence we define the perplexity as \(2^{H(P_i)}\), which is a measure of the effective
number of neighbors of point \(x_i\). A typical value of perplexity, as suggested by the
paper, is between 5 and 50. The variance \(\sigma_i^2\) should be set such that
the perplexity is equal for all points. This can be found by binary search
(the perplexity appears to be a monotonic function of the variance).</p>
<p>From the partial derivative \(\partial C/\partial y_i\) we can see that it is a
force in the direction of \(y_i - y_j\), i.e., the point \(y_i\) is subject to a
force toward each other point \(y_j\). The magnitude of the force is
proportional to the <em>stiffness</em> \((p_{j\mid i} - q_{j\mid i} + p_{i\mid j} -
q_{i\mid j})\), in which \(p_{j\mid i}+p_{i\mid j}\) contributes positively and
\(q_{j\mid i}+q_{i\mid j}\) negatively. This means if the similarity in the lower
dimension \(y_i\) is too large compared to the higher dimension \(x_i\), there
will be a net force to push the point away, and vice versa. Equilibrium is
attained when the attraction and repulsion magnitudes are equal.</p>
<p>The paper suggested to use gradient descent with momentum, i.e.,</p>
\[y^{(t)} = y^{(t-1)} + \eta \frac{\partial C}{\partial y} + \alpha(t)\big(y^{(t-1)} - y^{(t-2)}\big)\]
<h2 id="t-sne">t-SNE</h2>
<p>The SNE above is using <em>Gaussian kernels</em>, as \(p_{j\mid i}\) and \(q_{j\mid
i}\) are both using Gaussian density function. The t-SNE is to use Cauchy
distribution (i.e., t distribution with dof=1) for \(q_{j\mid i}\), and
normalized over all pairs \(y_i,y_j\):</p>
\[q_{ij} = \frac{(1+\Vert y_i - y_j\Vert^2)^{-1}}{\sum_{k\ne\ell}(1+\Vert y_k - y_{\ell}\Vert^2)^{-1}}\]
<p>and the similarity in high dimension is symmetrized:</p>
\[p_{ij} = \frac{p_{i\mid j}+p_{j\mid i}}{2n}\]
<p>where \(n\) is the number of points, so that \(\sum_{i,j} p_{ij}=1\).</p>
<p>With the cost function defined in the same way, the gradient is now:</p>
\[\begin{aligned}
C &= KL(P\Vert Q) = \sum_i \sum_j p_{ij} \log\frac{p_{ij}}{q_{ij}} \\
\frac{\partial C}{\partial y_i} &= 4\sum_j (p_{ij} - q_{ij})(y_i - y_j)(1+\Vert y_i - y_j\Vert^2)^{-1}
\end{aligned}\]
<p>The reason for using the t-distribution for \(q_{ij}\) but Gaussian for \(p_{ij}\) is
that the t-distribution is heavy-tailed, so moderately dissimilar points can be
placed farther apart in the low dimension without incurring a large cost.
Therefore, \(q_{ij}\) maintains a reasonable scale in the low dimension.</p>
<h2 id="implementation">Implementation</h2>
<p>The reference implementation is from the author: <a href="https://github.com/lvdmaaten/bhtsne/blob/master/tsne.cpp">https://github.com/lvdmaaten/bhtsne/blob/master/tsne.cpp</a>
and below is how I ported it into Python.</p>
<pre><code class="language-python">import datetime
import sys

import numpy as np


def tSNE(X, no_dims=2, perplexity=30, seed=0, max_iter=1000, stop_lying_iter=100, mom_switch_iter=900):
    """The t-SNE algorithm

    Args:
        X: the high-dimensional coordinates
        no_dims: number of dimensions in output domain
    Returns:
        Points of X in low dimension
    """
    momentum = 0.5
    final_momentum = 0.8
    eta = 200.0
    N, _D = X.shape
    np.random.seed(seed)
    # normalize input
    X -= X.mean(axis=0)   # zero mean
    X /= np.abs(X).max()  # min-max scaled
    # compute input similarity for exact t-SNE
    P = computeGaussianPerplexity(X, perplexity)
    # symmetrize and normalize input similarities
    P = P + P.T
    P /= P.sum()
    # lie about the P-values
    P *= 12.0
    # initialize solution
    Y = np.random.randn(N, no_dims) * 0.0001
    # perform main training loop
    gains = np.ones_like(Y)
    uY = np.zeros_like(Y)
    for i in range(max_iter):
        # compute gradient, update gains
        dY = computeExactGradient(P, Y)
        gains = np.where(np.sign(dY) != np.sign(uY), gains+0.2, gains*0.8).clip(0.1)
        # gradient update with momentum and gains
        uY = momentum * uY - eta * gains * dY
        Y = Y + uY
        # make the solution zero-mean
        Y -= Y.mean(axis=0)
        # Stop lying about the P-values after a while, and switch momentum
        if i == stop_lying_iter:
            P /= 12.0
        if i == mom_switch_iter:
            momentum = final_momentum
        # print progress
        if (i % 50) == 0:
            C = evaluateError(P, Y)
            now = datetime.datetime.now()
            print(f"{now} - Iteration {i}: Error = {C}")
    return Y


def computeExactGradient(P, Y):
    """Gradient of t-SNE cost function

    Args:
        P: similarity matrix
        Y: low-dimensional coordinates
    Returns:
        dY, a numpy array of shape (N,D)
    """
    N, _D = Y.shape
    # compute squared Euclidean distance matrix of Y, the Q matrix, and the normalization sum
    DD = computeSquaredEuclideanDistance(Y)
    Q = 1/(1+DD)
    sum_Q = Q.sum()
    # compute gradient
    mult = (P - (Q/sum_Q)) * Q
    dY = np.zeros_like(Y)
    for n in range(N):
        for m in range(N):
            if n == m:
                continue
            dY[n] += (Y[n] - Y[m]) * mult[n, m]
    return dY


def evaluateError(P, Y):
    """Evaluate t-SNE cost function

    Args:
        P: similarity matrix
        Y: low-dimensional coordinates
    Returns:
        Total t-SNE error C
    """
    DD = computeSquaredEuclideanDistance(Y)
    # Compute Q-matrix and normalization sum
    Q = 1/(1+DD)
    np.fill_diagonal(Q, sys.float_info.min)
    Q /= Q.sum()
    # Sum t-SNE error: sum P log(P/Q)
    error = P * np.log((P + sys.float_info.min) / (Q + sys.float_info.min))
    return error.sum()


def computeGaussianPerplexity(X, perplexity):
    """Compute Gaussian Perplexity

    Args:
        X: numpy array of shape (N,D)
        perplexity: double
    Returns:
        Similarity matrix P
    """
    # Compute the squared Euclidean distance matrix
    N, _D = X.shape
    DD = computeSquaredEuclideanDistance(X)
    # Compute the Gaussian kernel row by row
    P = np.zeros_like(DD)
    for n in range(N):
        found = False
        beta = 1.0
        min_beta = -np.inf
        max_beta = np.inf
        tol = 1e-5
        # iterate until we get a good perplexity
        n_iter = 0
        while not found and n_iter < 200:
            # compute Gaussian kernel row
            P[n] = np.exp(-beta * DD[n])
            P[n, n] = sys.float_info.min
            # compute entropy of current row
            # Gaussians to be row-normalized to make it a probability
            # then H = sum_i -P[i] log(P[i])
            #        = sum_i -P[i] (-beta * DD[n] - log(sum_P))
            #        = sum_i P[i] * beta * DD[n] + log(sum_P)
            sum_P = P[n].sum()
            H = beta * (DD[n] @ P[n]) / sum_P + np.log(sum_P)
            # Evaluate if entropy within tolerance level; H is in nats, so
            # compare against the natural log of the target perplexity
            Hdiff = H - np.log(perplexity)
            if -tol < Hdiff < tol:
                found = True
                break
            if Hdiff > 0:
                min_beta = beta
                if max_beta in (np.inf, -np.inf):
                    beta *= 2
                else:
                    beta = (beta + max_beta) / 2
            else:
                max_beta = beta
                if min_beta in (np.inf, -np.inf):
                    beta /= 2
                else:
                    beta = (beta + min_beta) / 2
            n_iter += 1
        # normalize this row
        P[n] /= P[n].sum()
    assert not np.isnan(P).any()
    return P


def computeSquaredEuclideanDistance(X):
    """Compute squared distance

    Args:
        X: numpy array of shape (N,D)
    Returns:
        numpy array of shape (N,N) of squared distances
    """
    N, _D = X.shape
    DD = np.zeros((N, N))
    for i in range(N-1):
        for j in range(i+1, N):
            diff = X[i] - X[j]
            DD[j][i] = DD[i][j] = diff @ diff
    return DD


def main():
    import tensorflow as tf
    (X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
    print("Dimension of X_train:", X_train.shape)
    print("Dimension of y_train:", y_train.shape)
    print("Dimension of X_test:", X_test.shape)
    print("Dimension of y_test:", y_test.shape)
    n = 200
    rows = np.random.choice(X_train.shape[0], n, replace=False)
    X_data = X_train[rows].reshape(n, -1).astype("float")
    X_label = y_train[rows]
    Y = tSNE(X_data, 2, 30, 0, 1000, 100, 900)
    np.savez("data.npz", X=X_data, Y=Y, label=X_label)


if __name__ == "__main__":
    main()
</code></pre>
<p>From the code, we see a few tweaks that are not mentioned in the paper:</p>
<p>In <code>tSNE()</code>, the main algorithm:</p>
<ul>
<li>The initial points of \(y_i\) are randomized using
\(N(\mu=0,\sigma=10^{-4})\). The small standard deviation will make the
initial points closely packed.</li>
<li>Gradient descent used in the algorithm is not exactly as shown in the
formula above. There is a <em>gain</em> factor multiplied to the gradient.
It starts at 1 and is updated according to the signs of \(\partial C/\partial
y_i\) and \((y_i^{(t-1)} - y_i^{(t-2)})\). It increases by 0.2 if the signs
differ and decays to 80% of its value if the signs agree, with a lower bound of 0.1;
this makes gradient descent take larger steps when it is moving in the
right direction</li>
<li>Initially the values of \(p_{ij}\) are multiplied by 12 and this amplified
version of \(P\) matrix is used to calculate the gradient in the beginning,
then scaled back to the original values of \(P\)</li>
<li>The momentum used in gradient descent will be switched from 0.5 to 0.8 at later stage</li>
</ul>
<p>In <code>computeExactGradient()</code></p>
<ul>
<li>We first compute the multiplier \((p_{ij}-q_{ij})(1+\Vert y_i-y_j\Vert^2)^{-1}\)
as matrix <code>mult</code> to make it computationally more efficient</li>
<li>While the gradient formula above carries a coefficient of 4, it is not used.
Instead, a larger \(\eta\) (200) in <code>tSNE()</code> compensates for it</li>
</ul>
<p>In <code>computeGaussianPerplexity()</code>:</p>
<ul>
<li>The binary search uses a tolerance of \(10^{-5}\), i.e., the computed
entropy may deviate by at most this much from the log of the target
perplexity</li>
<li>Instead of doing bisection search on \(\sigma_i^2\), it searches on
\(\beta=1/(2\sigma_i^2)\). Hence \(\beta\) is larger for a smaller entropy
\(H\). The search will increase \(\beta\) when \(H\) overshoots the target.</li>
<li>The matrix \(P\) is normalized by row, but the diagonal entries are using
<code>sys.float_info.min</code> which is the smallest positive float in the system. This
is essentially zero but avoids division-by-zero error in corner cases.</li>
</ul>
<h2 id="remark">Remark:</h2>
<p>The PDF of t-distribution with dof \(\nu>0\) is defined for \(x\in(-\infty,\infty)\):</p>
\[f(x) = \frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi}\Gamma(\frac{\nu}{2})} \left(1+\frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}\]
<p>which \(\nu=1\) gives Cauchy distribution with location \(x_0=0\) and scale \(\gamma=1\):</p>
\[\begin{aligned}
f(x) &= \frac{\Gamma(1)}{\sqrt{\pi}\Gamma(\frac12)} (1+x^2)^{-1} \\
&= \frac{1}{\sqrt{\pi}\sqrt{\pi}} (1+x^2)^{-1} \\
&= \frac{1}{\pi (1+x^2)}
\end{aligned}\]Adrian S. Tamrighthandabacus@users.github.comt-SNE is often used as a better alternative than PCA in terms of visualization. This paper is the one that proposed it. It is an extension to SNE (stochastic neighbor embedding), which the first few page of the paper outlined it:Excel conventions2021-09-15T00:00:00-04:002021-09-15T00:00:00-04:00https://www.adrian.idv.hk/excel<p>Excel is quite ubiquitous but not many people use it well. The finance industry
uses it a lot and has developed some good practices. The key to using Excel
(or any other spreadsheet) well is to remember that there are two layers in it, the
formula and the result, and it encourages people to edit it as they read it.</p>
<p>The first rule is never to start from cell A1. Keep two rows and two columns empty
so you can easily add something, be it a temporary calculation to try out
an idea, or simply some empty space to right-click on and do something
else. Besides, a spreadsheet allows you to set the size, so you can make
these rows and columns very narrow to conserve screen real estate.</p>
<p>The next rule is to make the worksheet easier to read. Unifying the font and
row height, all numbers in the same format (e.g., 2 d.p. with thousand
separator), text cells all aligned to left while numbers aligned to right, are
some of the trivial ones. But I think the most eye-opening is the color
conventions, which you can easily spot whether a cell is a constant or a
formula:</p>
<table>
<thead>
<tr>
<th>Type</th>
<th>Example</th>
<th>Color</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hard-coded input</td>
<td><code>=12.34</code></td>
<td>Blue</td>
</tr>
<tr>
<td>Formula</td>
<td><code>=D3*2</code></td>
<td>Black</td>
</tr>
<tr>
<td>Link to other worksheet</td>
<td><code>=Sheet2!C3</code></td>
<td>Green</td>
</tr>
<tr>
<td>Link to other files</td>
<td><code>=[Book2]Sheet1!C3</code></td>
<td>Red</td>
</tr>
<tr>
<td>Link to external data providers</td>
<td><code>=BDP("GOOG US Equity", "PX_LAST")</code></td>
<td>Dark red</td>
</tr>
</tbody>
</table>
<p>We color the text and leave the background and border colors for other uses
(e.g., marking different sections, shading alternate rows in a table, or even a
heatmap so interesting spots are easier to identify visually). Also,
to make the sheet less messy, try not to use borders at all but use horizontal
rules at strategic locations; refer to the LaTeX booktabs package for what a
publish-quality table should look like.</p>
<p>Also try not to hide rows and columns, but group them and collapse them. It
would be easier to toggle the visibility.</p>
<p>These should make the most impact already. Of course, there are some other tips
learned from software engineering:</p>
<ul>
<li>keep numeric constants out of formulas; hold them in a cell so they are
easier to modify and easier to understand</li>
<li>comments, to explain the numbers and explain the formula, and put down
references for such decision</li>
<li>avoid link to other files, since this will be easily broken</li>
<li>keep the model and view separate, i.e., put the detailed calculation in one
sheet and executive summary at another</li>
<li>elevator jump: Keep one column or row empty and put a placeholder (e.g., “x”)
at strategic location. Then we can use this row or column to quickly move
around, using Ctrl-Arrow key</li>
</ul>
<p>For the hot keys, these should be productive:</p>
<ul>
<li>Mouse click on cells while holding Ctrl to select multiple cells; then type something with Ctrl-Enter to populate to all cells</li>
<li>Shift-Space to select entire row; Ctrl-Space to select entire column</li>
<li>Ctrl-Plus to add row/column/cell and Ctrl-Minus to delete row/column/cell (mac use Cmd instead of Ctrl); it will automatically add/delete row or column if the entire row/column is already selected</li>
<li>F4 is to repeat last action (e.g., set color/font/format, etc.)</li>
<li>Holding Ctrl or Shift while selecting multiple sheet will create a group of worksheets, then entering on one cell will apply to the same cell on all sheets</li>
<li>Find cells with formula: F5 will pop-up to “Go To” dialog, click on “Special” button at bottom corner and select “Formula” will select all cells with formula
<ul>
<li>similarly, can select all cells that are blank, entered a constant, with an object, etc.</li>
</ul>
</li>
<li>Ctrl-; to enter today’s date</li>
<li>Ctrl-Arrow to move to the end on that direction</li>
<li>Ctrl-Shift-L to toggle filtering of the selected cells
<ul>
<li>Then at header row Alt+down will show the filter menu</li>
</ul>
</li>
<li>Ctrl-9 to hide rows; Ctrl-Shift-9 to unhide</li>
<li>Ctrl-0 to hide columns; Ctrl-Shift-0 to unhide</li>
<li>Selected entire rows/columns, then Shift-Alt-Right will create a group of rows/columns; Shift-Alt-Left will ungroup (mac use Cmd-Shift-K and Cmd-Shift-J)</li>
</ul>
<p>Finally, index and match (which is the modern alternative to VLOOKUP):</p>
<p><code>=INDEX(A1:Z10, 3, 5)</code> gives E3. We can think of <code>=INDEX(array, i, j)</code> as
<code>array[i][j]</code> with row-major array and indices start at 1.</p>
<p><code>=MATCH(needle, haystack, 0)</code> finds an exact match of “needle” in vector
“haystack”. The vector must be either a row array or a column array. The third
parameter can also be 1 or -1: with 1, MATCH finds the largest value less than
or equal to the needle, which requires the vector sorted in ascending order;
with -1, the smallest value greater than or equal to the needle, which requires
descending order. When doing an exact match, the search term “needle”
can be a wildcard, e.g. <code>abc*</code>.</p>
<p>Combining the two, we may write a formula such as <code>=INDEX(F3:P11, MATCH("foo",A3:A11,0), 2)</code></p>