∫ntegrabℓε ∂ifferentiαℓs: unorganised memo, notes, code, data, and writings of random topics. Adrian S. Tam (righthandabacus@users.github.com)

van der Maaten &amp; Hinton (2008) Visualizing data using t-SNE (2021-09-17) https://www.adrian.idv.hk/mh08-tsne<p>t-SNE is often used as a better alternative to PCA for visualization. This is the paper that proposed it. t-SNE is an extension of SNE (stochastic neighbor embedding), which the first few pages of the paper outline:</p> <h2 id="sne">SNE</h2> <p>Assume we have points $$x_1, \cdots, x_N$$ in high-dimensional coordinates. The <em>similarity</em> of points $$x_i$$ and $$x_j$$ is defined as</p> $p_{j\mid i} = \frac{\exp(-\Vert x_i - x_j\Vert^2 / 2\sigma_i^2)}{\sum_{k\ne i}\exp(-\Vert x_i - x_k\Vert^2 / 2\sigma_i^2)}$ <p>where $$\sigma_i^2$$ is the scale parameter of the Gaussian density function, specific to $$x_i$$. The similarity of a point to itself is defined to be zero, i.e., $$p_{i\mid i}=0$$.</p> <p>If we have a low-dimensional mapping $$y_i$$ for each point $$x_i$$, we can do the same to calculate the similarity $$q_{j\mid i}$$ (but the paper suggests using a constant $$\sigma_i^2=\frac12$$ to remove the denominators in the exponential functions, to simplify the case in low dimension).</p> <p>The idea of SNE is that, if the mapping from $$x_i$$ to $$y_i$$ is perfect, then $$p_{j\mid i}=q_{j\mid i}$$ for all $$i,j$$, and we can use the Kullback-Leibler divergence as the measure of mismatch. We should find $$y_i$$ to minimize the sum of all K-L divergences, i.e.
use the following as the cost function:</p> $C = \sum_i KL(P_i\Vert Q_i) = \sum_i\sum_j p_{j\mid i}\log\frac{p_{j\mid i}}{q_{j\mid i}}$ <p>We can seek the minimizer $$y_i$$ using gradient descent, where</p> $\frac{\partial C}{\partial y_i} = 2\sum_j (p_{j\mid i} - q_{j\mid i} + p_{i\mid j} - q_{i\mid j})(y_i - y_j)$ <p>The variance in $$p_{j\mid i}$$ is set such that points $$x_i$$ in denser regions use a smaller $$\sigma_i^2$$, as the Shannon entropy $$H$$ increases with $$\sigma_i^2$$:</p> $H(P_i) = -\sum_j p_{j\mid i}\log_2 p_{j\mid i}$ <p>Hence we define the perplexity as $$2^{H(P_i)}$$, which is a measure of the effective number of neighbors of point $$x_i$$. A typical value of perplexity, as suggested by the paper, is between 5 and 50. The variance $$\sigma_i^2$$ should be set such that the perplexity is equal for all points. This can be found by binary search, since the perplexity is a monotonic function of the variance.</p> <p>From the partial derivative $$\partial C/\partial y_i$$ we can see that it is a force in the direction of $$y_i - y_j$$, i.e., point $$y_i$$ is subject to a force toward each other point $$y_j$$. The magnitude of the force is proportional to the <em>stiffness</em> $$(p_{j\mid i} - q_{j\mid i} + p_{i\mid j} - q_{i\mid j})$$, in which $$p_{j\mid i}+p_{i\mid j}$$ enters positively and $$q_{j\mid i}+q_{i\mid j}$$ negatively. This means that if two points are more similar in the low dimension than in the high dimension, there will be a net force pushing them apart, and vice versa. Equilibrium is attained when the attraction and repulsion magnitudes are equal.</p> <p>The paper suggested using gradient descent with momentum, i.e.,</p> $y^{(t)} = y^{(t-1)} + \eta \frac{\partial C}{\partial y} + \alpha(t)\big(y^{(t-1)} - y^{(t-2)}\big)$ <h2 id="t-sne">t-SNE</h2> <p>The SNE above uses <em>Gaussian kernels</em>, as $$p_{j\mid i}$$ and $$q_{j\mid i}$$ both use the Gaussian density function.
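</p> <p>As an aside, the Gaussian similarity $$p_{j\mid i}$$ defined above can be sketched in a few lines of numpy (a hypothetical helper for illustration, not the paper’s code):</p> <pre><code class="language-python">import numpy as np

def gaussian_similarity(X, i, sigma2):
    """Conditional similarity p_{j|i} of every point to x_i (illustrative sketch)."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)   # squared distances ||x_i - x_j||^2
    p = np.exp(-d2 / (2.0 * sigma2))       # Gaussian kernel with scale sigma_i^2
    p[i] = 0.0                             # p_{i|i} = 0 by definition
    return p / p.sum()                     # normalize over j != i
</code></pre> <p>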
t-SNE instead uses the Cauchy distribution (i.e., the t-distribution with dof=1) for the low-dimensional similarity $$q_{ij}$$, normalized over all pairs $$y_i,y_j$$:</p> $q_{ij} = \frac{(1+\Vert y_i - y_j\Vert^2)^{-1}}{\sum_k\sum_{\ell\ne k}(1+\Vert y_k - y_{\ell}\Vert^2)^{-1}}$ <p>and the similarity in high dimension is symmetrized:</p> $p_{ij} = \frac{p_{i\mid j}+p_{j\mid i}}{2N}$ <p>where the division by $$2N$$ normalizes over all pairs: each row of $$p_{j\mid i}$$ sums to 1, hence $$\sum_{i,j} p_{ij}=1$$.</p> <p>With the cost function defined in the same way, the gradient is now:</p> \begin{aligned} C &amp;= KL(P\Vert Q) = \sum_i \sum_j p_{ij} \log\frac{p_{ij}}{q_{ij}} \\ \frac{\partial C}{\partial y_i} &amp;= 4\sum_j (p_{ij} - q_{ij})(y_i - y_j)(1+\Vert y_i - y_j\Vert^2)^{-1} \end{aligned} <p>The reason for using the t-distribution for $$q_{ij}$$ but the Gaussian for $$p_{ij}$$ is that the t-distribution is heavy-tailed: a moderate distance in the high-dimensional space can be mapped to a larger distance in the low-dimensional space, which counters the crowding of points and maintains a reasonable scale for $$q_{ij}$$ in low dimension.</p> <h2 id="implementation">Implementation</h2> <p>The reference implementation is from the author: <a href="https://github.com/lvdmaaten/bhtsne/blob/master/tsne.cpp">https://github.com/lvdmaaten/bhtsne/blob/master/tsne.cpp</a> and below is how I ported it into Python.</p> <pre><code class="language-python">import datetime
import sys

import numpy as np

def tSNE(X, no_dims=2, perplexity=30, seed=0, max_iter=1000, stop_lying_iter=100, mom_switch_iter=900):
    """The t-SNE algorithm

    Args:
        X: the high-dimensional coordinates
        no_dims: number of dimensions in output domain
    Returns:
        Points of X in low dimension
    """
    momentum = 0.5
    final_momentum = 0.8
    eta = 200.0

    N, _D = X.shape
    np.random.seed(seed)

    # normalize input
    X -= X.mean(axis=0)     # zero mean
    X /= np.abs(X).max()    # min-max scaled

    # compute input similarity for exact t-SNE
    P = computeGaussianPerplexity(X, perplexity)
    # symmetrize and normalize input similarities
    P = P + P.T
    P /= P.sum()
    # lie about the P-values
    P *= 12.0
    # initialize solution
    Y = np.random.randn(N, no_dims) * 0.0001
    # perform main training loop
    gains = np.ones_like(Y)
    uY = np.zeros_like(Y)
    for i in range(max_iter):
        # compute gradient, update gains
        dY = computeExactGradient(P, Y)
        gains = np.where(np.sign(dY) != np.sign(uY), gains+0.2, gains*0.8).clip(0.1)
        # gradient update with momentum and gains
        uY = momentum * uY - eta * gains * dY
        Y = Y + uY
        # make the solution zero-mean
        Y -= Y.mean(axis=0)
        # Stop lying about the P-values after a while, and switch momentum
        if i == stop_lying_iter:
            P /= 12.0
        if i == mom_switch_iter:
            momentum = final_momentum
        # print progress
        if (i % 50) == 0:
            C = evaluateError(P, Y)
            now = datetime.datetime.now()
            print(f"{now} - Iteration {i}: Error = {C}")
    return Y

def computeExactGradient(P, Y):
    """Gradient of t-SNE cost function

    Args:
        P: similarity matrix
        Y: low-dimensional coordinates
    Returns:
        dY, a numpy array of shape (N,D)
    """
    N, _D = Y.shape
    # compute squared Euclidean distance matrix of Y, the Q matrix, and the normalization sum
    DD = computeSquaredEuclideanDistance(Y)
    Q = 1/(1+DD)
    sum_Q = Q.sum()
    # compute gradient
    mult = (P - (Q/sum_Q)) * Q
    dY = np.zeros_like(Y)
    for n in range(N):
        for m in range(N):
            if n == m:
                continue
            dY[n] += (Y[n] - Y[m]) * mult[n, m]
    return dY

def evaluateError(P, Y):
    """Evaluate t-SNE cost function

    Args:
        P: similarity matrix
        Y: low-dimensional coordinates
    Returns:
        Total t-SNE error C
    """
    DD = computeSquaredEuclideanDistance(Y)
    # Compute Q-matrix and normalization sum
    Q = 1/(1+DD)
    np.fill_diagonal(Q, sys.float_info.min)
    Q /= Q.sum()
    # Sum t-SNE error: sum P log(P/Q)
    error = P * np.log( (P + sys.float_info.min) / (Q + sys.float_info.min) )
    return error.sum()

def computeGaussianPerplexity(X, perplexity):
    """Compute Gaussian Perplexity

    Args:
        X: numpy array of shape (N,D)
        perplexity: double
    Returns:
        Similarity matrix P
    """
    # Compute the squared Euclidean distance matrix
    N, _D = X.shape
    DD = computeSquaredEuclideanDistance(X)
    # Compute the Gaussian kernel row by row
    P = np.zeros_like(DD)
    for n in range(N):
        found = False
        beta = 1.0
        min_beta = -np.inf
        max_beta = np.inf
        tol = 1e-5
        # iterate until we get a good perplexity
        n_iter = 0
        while not found and n_iter &lt; 200:
            # compute Gaussian kernel row
            P[n] = np.exp(-beta * DD[n])
            P[n, n] = sys.float_info.min
            # compute entropy of current row
            # Gaussians to be row-normalized to make it a probability
            # then H = sum_i -P[i] log(P[i])
            #        = sum_i -P[i] (-beta * DD[n] - log(sum_P))
            #        = sum_i P[i] * beta * DD[n] + log(sum_P)
            sum_P = P[n].sum()
            H = beta * (DD[n] @ P[n]) / sum_P + np.log(sum_P)
            # Evaluate whether entropy is within tolerance level
            # (H is in nats, so compare against the natural log of the target perplexity)
            Hdiff = H - np.log(perplexity)
            if -tol &lt; Hdiff &lt; tol:
                found = True
                break
            if Hdiff &gt; 0:
                min_beta = beta
                if max_beta in (np.inf, -np.inf):
                    beta *= 2
                else:
                    beta = (beta + max_beta) / 2
            else:
                max_beta = beta
                if min_beta in (np.inf, -np.inf):
                    beta /= 2
                else:
                    beta = (beta + min_beta) / 2
            n_iter += 1
        # normalize this row
        P[n] /= P[n].sum()
    assert not np.isnan(P).any()
    return P

def computeSquaredEuclideanDistance(X):
    """Compute squared distance

    Args:
        X: numpy array of shape (N,D)
    Returns:
        numpy array of shape (N,N) of squared distances
    """
    N, _D = X.shape
    DD = np.zeros((N, N))
    for i in range(N-1):
        for j in range(i+1, N):
            diff = X[i] - X[j]
            DD[j][i] = DD[i][j] = diff @ diff
    return DD

def main():
    import tensorflow as tf
    (X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
    print("Dimension of X_train:", X_train.shape)
    print("Dimension of y_train:", y_train.shape)
    print("Dimension of X_test:", X_test.shape)
    print("Dimension of y_test:", y_test.shape)
    n = 200
    rows = np.random.choice(X_train.shape[0], n, replace=False)  # sample n rows
    X_data = X_train[rows].reshape(n, -1).astype("float")
    X_label = y_train[rows]
    Y = tSNE(X_data, 2, 30, 0, 1000, 100, 900)
    np.savez("data.npz", X=X_data, Y=Y, label=X_label)

if __name__ == "__main__":
    main()
</code></pre> <p>From the code, we see a few tweaks that are not mentioned in the paper.</p> <p>In <code>tSNE()</code>, the main algorithm:</p> <ul> <li>The initial points $$y_i$$ are randomized using $$N(\mu=0,\sigma=10^{-4})$$. The small standard deviation makes the initial points closely packed.</li> <li>Gradient descent in the algorithm is not exactly as shown in the formula above: a <em>gain</em> factor is multiplied into the gradient step. It starts at 1 and is updated according to the signs of $$\partial C/\partial y_i$$ and $$(y_i^{(t-1)} - y_i^{(t-2)})$$: it increases by 0.2 if the signs differ and decreases to 80% of its value if they agree, with a lower bound of 0.1. This makes gradient descent take larger steps when we are moving in the right direction.</li> <li>Initially the values of $$p_{ij}$$ are multiplied by 12 and this amplified version of the $$P$$ matrix is used to calculate the gradient, then scaled back to the original values of $$P$$ later (the paper calls this trick “early exaggeration”)</li> <li>The momentum used in gradient descent is switched from 0.5 to 0.8 at a later stage</li> </ul> <p>In <code>computeExactGradient()</code>:</p> <ul> <li>We first compute the multiplier $$(p_{ij}-q_{ij})(1+\Vert y_i-y_j\Vert^2)^{-1}$$ as the matrix <code>mult</code> to make the computation more efficient</li> <li>While the gradient formula above carries a coefficient 4, it is not used. Instead, a bigger $$\eta$$ (200) is used in <code>tSNE()</code> to compensate</li> </ul> <p>In <code>computeGaussianPerplexity()</code>:</p> <ul> <li>The binary search uses a tolerance of $$10^{-5}$$, so the computed entropy differs from the log of the target perplexity by at most this much</li> <li>Instead of doing bisection search on $$\sigma_i^2$$, it searches on $$\beta=1/(2\sigma_i^2)$$. Hence $$\beta$$ is larger for a smaller entropy $$H$$, and the search reduces $$\beta$$ when $$H$$ overshoots.</li> <li>The matrix $$P$$ is normalized by row, but the diagonal entries are set to <code>sys.float_info.min</code>, which is the smallest positive float in the system.
This is essentially zero but avoids division-by-zero errors in corner cases.</li> </ul> <h2 id="remark">Remark</h2> <p>The PDF of the t-distribution with dof $$\nu&gt;0$$ is defined for $$x\in(-\infty,\infty)$$:</p> $f(x) = \frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi}\Gamma(\frac{\nu}{2})} \left(1+\frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$ <p>where $$\nu=1$$ gives the Cauchy distribution with location $$x_0=0$$ and scale $$\gamma=1$$:</p> \begin{aligned} f(x) &amp;= \frac{\Gamma(1)}{\sqrt{\pi}\Gamma(\frac12)} (1+x^2)^{-1} \\ &amp;= \frac{1}{\sqrt{\pi}\sqrt{\pi}} (1+x^2)^{-1} \\ &amp;= \frac{1}{\pi (1+x^2)} \end{aligned}

Using neovim (2021-09-13) https://www.adrian.idv.hk/nvim<p>Have been using neovim for a few years now and I don’t feel a thing! To remember what should be done in case of reinstallation from scratch, it is better to write down some tips and configuration.</p> <p>Install: you can <code>apt-get</code> or <code>brew install</code> it. The configuration is at <code>~/.config/nvim/</code> but for compatibility with vim, I do this:</p> <pre><code>cd ~
ln -s .vim .config/nvim
ln -s .vimrc .vim/init.vim
</code></pre> <p>and to make it the default,</p> <pre><code>alias vi=nvim
alias vimdiff='nvim -d'
</code></pre> <p>For plug-in management, <a href="https://vi.stackexchange.com/questions/388/">vim-plug</a> is recommended; see <a href="https://www.linode.com/docs/guides/how-to-install-neovim-and-plugins-with-vim-plug/">here for examples of using it with neovim</a>.
Simply speaking, to install vim-plug:</p> <pre><code>curl -fLo ~/.vim/autoload/plug.vim --create-dirs \
    https://raw.githubusercontent.com/junegunn/vim-plug/master/plug.vim
</code></pre> <p>and to install plug-ins, use the <code>:PlugInstall</code> command inside vim; use <code>:PlugUpdate</code> to update the installed plug-ins and <code>:PlugUpgrade</code> to upgrade vim-plug itself. For neovim remote plugins there is also <code>:UpdateRemotePlugins</code>. The list of plugins used is entered in <code>~/.config/nvim/init.vim</code> as:</p> <pre><code>" https://github.com/junegunn/vim-plug
call plug#begin(stdpath('data') . '/plugged')

" tabular plugin is used to format tables
" select lines, then :Tab /&lt;regex&gt; will look for the delimiter and tabularize
Plug 'godlygeek/tabular'

" JSON front matter highlight plugin
Plug 'elzr/vim-json'

" markdown plugin
Plug 'plasticboy/vim-markdown', { 'branch':'master' }

call plug#end()
</code></pre> <p>The <code>Plug</code> command accepts a full URL to a git repository (e.g., <code>https://github.com/junegunn/vim-github-dashboard.git</code>) or a shorthand, <code>junegunn/vim-github-dashboard</code>. You can also choose a git tag or branch as in the last example above.</p> <p>And this is my other config in <code>init.vim</code>:</p> <pre><code>set mouse=a                 " mouse in normal, visual, and insert mode
"set expandtab              " converts tabs to white space
set nocompatible            " disable compatibility to old-time vi
set showmatch               " show matching brackets.
set ignorecase              " case insensitive matching
set hlsearch                " highlight search results
set tabstop=4               " number of columns occupied by a tab character
set softtabstop=4           " see multiple spaces as tabstops so &lt;BS&gt; does the right thing
set shiftwidth=4            " width for autoindents
set autoindent              " indent a new line the same amount as the line just typed
"set number                 " add line numbers
set wildmode=longest,list   " get bash-like tab completions
set cc=100                  " set a 100 column border for good coding style
"set nowrapscan             " search stop at end of file
set cursorline
filetype plugin indent on   " allows auto-indenting depending on file type
syntax on                   " syntax highlighting

" mouse release send selection to clipboard
vmap &lt;LeftRelease&gt; "*ygv

" terminal mode: Use ESC to back to normal mode
tnoremap &lt;Esc&gt; &lt;C-\&gt;&lt;C-n&gt;

" toggle spell check by &lt;F11&gt;, [s and ]s for prev/next spell error, z= for suggestions
" https://jdhao.github.io/2019/04/29/nvim_spell_check/
set spelllang=en,cjk
nnoremap &lt;silent&gt; &lt;F11&gt; :set spell!&lt;cr&gt;
inoremap &lt;silent&gt; &lt;F11&gt; &lt;C-O&gt;:set spell!&lt;cr&gt;
"set spell                  " turn on spell check by default
set spelllang=en_us

" https://stackoverflow.com/questions/597687/changing-variable-names-in-vim
" For local replace
nnoremap gr gd[{V%::s/&lt;C-R&gt;///gc&lt;left&gt;&lt;left&gt;&lt;left&gt;
" For global replace
nnoremap gR gD:%s/&lt;C-R&gt;///gc&lt;left&gt;&lt;left&gt;&lt;left&gt;
</code></pre>

Time Management for System Administrators (2021-09-11) https://www.adrian.idv.hk/timemgmt<p>This is a book I read 15 years ago. It is still very insightful re-reading it today.
Perhaps because the examples are more relevant now, I found the time-management advice more useful. The theme of this book is “do not trust your brain” and therefore, we should offload stuff from our brain.</p> <p>The reason system administrators have a time management problem is the constant interruptions from user requests. There are several ways to tackle this: set up a mutual interruption shield with a coworker to catch all interruptions for each other during one’s project time, or delegate requests to a junior tier-1 helpdesk so 80% of the interruptions are solved there. Interruptions undermine focus during project time and we will get nothing done. Doing tasks as they arrive is to let interrupters manage your time.</p> <p>Avoiding interruptions is the best way to reduce their impact. Measuring twice and cutting once avoids accidents, which cause interruptions to put out the fire. So make a backup before a change. If an accident may still occur, plan a better time to do the work so you can get help or have room to fix it, such as changing tapes in the morning, not in the last hour of work. Also, plan project time around personal or company rhythms to get the best results (e.g., peak time for mental activities, quieter hours, less busy months).</p> <p>The principles of time management are laid out in chapter 1:</p> <ul> <li>one database for all information (e.g. an organizer)</li> <li>conserve your brain power for what’s important</li> <li>develop routines: reuse code libraries, stop reinventing the wheel</li> <li>develop habits and mantras (replace calculations with precomputed decisions, e.g., always yes for these types of questions)</li> <li>maintain focus during project time</li> <li>manage social life with the same tools as work life</li> </ul> <p>These principles are mostly to “conserve brain power”, or what I call <em>reducing cognitive load</em>.
The core of the book is the “Cycle” system, but there are many useful tips that can relieve our brains too:</p> <ul> <li>Use a window manager with virtual screens, so we use one for email, one for monitoring, etc., and do not hunt around for our windows (Ch.2) <ul> <li>i.e., declutter your work environment</li> </ul> </li> <li>Make a habit of putting windows the same way, e.g., document to read on the left and editor on the right (Ch.2)</li> <li>Use Nagios for monitoring, so you get a dashboard of everything (Ch.2)</li> <li>Set up routines to fill gas on Sunday, weekly meetings with the boss, etc. (Ch.3)</li> <li>Routines to clean up things, e.g. meet with delegates every Thursday to troubleshoot their problems and remove roadblocks (Ch.3) <ul> <li>e.g., unsubscribe from mailing lists weekly, revise schedules on the first day of every month</li> </ul> </li> <li>Always say “yes” to certain things, such as “should I write this down” (Ch.3)</li> <li>Set up protocols for managing events, e.g. outages (Ch.3) <ul> <li>who reports, how frequently to update reports, who focuses on fixing the problem</li> <li>so we do not have to think when we are in a hurry</li> </ul> </li> <li>Set up automatic checks (Ch.3), for example <ul> <li>make it a habit to verify the airflow of cooling fans whenever you pass by</li> <li>always keep the key card in your pocket</li> <li>run a continuous ping before plugging/unplugging cables</li> </ul> </li> <li>Set up filters for email, touch each email only once (Ch.10)</li> <li>Use procmail for server-side filtering (Ch.10)</li> <li>Set up a “pickled email” folder to put old email at rest (Ch.10)</li> <li>Do not use email as a to-do list, as it is ephemeral (Ch.10)</li> <li>Use bash aliases, set up hosts in ssh/config, use Makefiles to automate tasks (Ch.13)</li> </ul> <p>The Cycle system proposed by the book is the following:</p> <ul> <li>Use a combined to-do list and schedule</li> <li>get an organizer, either a PDA or a PAA <ul> <li>it has to be portable, reliable, with a calendar, daily to-do list and daily schedule, and some
blank pages for long-term tasks/goals</li> </ul> </li> <li>Write the to-do list and schedule daily, block time for the appointments of the day <ul> <li>unfinished to-do items are copied from yesterday, and new ones are added from a request tracking system</li> <li>then prioritize or reschedule some items</li> <li>to-do items should be manageable small chunks with a time estimate; priority is assigned to short-time, high-impact tasks</li> </ul> </li> <li>Long-term appointments are marked in the calendar of the month or year, until a day is set for them</li> </ul> <p>The keys to making this work:</p> <ul> <li>there should be a single calendar for personal and work lives (or merged calendar views) <ul> <li>reason: reduce cognitive load, and easier to confirm times</li> </ul> </li> <li>need a request tracker, not to be confused with the to-do list <ul> <li>example: RT from Best Practical (or Trello?)</li> <li>“delegate, record, do”: when an issue comes, we try to delegate it to someone, record it in the tracker if not urgent, or do it immediately if urgent</li> <li>the tracker should write down the exact deadline up to the minute (EOD, or 9am)</li> <li>multitasking is for “hurry up and wait”, e.g. downloading a large file; while you wait, remember to put it down on the to-do list so you do not forget to come back</li> </ul> </li> <li>tasks are prioritized by return on investment (ROI): do those that give the biggest impact with the least time involved first</li> </ul> <p>Some other ideas that do not fit into anything above:</p> <ul> <li>high self-esteem allows you to take risks and give yourself the opportunity to win</li> <li>manage your boss: make sure your boss knows your goals and work together with them</li> <li>don’t solve a political issue with technology</li> <li>sometimes your boss is measured in a way that unintentionally promotes bad behavior.
They should either learn how to make better metrics or not manage using metrics</li> <li>a person with a martyr complex assumes that because she is paying such a great price to keep the company running, everyone owes her something. This is toxic</li> <li>a long vacation helps check whether documentation is good and complete</li> <li>smokers are relaxed at work, because they take breaks every couple of hours</li> <li>there are status meetings (report status) and work meetings (get things done); make it clear which one to expect so people prepare with the right mindset, and it also helps facilitators cut off inappropriate discussions</li> <li>set a meeting at a strange time like 1:54pm so people will be curious enough to arrive on time to check out why</li> </ul>

Diskless Debian (2021-08-07) https://www.adrian.idv.hk/diskless<p>This documents my attempt to create a diskless, PXE-boot, NFS-supported Debian system. The device is an Intel NUC and it is served by OpenWRT routers. There are previous posts about <a href="/2020-11-24-pxeboot">PXE</a>. But in summary, the workflow of such a system is</p> <ol> <li>The client boots with the PXE stack on its network card, which makes a DHCP request</li> <li>The DHCP server responds with an IP address and the location of the boot program</li> <li>The client, according to the DHCP reply, requests the network boot program (NBP) from TFTP</li> <li>The client passes ownership to the NBP for the next stage of boot, which in the case of pxelinux, will show a boot menu and load the kernel</li> </ol> <p>My previous attempt using DD-WRT did not work well.
OpenWRT, however, has better kernel support for NFS, which makes everything smooth.</p> <h2 id="dhcp-set-up">DHCP set up</h2> <p>In OpenWRT, the DHCP server is <a href="https://openwrt.org/docs/guide-user/base-system/dhcp">dnsmasq</a>, whose configuration is located at <code>/etc/config/dhcp</code> (OpenWRT specific) and <code>/etc/dnsmasq.conf</code> (dnsmasq default). When we run it as a daemon, OpenWRT will create a new config at <code>/var/etc</code> that loads the latter and applies the attributes from the former. Hence it is better to modify <code>/etc/config/dhcp</code>. Below is a new section appended to the end:</p> <pre><code class="language-text">config boot linux
	option filename 'pxelinux.0'
	option serveraddress '192.168.0.2'
	option servername 'tftpserver'
</code></pre> <p>The above will be converted into a line <code>dhcp-boot=pxelinux.0,tftpserver,192.168.0.2</code> in the <code>dnsmasq.conf</code>, and delivered as DHCP option 66 (TFTP server address) and option 67 (boot filename) in the reply. In fact, we can make this more specific to one host; for example, the below is a host-based configuration keyed on the MAC address:</p> <pre><code class="language-text">config host 'myhost'
	option ip '192.168.0.123'
	option mac '01:23:45:ab:cd:ef'
	option tag 'NOPXE'
</code></pre> <h2 id="tftp-set-up">TFTP set up</h2> <p>TFTP is also supported in dnsmasq. All we need to do is add these two lines into <code>/etc/dnsmasq.conf</code>:</p> <pre><code class="language-text">enable-tftp
tftp-root=/path/to/tftproot
</code></pre> <p>or equivalently, add to <code>/etc/config/dhcp</code>:</p> <pre><code class="language-text">config dnsmasq
	# ...
	option enable_tftp '1'
	option tftp_root '/path/to/tftproot'

config dhcp 'lan'
	option interface 'lan'
	option dhcp_range '192.168.0.1,proxy'
	option start '100'
	option leasetime '12h'
	option limit '150'
	option dynamicdhcp '0'
</code></pre> <p>The above configuration is for using an OpenWRT device to serve TFTP other than the one responding to DHCP. Otherwise, we do not need to modify the <code>lan</code> section to set the <code>dhcp_range</code> and <code>dynamicdhcp</code> options. But we must ensure that dnsmasq is listening on the interface that serves TFTP. It will not work if we ignored the interface for DHCP. If we set this up in the LuCI web interface:</p> <ul> <li>in Network → Interface → LAN → DHCP server → General setup, uncheck “Ignore interface”</li> <li>in Network → Interface → LAN → DHCP server → Advanced settings, uncheck “Dynamic DHCP” and uncheck “Force”</li> <li>in Network → DHCP and DNS → General settings, uncheck “Authoritative”</li> <li>in Network → DHCP and DNS → TFTP settings, check “Enable TFTP server” and enter the full path for “TFTP server root”</li> </ul> <p>To test, run “tftp &lt;address&gt;” and then try to “get &lt;filename&gt;” for a filename (or path) relative to the tftp root.</p> <h2 id="nfs-system">NFS system</h2> <p>We used NFSv4 under OpenWRT. The kernel-space server is used, as it seems more robust than the user-space server in the case of OpenWRT systems. The disks are mounted externally, as an OpenWRT device usually would not have enough built-in storage.
We need to set up the mount in <code>/etc/config/fstab</code> with the following:</p> <pre><code class="language-text">config global automount
	option from_fstab 1
	option anon_mount 1

config global autoswap
	option from_fstab 1
	option anon_swap 0

config mount
	option target /tmp/extstorage
	option device /dev/sda1
	option enabled 1
	option enabled_fsck 0
	option options 'noexec,noatime,nodiratime,nodev'
</code></pre> <p>and for USB storage, we will also need to</p> <pre><code>opkg install usbutils kmod-usb-storage block-mount
</code></pre> <p>For NFS then, we need to do</p> <pre><code>opkg install nfs-kernel-server
</code></pre> <p>and then edit <code>/etc/exports</code> with</p> <pre><code class="language-text">/tmp *(ro,all_squash,insecure,no_subtree_check,sync,fsid=0)
/tmp/extstorage *(rw,all_squash,insecure,no_subtree_check,nohide,sync,fsid=1)
/tmp/extstorage/diskless *(rw,no_root_squash,insecure,no_subtree_check,nohide,sync,fsid=3)
</code></pre> <p>Note here that the external drive is mounted at a mount point under <code>/tmp</code>, not the usual <code>/mnt</code>. This is because in OpenWRT, <code>/mnt</code> is on an overlay mount, and mounting something under there will raise an error. <code>/tmp</code>, however, is a tmpfs, which does not have such a problem.</p> <p>Usually NFS will do root squash to make all files accessed through NFS owned by nobody (the exact user ID depends on the system). However, we do not want this for our diskless systems, so that different users, root included, can still be distinguished.</p> <h2 id="pxelinux-configuration">PXElinux configuration</h2> <p>In the TFTP server, under the tftp root, we should make a copy of <code>pxelinux.0</code> from syslinux and put the <code>*.c32</code> files (e.g. the 32-bit BIOS version) under <code>boot/isolinux/</code>.
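</p> <p>This copying step can be sketched as below; the source paths here are assumptions based on Debian’s <code>pxelinux</code> and <code>syslinux-common</code> packages, so adjust them to wherever syslinux lives on your system:</p> <pre><code># TFTPROOT is a placeholder; point it at your actual tftp root
TFTPROOT=/path/to/tftproot
mkdir -p "$TFTPROOT/boot/isolinux"
# pxelinux.0 and the *.c32 modules come from the syslinux packages (Debian paths assumed)
cp /usr/lib/PXELINUX/pxelinux.0 "$TFTPROOT/"
cp /usr/lib/syslinux/modules/bios/*.c32 "$TFTPROOT/boot/isolinux/"
</code></pre> <p>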
Then we can set up, for example, the below config at <code>pxelinux.cfg/default</code>:</p> <pre><code class="language-text">DEFAULT menu.c32
PROMPT 1
TIMEOUT 30
ONTIMEOUT Debian

LABEL reboot
	MENU LABEL reboot computer
	COM32 reboot.c32

LABEL local
	MENU LABEL boot local drive
	LOCALBOOT 0

LABEL Debian
	MENU LABEL Debian Buster
	KERNEL vmlinuz
	APPEND vga=858 rw ip=dhcp initrd=initrd.img root=/dev/nfs nfsroot=192.168.0.2:/tmp/extstorage/diskless/NUC ipv6.disable=1
</code></pre> <p>Then we also need to copy the kernel (<code>vmlinuz</code>) and the init ramdisk (<code>initrd.img</code>) to the tftp root, which can easily be done if we have installed a system on another machine.</p> <h2 id="boot-disk-set-up">Boot disk set up</h2> <p>The easiest way to create a boot disk is to get a separate computer, install Debian, then create a tarball out of the root mount. Otherwise, we can also use</p> <pre><code>debootstrap --no-merge-usr bullseye /path/install http://ftp2.us.debian.org/debian
</code></pre> <p>to create a barebone Debian install. After we copy over all the files into the NFS server under a dedicated directory (e.g., <code>/tmp/extstorage/diskless/NUC</code> in my case), we still need to make sure the kernel and the init ram disk are accessible from TFTP. A symbolic link of the two files into the tftp root should suffice.</p> <p>The next thing is to prepare the copied files to be usable in the diskless environment. One dedicated copy per machine would be best. But at the least, we need to update <code>/etc/hostname</code>, and <code>/etc/network/interfaces</code> if not using DHCP. Chrooting in to <code>apt-get install openssh-server</code> would also be necessary if the diskless device is also headless.</p> <p>One key thing for the kernel: it will be delivered via TFTP and the root will be mounted over NFS. Hence the kernel should support NFS access. If not, we need to create an init ram disk with those modules.
In Debian, what we need is to add <code>BOOT=nfs</code> into <code>/etc/initramfs-tools/initramfs.conf</code> and then run <code>mkinitramfs -d /etc/initramfs-tools -o path/to/initrd.img</code>.</p> <h2 id="first-boot">First boot</h2> <p>After all these are done, the diskless device should be able to boot with the root mounted via NFS. A test run (which fails in DD-WRT but works in OpenWRT) is to sudo and run <code>apt-get update; apt-get dist-upgrade</code>. This tests whether NFS can handle the permissions correctly. We should also check <code>/proc</code> and <code>/dev</code> to make sure the procfs and device fs are mounted correctly (local, not related to NFS).</p> <p>Caveat: a diskless system over Ethernet is slow, because Ethernet is way slower than a SATA connection.</p>

Hurst parameter and fractional Brownian motion (2021-07-26) https://www.adrian.idv.hk/hurst<p>I was introduced to the concepts of self-similarity and long-range dependency of a time series by the seminal paper <a href="http://ccr.sigcomm.org/archive/1995/jan95/ccr-9501-leland.pdf">On the Self-Similar Nature of Ethernet Traffic</a> by Leland et al (1995). The Hurst parameter, or the Hurst exponent, is the key behind all these.</p> <p>If we consider a Brownian motion, regardless of scale, we always have the property that the standard deviation of the process is proportional to the square root of time, namely, $$B_t - B_s \sim N(0, t-s)$$ in distribution. The Brownian motion is memoryless, hence no long-range dependency.
When we generalize the Brownian motion, we can consider a zero-mean process $$B_H(t)$$ with the property</p> $\langle\vert B_H(t+\tau) - B_H(t)\vert^2\rangle \sim \tau^{2H}$ <p>namely, the mean of the squared difference is proportional to the time window raised to the power $$2H$$. The range of $$H$$ is from 0 to 1, and Brownian motion has $$H=0.5$$. The parameter $$H$$ is the Hurst exponent. The fractal dimension is defined in terms of the Hurst exponent as $$D=2-H$$.</p> <p>In J. Feder’s book <em>Fractals</em> (1998), it is recounted how Hurst calculated the Hurst exponent for the water level in Lake Albert. Hurst denotes the influx of year $$t$$ as $$\xi(t)$$ and the discharge as $$\langle\xi\rangle_\tau$$, where</p> $\langle\xi\rangle_\tau = \frac{1}{\tau}\sum_{t=1}^\tau \xi(t)$ <p>The accumulation is therefore the running sum</p> $X(t)=\sum_{u=1}^t\left(\xi(u)-\langle\xi\rangle_\tau\right)$ <p>The range is defined as</p> $R(\tau) = \max_{t: t=1,\cdots,\tau} X(t) - \min_{t: t=1,\cdots,\tau} X(t)$ <p>and the standard deviation is defined as</p> $S=\sqrt{\frac{1}{\tau}\sum_{t=1}^\tau\left(\xi(t)-\langle\xi\rangle_\tau\right)^2}$ <p>Hurst found that $$R/S=(\tau/2)^H$$, where the LHS is called the <em>rescaled range</em>, which is proportional to $$\tau^H$$. This can be understood intuitively if we consider that the range is roughly a measure of the standard deviation, whose square is the variance and is proportional to $$\tau^{2H}$$.</p> <h2 id="determining-hurst-exponent">Determining Hurst exponent</h2> <p>If we are given a time series $$X(t)$$, how could we find its Hurst exponent (and hence tell if it is Brownian)?</p> <p>The intuitive way is using Hurst’s empirical method: with different time ranges $$\tau$$, find the rescaled range $$R/S$$ and then fit for the parameter $$H$$ using $$R/S = C\tau^H$$ for some constant $$C$$. But as the time range $$\tau$$ varies, we may be able to fit multiple windows into the input time series.
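For a single window, the rescaled range follows directly from the definitions above. A minimal sketch (the function name is mine, not from the book):

```python
import numpy as np

def rescaled_range(xi):
    """R/S of an influx series xi over its full span tau = len(xi)."""
    xi = np.asarray(xi, dtype=float)
    mean = xi.mean()                      # the discharge, i.e. the mean influx
    X = np.cumsum(xi - mean)              # accumulated deviation X(t)
    R = X.max() - X.min()                 # range R(tau)
    S = np.sqrt(np.mean((xi - mean)**2))  # standard deviation S
    return R / S

print(rescaled_range([1.0, 2.0, 3.0, 2.0, 1.0]))
```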
Hence multiple rescaled ranges can be computed, and we can take their average for that particular $$\tau$$.</p> <p>Here is the code:</p> <pre><code class="language-python">def hurst_rs(ts, min_win=5, max_win=None):
    """Find Hurst exponent using rescaled range method

    Args:
        ts: The time series, as 1D numpy array
        min_win: Minimum window to use
        max_win: Maximum window to use

    Return:
        Hurst exponent as a float
    """
    ts = np.array(ts)
    max_win = max_win or len(ts)
    win = np.unique(np.round(np.exp(
        np.linspace(np.log(min_win), np.log(max_win), 10)
    )).astype(int))
    rs_w = []
    for tau in win:
        rs = []
        for start in np.arange(0, len(ts)+1, tau)[:-1]:
            pts = ts[start:start+tau]  # partial time series
            r = np.max(pts) - np.min(pts)  # range
            s = np.sqrt(np.mean(np.diff(pts)**2))  # RMS of increments as standard deviation
            rs.append(r/s)
        rs_w.append(np.mean(rs))
    p = np.polyfit(np.log(win), np.log(rs_w), deg=1)
    return p[0]  # the order-1 coefficient of the fit is the Hurst exponent
</code></pre> <p>The function does not find the rescaled range for every time window $$\tau$$ because that would be too slow for practical use. Instead, it evenly takes 10 points on the log scale from the minimum to the maximum window size. For each $$\tau$$, <code>np.arange(0, len(ts)+1, tau)</code> generates starting points separated by one full window; hence we partition the time series into non-overlapping sequences of length <code>tau</code>, except possibly the last one, where the input time series runs out and which is hence discarded. Then for each partial time series, a range is found, and the root-mean-squared increment is taken as the standard deviation (since we assume the increments have zero mean). For each $$\tau$$, the $$R/S$$ is taken as the mean of all rescaled ranges from the different partial time series.
Then we consider</p> $\log(R/S) = k + H\log(\tau)$ <p>and hence a linear regression (degree-1 polynomial) fitting $$\log(R/S)$$ against $$\log(\tau)$$ will produce the Hurst exponent as the order-1 coefficient.</p> <p>Another method is to use the scaling properties of a fBm:</p> <pre><code class="language-python">def hurst_sp(ts, max_lag=50):
    """Returns the Hurst exponent of the time series using scaling properties"""
    lags = range(2, max_lag)
    ts = np.array(ts)
    stdev = [np.std(ts[tau:]-ts[:-tau]) for tau in lags]
    p = np.polyfit(np.log(lags), np.log(stdev), 1)
    return p[0]  # the order-1 coefficient of the fit is the Hurst exponent
</code></pre> <p>This is much shorter code, but it considers $$B_H(t+\tau)-B_H(t)$$, whose standard deviation is expected to be proportional to $$\tau^H$$. The difference is computed directly across the entire time series and then the standard deviation is computed. Then, as before, we fit a linear equation between the log of the time lag and the log of the standard deviation of the difference, and the Hurst exponent is the order-1 coefficient.</p> <p>It turns out, I found, that the rescaled range method often overestimates the Hurst exponent and the scaling property method sometimes underestimates it. As seen below:</p> <pre><code class="language-python">N = 2500
sigma = 0.15
dt = 1/250.0
bm = np.cumsum(np.random.randn(N)) * sigma / (N*dt)
h1 = hurst_rs(bm)
h2 = hurst_sp(bm)
print(f"Hurst (RS): {h1:.4f}")
print(f"Hurst (scaling): {h2:.4f}")
print(f"Hurst (average): {(h1+h2)/2:.4f}")
</code></pre> <p>This gives</p> <pre><code class="language-text">Hurst (RS): 0.5927
Hurst (scaling): 0.4783
Hurst (average): 0.5355
</code></pre> <h2 id="generating-fractional-brownian-motion">Generating fractional Brownian motion</h2> <p>What if we are given $$H$$ and want to generate a time series? This is more difficult than it seems. The Hurst exponent ranges from 0 to 1, with Brownian motion at $$H=0.5$$.
If $$H&lt;0.5$$, the time series is <em>mean-reverting</em>, and if $$H&gt;0.5$$, the time series is trending, i.e., with long-range dependency (LRD). Another way to understand this is that if $$H&gt;0.5$$, the increments are positively correlated, while if $$H&lt;0.5$$ they are negatively correlated.</p> <p><a href="https://en.wikipedia.org/wiki/Fractional_Brownian_motion">Wikipedia</a> gives a few properties of the fractional Brownian motion:</p> <ul> <li>self-similarity: $$B_H(at) \sim \vert a\vert^H B_H(t)$$</li> <li>stationary increment: $$B_H(t)-B_H(s) = B_H(t-s)$$ in distribution</li> <li>long range dependency: if $$H&gt;0.5$$, we have $$\sum_{k=1}^\infty \mathbb{E}[B_H(1)(B_H(k+1)-B_H(k))] = \infty$$</li> <li>regularity: for any $$\epsilon&gt;0$$, there exists constant $$c$$ such that $$\vert B_H(t) - B_H(s)\vert \le c\vert t-s\vert^{H-\epsilon}$$</li> <li>covariance of $$B_H(s)$$ and $$B_H(t)$$ is $$R(s,t) = \frac12(s^{2H}+t^{2H}-\vert t-s\vert^{2H})$$</li> </ul> <p>We can build a huge covariance matrix based on this (each row and column corresponds to one time sample) and use the Cholesky decomposition method to generate correlated Gaussian samples; applied to the covariance of the increments, the fBm is the running sum of these samples.</p> <p>Another way to generate this is as follows, adapted from MATLAB code:</p> <pre><code class="language-python">def fbm1d(H=0.7, n=4096, T=10):
    """fast one dimensional fractional Brownian motion (FBM) generator

    output is 'W_t' with t in [0,T] using 'n' equally spaced grid points;
    code uses Fast Fourier Transform (FFT) for speed.

    Adapted from http://www.mathworks.com.au/matlabcentral/fileexchange/38935-fractional-brownian-motion-generator

    Args:
        H: Hurst parameter, in [0,1]
        n: number of grid points, will be adjusted to a power of 2 by n:=2**ceil(log2(n))
        T: final time

    Returns:
        W_t and t for the fBm and the time

    Example:
        W, t = fbm1d(H, n, T)
        plt.plot(t, W)

    Reference:
        Kroese, D. P., &amp; Botev, Z. I. (2015). Spatial Process Simulation.
        In Stochastic Geometry, Spatial Statistics and Random Fields (pp. 369-404)
        Springer International Publishing, DOI: 10.1007/978-3-319-10064-7_12
    """
    # sanitation
    assert 0 &lt; H &lt; 1, "Hurst parameter must be between 0 and 1"
    n = int(np.exp2(np.ceil(np.log2(n))))
    r = np.zeros(n+1)
    r[0] = 1
    idx = np.arange(1, n+1)
    r[1:] = 0.5 * ((idx+1)**(2*H) - 2*idx**(2*H) + (idx-1)**(2*H))
    r = np.concatenate([r, r[-2:0:-1]])  # First row of circulant matrix
    lamb = np.fft.fft(r).real/(2*n)  # Eigenvalues
    z = np.random.randn(2*n) + np.random.randn(2*n)*1j
    W = np.fft.fft(np.sqrt(lamb) * z)
    W = n**(-H) * np.cumsum(W[:n].real)  # rescale
    W = T**H * W
    t = np.arange(n)/n * T  # Scale for final time T
    return W, t
</code></pre> <p>The explanation of why this works is in the article referenced above. But we can see the plot as follows:</p> <p><img src="/img/hurst.png" alt="fbm sample paths" /></p> <p>We can see that the lower the Hurst exponent, the more the random walk fluctuates; the higher the Hurst exponent, the smoother it is.</p>Adrian S. Tamrighthandabacus@users.github.comI was introduced to the concept of self-similarity and long-range dependency of a time series from the seminal paper On the Self-Similar Nature of Ethernet Traffic by Leland et al (1995). The Hurst parameter or the Hurst exponent is the key behind all these.QQ-plot and PP-plot2021-07-23T17:43:54-04:002021-07-23T17:43:54-04:00https://www.adrian.idv.hk/qqplot<p>Both QQ-plot and PP-plot are called the probability plot, but they are different. These plots are intended to compare two distributions, usually with at least one of them empirical. It is to graphically tell how well the two distributions fit.</p> <p>Assume the two distributions have the cumulative distribution functions $$F(x)$$ and $$G(x)$$; the PP-plot is to show $$G(x)$$ against $$F(x)$$ for varying $$x$$.
Hence the domain and range of the plot are always from 0 to 1, as we are plotting only the range of the cumulative distribution functions.</p> <p>The QQ-plot, however, plots the inverse cumulative distribution function $$G^{-1}(x)$$ against $$F^{-1}(x)$$ for varying $$x\in[0,1]$$. Therefore the domain and range of the plot are the supports of the cumulative distribution functions $$F(x)$$ and $$G(x)$$. If we consider the data to be empirical, we can see this as a plot of the order statistics of $$G(x)$$ against those of $$F(x)$$.</p> <h2 id="tools">Tools</h2> <p>In Python, it is a surprise that matplotlib does not support making PP-plots or QQ-plots out of the box. However, it should not be difficult to see that we can make use of the order statistics to do the QQ-plot:</p> <pre><code class="language-python">import numpy as np
import matplotlib.pyplot as plt

N = 1000  # number of samples
rv_norm = np.random.randn(N) * 2 + 1  # normal with mean 1 s.d. 2
rv_uni = np.random.rand(N) * 8 - 4  # uniform in [-4,4]
plt.scatter(np.sort(rv_uni), np.sort(rv_norm), alpha=0.2, s=2)
</code></pre> <p><img src="/img/qqplot-01.png" alt="QQplot" /></p> <p>The PP-plot is a bit more complicated. We need to use an interpolation function to achieve it. The idea is that the CDF of an empirical distribution can be constructed using <code>np.sort(rv_norm)</code> and <code>np.linspace(0,1,N)</code>. Then with the other distribution, we can look up the value of the CDF by interpolation using <code>np.interp(x0, x, y)</code>, which returns $$y_0 = f(x_0)$$ from the provided curve $$y=f(x)$$:</p> <pre><code class="language-python">plt.scatter(np.linspace(0,1,N),
            np.interp(np.sort(rv_uni), np.sort(rv_norm), np.linspace(0,1,N)),
            alpha=0.2, s=2)
</code></pre> <p><img src="/img/qqplot-02.png" alt="PPplot" /></p> <p>However, there is a fancier tool in Python to do this.
<code>statsmodels</code> has functions <code>qqplot()</code> and <code>qqplot_2samples()</code> for doing a QQ-plot of one empirical distribution against a theoretical normal distribution, and between two empirical distributions, respectively. But these are just wrappers for the more generic <code>ProbPlot</code> object. For example, this is how we can do the same as above:</p> <pre><code class="language-python">import statsmodels.api as sm

_ = sm.ProbPlot(rv_uni) \
    .qqplot(other=sm.ProbPlot(rv_norm), line="r", alpha=0.2, ms=2, lw=1)
plt.show()

_ = sm.ProbPlot(rv_uni) \
    .ppplot(other=sm.ProbPlot(rv_norm), line="r", alpha=0.2, ms=2, lw=1)
plt.show()
</code></pre> <p>Its output comes with the regression line (<code>line="r"</code>):</p> <p><img src="/img/qqplot-03.png" alt="QQplot from statsmodels" /></p> <p><img src="/img/qqplot-04.png" alt="PPplot from statsmodels" /></p> <h2 id="qq-plot-and-pp-plot-as-eda-tool">QQ-plot and PP-plot as EDA tool</h2> <p>When we get a table of data for the first time, we would like to get some insight from it before processing it further. This is what exploratory data analysis is about. For a multidimensional data set, my favorite is to run a correlogram to see how the data looks, visually:</p> <pre><code class="language-python">import seaborn as sns

sns.pairplot(df_data)
</code></pre> <p><img src="/img/qqplot-05.png" alt="correlogram" /></p> <p>This graph is generated using Seaborn, a wrapper for matplotlib.
We can make the graph prettier, for example, by drawing the regression line between each pair of data and showing the density empirically found by KDE instead of a histogram:</p> <pre><code class="language-python">sns.pairplot(df_data, diag_kind="kde", kind="reg",
             plot_kws={'line_kws': {'color': 'red', 'alpha': 0.2},
                       'scatter_kws': {'alpha': 0.2, 's': 4}},
             )
plt.show()
</code></pre> <p><img src="/img/qqplot-06.png" alt="KDE correlogram" /></p> <p>We call the same Seaborn function <code>pairplot()</code> with <code>kind="reg"</code> (regression) for off-diagonal charts and <code>diag_kind="kde"</code> for on-diagonal charts. This tells you how correlated any two series are, and the distribution of each sample. In this graph, we do not see any particularly strong correlation. So what if the series are independent but similarly distributed? This can be answered by a PP-plot of each pair. Unfortunately, PP-plot and QQ-plot are not supported by Seaborn. Nevertheless, we can add them. Here is the patch file; we need only to modify <code>axisgrid.py</code> and <code>regression.py</code>:</p> <pre><code class="language-diff">diff --git a/seaborn/axisgrid.py b/seaborn/axisgrid.py index ba70553..7a9d836 100644 --- a/seaborn/axisgrid.py +++ b/seaborn/axisgrid.py @@ -1959,6 +1959,7 @@ def pairplot( """ # Avoid circular import from .distributions import histplot, kdeplot + from .regression import qqplot, ppplot # Avoid circular import # Handle deprecations if size is not None: @@ -1992,7 +1993,7 @@ def pairplot( # Add the markers here as PairGrid has figured out how many levels of the # hue variable are needed and we don't want to duplicate that process if markers is not None: - if kind == "reg": + if kind in ["reg", "pp", "qq"]: # Needed until regplot supports style if grid.hue_names is None: n_markers = 1 @@ -2020,6 +2021,10 @@ def pairplot( diag_kws.setdefault("fill", True) diag_kws.setdefault("warn_singular", False) grid.map_diag(kdeplot, **diag_kws) + elif diag_kind == "pp": +
grid.map_diag(ppplot, **diag_kws) + elif diag_kind == "qq": + grid.map_diag(qqplot, **diag_kws) # Maybe plot on the off-diagonals if diag_kind is not None: @@ -2030,6 +2035,10 @@ def pairplot( if kind == "scatter": from .relational import scatterplot # Avoid circular import plotter(scatterplot, **plot_kws) + elif kind == "qq": + plotter(qqplot, **plot_kws) + elif kind == "pp": + plotter(ppplot, **plot_kws) elif kind == "reg": from .regression import regplot # Avoid circular import plotter(regplot, **plot_kws) diff --git a/seaborn/regression.py b/seaborn/regression.py index ce21927..cc366d1 100644 --- a/seaborn/regression.py +++ b/seaborn/regression.py @@ -20,7 +20,7 @@ from .axisgrid import FacetGrid, _facet_docs from ._decorators import _deprecate_positional_args -__all__ = ["lmplot", "regplot", "residplot"] +__all__ = ["lmplot", "regplot", "residplot", "ppplot", "qqplot"] class _LinearPlotter(object): @@ -833,6 +833,91 @@ lmplot.__doc__ = dedent("""\ """).format(**_regression_docs) +@_deprecate_positional_args +def qqplot( + *, + x=None, y=None, + data=None, + x_estimator=None, x_bins=None, x_ci="ci", + scatter=True, fit_reg=True, ci=95, n_boot=1000, units=None, + seed=None, order=1, logistic=False, lowess=False, robust=False, + logx=False, x_partial=None, y_partial=None, + truncate=True, dropna=True, x_jitter=None, y_jitter=None, + label=None, color=None, marker="o", + scatter_kws=None, line_kws=None, ax=None, + legend=None +): + + plotter = _RegressionPlotter(x, y, data, x_estimator, x_bins, x_ci, + scatter, fit_reg, ci, n_boot, units, seed, + order, logistic, lowess, robust, logx, + x_partial, y_partial, truncate, dropna, + x_jitter, y_jitter, color, label) + + # Manipulate input data for plotting + if plotter.x is None: + err = "missing x or y in plot data" + raise ValueError(err) + if plotter.y is None: + # set it to normal distribution scaled according to x + from scipy.stats import norm + plotter.y = norm.ppf(np.linspace(0,1,len(plotter.x)+2)[1:-1]) + 
plotter.y = plotter.y * plotter.x.std() + plotter.x.mean() + plotter.x = np.sort(plotter.x) + plotter.y = np.sort(plotter.y) + + if ax is None: + ax = plt.gca() + + scatter_kws = {} if scatter_kws is None else copy.copy(scatter_kws) + scatter_kws["marker"] = marker + line_kws = {} if line_kws is None else copy.copy(line_kws) + plotter.plot(ax, scatter_kws, line_kws) + return ax + + +@_deprecate_positional_args +def ppplot( + *, + x=None, y=None, + data=None, + x_estimator=None, x_bins=None, x_ci="ci", + scatter=True, fit_reg=True, ci=95, n_boot=1000, units=None, + seed=None, order=1, logistic=False, lowess=False, robust=False, + logx=False, x_partial=None, y_partial=None, + truncate=True, dropna=True, x_jitter=None, y_jitter=None, + label=None, color=None, marker="o", + scatter_kws=None, line_kws=None, ax=None, + legend=None +): + + plotter = _RegressionPlotter(x, y, data, x_estimator, x_bins, x_ci, + scatter, fit_reg, ci, n_boot, units, seed, + order, logistic, lowess, robust, logx, + x_partial, y_partial, truncate, dropna, + x_jitter, y_jitter, color, label) + + # Manipulate input data for plotting + if plotter.x is None: + err = "missing x in plot data" + raise ValueError(err) + if plotter.y is None: + # set it to normal distribution + from scipy.stats import norm + plotter.y = norm.ppf(np.linspace(0,1,len(plotter.x)+2)[1:-1]) + linspace = np.linspace(0,1,len(plotter.x)) + plotter.y = np.interp(np.sort(plotter.x), np.sort(plotter.y), linspace) + plotter.x = linspace + if plotter.fit_reg: + plotter.x_range = (0, 1) + + if ax is None: + ax = plt.gca() + + scatter_kws = {} if scatter_kws is None else copy.copy(scatter_kws) + scatter_kws["marker"] = marker + line_kws = {} if line_kws is None else copy.copy(line_kws) + plotter.plot(ax, scatter_kws, line_kws) + return ax + + @_deprecate_positional_args def regplot( *, </code></pre> <p>The key changes are the functions <code>sns.ppplot()</code> and <code>sns.qqplot()</code> defined in <code>regression.py</code>, which 
are modified from the function <code>regplot()</code>. The function <code>regplot()</code> does a scatter plot, then puts a regression line on top of it. As we saw, PP-plot and QQ-plot are just modified scatter plots. Therefore we manipulate the data in the plotter using <code>np.sort()</code> and <code>np.interp()</code> before invoking its <code>plot()</code> function. At this point, these two functions can compare <em>two</em> empirical distributions. However, in <code>pairplot()</code>, the diagonal charts are handled differently, be it a KDE plot or a histogram plot. We can indeed make the PP-plot and QQ-plot work as single-distribution plots by plotting against a theoretical normal distribution. The way we do it is to generate one if the second distribution (<code>y</code>) is not provided: using the inverse normal CDF function <code>norm.ppf()</code> from scipy, we look up evenly distributed values from 0 to 1 (clipping the two ends, as we know they would be infinite). In the case of the QQ-plot, the result should be scaled according to the input data to match its mean and standard deviation.
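That quantile-generation trick can be sketched on its own (variable names here are illustrative):

```python
import numpy as np
from scipy.stats import norm

x = np.random.randn(500) * 2 + 1  # some empirical sample
# theoretical normal quantiles; clip both endpoints, which would map to infinity
q = norm.ppf(np.linspace(0, 1, len(x) + 2)[1:-1])
q = q * x.std() + x.mean()  # rescale to the sample, as in the QQ-plot case
```

Plotting `np.sort(x)` against `q` then gives a QQ-plot of the sample against the theoretical normal distribution.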
The plot will be as follows:</p> <pre><code class="language-python">sns.pairplot(df_data, diag_kind="pp", kind="pp",
             plot_kws={'line_kws': {'color': 'red', 'alpha': 0.2},
                       'scatter_kws': {'alpha': 0.2, 's': 4}},
             diag_kws={'line_kws': {'color': 'red', 'alpha': 0.2},
                       'scatter_kws': {'alpha': 0.2, 's': 4}})
plt.show()
</code></pre> <p><img src="/img/qqplot-07.png" alt="PPplot from seaborn" /></p> <pre><code class="language-python">sns.pairplot(df_data, diag_kind="qq", kind="qq",
             plot_kws={'line_kws': {'color': 'red', 'alpha': 0.2},
                       'scatter_kws': {'alpha': 0.2, 's': 4}},
             diag_kws={'line_kws': {'color': 'red', 'alpha': 0.2},
                       'scatter_kws': {'alpha': 0.2, 's': 4}})
plt.show()
</code></pre> <p><img src="/img/qqplot-08.png" alt="QQplot from seaborn" /></p> <p>Of course, if you just want one PP-plot, you can use <code>sns.ppplot()</code> directly.</p> <h2 id="better-solution-by-pairgrid">Better solution by PairGrid</h2> <p>The <code>pairplot()</code> function in seaborn is built with the <a href="https://seaborn.pydata.org/generated/seaborn.PairGrid.html"><code>PairGrid</code> object</a>. We can use it directly to make the pair plots, though it would not be a single function call. In return, there is more flexibility. This is an example from the seaborn documentation that shows different charts above and below the diagonal:</p> <pre><code class="language-python"># Fancier correlogram
g = sns.PairGrid(df)
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot)
plt.show()
</code></pre> <p><img src="/img/qqplot-09.png" alt="pairgrid from seaborn" /></p> <p>There is <code>PairGrid.map()</code> for all charts in the grid, <code>map_diag()</code> and <code>map_offdiag()</code> for on- and off-diagonal charts, or <code>map_upper()</code> and <code>map_lower()</code> for the upper- and lower-triangular charts of the grid. We can also provide a custom function to the family of <code>PairGrid.map()</code>.
These functions expect the custom function to take the $$x$$ and $$y$$ data as positional arguments; the exception is <code>map_diag()</code>, which takes only the $$x$$ data. With this, we can define a few functions to be used:</p> <pre><code class="language-python">import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from scipy.stats import norm

def qqplot(x, y=None, **kwargs):
    """Create Q-Q plot, for use with PairGrid.map()

    Args:
        x, y: Passed as positional arguments, will be pandas series
        kwargs: Can be "color" (tuple of 3 floats) and "label" (str),
            or other keyword arguments passed by user
    """
    # Find scatter locations
    _, xs = stats.probplot(x, fit=False)
    if y is None:
        # set it to normal distribution scaled according to x
        ys = norm.ppf(np.linspace(0, 1, len(xs)+2)[1:-1])
        ys = ys * xs.std() + xs.mean()
    else:
        _, ys = stats.probplot(y, fit=False)
    # Find regression line
    p = np.polyfit(xs, ys, 1)
    xr = np.linspace(np.min(xs), np.max(xs), 2)
    yr = np.polyval(p, xr)
    # Plot options
    line_kws = kwargs.pop('line_kws', dict(c="r", ls="--", alpha=0.5))
    scatter_kws = kwargs.pop('scatter_kws', {})
    scatter_kws.update(kwargs)  # all remaining keyword args assumed for scatter
    # Plot
    plt.axline(xy1=(xr[0], yr[0]), xy2=(xr[1], yr[1]), **line_kws)
    sns.scatterplot(x=xs, y=ys, **scatter_kws)

def ppplot(x, y=None, **kwargs):
    """Create P-P plot, for use with PairGrid.map()

    Args:
        x, y: Passed as positional arguments, will be pandas series
        kwargs: Can be "color" (tuple of 3 floats) and "label" (str),
            or other keyword arguments passed by user
    """
    # Find scatter locations
    xs = np.linspace(0, 1, len(x))
    if y is None:
        # set it to normal distribution
        y = norm.ppf(np.linspace(0, 1, len(xs)+2)[1:-1])
    ys = np.interp(np.sort(x), np.sort(y), xs)
    # Find regression line
    p = np.polyfit(xs, ys, 1)
    xr = np.linspace(np.min(xs), np.max(xs), 2)
    yr = np.polyval(p, xr)
    # Plot options
    line_kws = kwargs.pop('line_kws', dict(c="r", ls="--", alpha=0.5))
    scatter_kws = kwargs.pop('scatter_kws', {})
    scatter_kws.update(kwargs)  # all remaining keyword args assumed for scatter
    # Plot
    plt.axline(xy1=(xr[0], yr[0]), xy2=(xr[1], yr[1]), **line_kws)
    sns.scatterplot(x=xs, y=ys, **scatter_kws)

def tukeyplot(x, y, **kwargs):
    """Create Tukey mean-difference plot, for use with PairGrid.map()

    Args:
        x, y: Passed as positional arguments, will be pandas series
        kwargs: Can be "color" (tuple of 3 floats) and "label" (str),
            or other keyword arguments passed by user
    """
    # Find locations
    xs = (x+y)/2
    ys = x-y
    mean = np.mean(ys)
    sd = np.std(ys)
    # Plot options
    line_kws = kwargs.pop('line_kws', dict(ls="--", alpha=0.5))
    scatter_kws = kwargs.pop('scatter_kws', {})
    scatter_kws.update(kwargs)  # all remaining keyword args assumed for scatter
    # Plot the mean and 95% ci lines
    plt.axhline(y=mean, c="b", **line_kws)
    plt.axhline(y=mean+1.96*sd, color="r", **line_kws)
    plt.axhline(y=mean-1.96*sd, color="r", **line_kws)
    sns.scatterplot(x=xs, y=ys, **scatter_kws)
</code></pre> <p>and we can produce a PP-plot with them:</p> <pre><code class="language-python">g = sns.PairGrid(df)
g.map_diag(ppplot, line_kws=dict(c="r", alpha=0.3, ls="--"),
           scatter_kws=dict(alpha=0.5, s=5))
g.map_offdiag(ppplot, scatter_kws=dict(alpha=0.5, s=5))
g.tight_layout()
plt.show()
</code></pre> <p><img src="/img/qqplot-11.png" alt="PPplot using PairGrid" /></p> <p>and a QQ-plot:</p> <pre><code class="language-python">g = sns.PairGrid(df)
g.map_diag(qqplot, scatter_kws=dict(alpha=0.5, s=5))
g.map_offdiag(qqplot, scatter_kws=dict(alpha=0.5, s=5))
g.tight_layout()
plt.show()
</code></pre> <p><img src="/img/qqplot-10.png" alt="QQplot using PairGrid" /></p> <p>and also a Tukey mean-difference plot with a PP-plot on the diagonal, but we need to detach the shared axes for the diagonal plots due to the different scale:</p> <pre><code class="language-python">from itertools import product

g = sns.PairGrid(df)
g.map_offdiag(tukeyplot, scatter_kws=dict(alpha=0.5, s=5))
# Remove shared axis
# only for diagonal
for i, j, k in product(range(len(df.columns)), repeat=3):
    g.axes[i, i].get_shared_x_axes().remove(g.axes[j, k])
    g.axes[i, i].get_shared_y_axes().remove(g.axes[j, k])
g.map_diag(ppplot, line_kws=dict(alpha=0.3))
g.tight_layout()
plt.show()
</code></pre> <p><img src="/img/qqplot-12.png" alt="Tukey mean-difference plot using PairGrid" /></p>Adrian S. Tamrighthandabacus@users.github.comBoth QQ-plot and PP-plot are called the probability plot, but they are different. These plots are intended to compare two distributions, usually at least one of them is empirical. It is to graphically tell how well the two distributions fit.Interpreting linear regression summary from statsmodels2021-07-16T11:54:36-04:002021-07-16T11:54:36-04:00https://www.adrian.idv.hk/statsmodels<p>The Python package statsmodels has OLS functions to fit a linear regression problem. How well the linear regression is fitted, or whether the data fits a linear model, is often a question to be asked. The way to tell is to use some statistics, of which the OLS module produces a few by default in its summary.</p> <p>This is an example of using statsmodels to fit a linear regression:</p> <pre><code class="language-python">import statsmodels.api as sm
import numpy as np
import pandas as pd

X1 = np.random.rand(200)*3.1
X2 = np.random.rand(200)*4.1
X3 = np.random.rand(200)*5.9
X4 = np.random.rand(200)*2.6
X5 = np.random.rand(200)*5.3
Y0 = 0.58*X1 - 0.97*X2 + 0.93*X3 - 2.3
err = np.random.randn(200)
df = pd.DataFrame(dict(X1=X1, X2=X2, X3=X3, X4=X4, X5=X5, Y=Y0+err))
model = sm.OLS(df["Y"], sm.add_constant(df[["X1","X2","X3","X4","X5"]]),
               missing="drop").fit()
print(model.summary2())
</code></pre> <p>We print the summary using the <code>summary2()</code> function instead of <code>summary()</code> because it looks more compact, but the result should be the same.
This is what the above looks like:</p> <pre><code class="language-text">                 Results: Ordinary least squares
=================================================================
Model:              OLS              Adj. R-squared:     0.799
Dependent Variable: Y                AIC:                572.1603
Date:               2021-07-16 11:49 BIC:                591.9502
No. Observations:   200              Log-Likelihood:     -280.08
Df Model:           5                F-statistic:        159.0
Df Residuals:       194              Prob (F-statistic): 1.27e-66
R-squared:          0.804            Scale:              0.99341
-------------------------------------------------------------------
         Coef.   Std.Err.     t     P&gt;|t|    [0.025   0.975]
-------------------------------------------------------------------
const   -2.2590    0.2889   -7.8187  0.0000   -2.8288  -1.6892
X1       0.6440    0.0848    7.5968  0.0000    0.4768   0.8112
X2      -0.9834    0.0595  -16.5186  0.0000   -1.1009  -0.8660
X3       0.8920    0.0445   20.0478  0.0000    0.8043   0.9798
X4      -0.0200    0.0921   -0.2167  0.8287   -0.2015   0.1616
X5      -0.0209    0.0465   -0.4486  0.6542   -0.1126   0.0709
-----------------------------------------------------------------
Omnibus:          0.319    Durbin-Watson:       1.825
Prob(Omnibus):    0.853    Jarque-Bera (JB):    0.471
Skew:             0.030    Prob(JB):            0.790
Kurtosis:         2.770    Condition No.:       22
=================================================================
</code></pre> <p>Showing the names of the dependent and independent variables is supported if the data are provided as a pandas dataframe. We can see that the summary screen above has three sections, and the elements in each are explained as follows:</p> <p>First section: The statistics of the overall linear model. In a linear regression fitting $$y = \beta^T X + \epsilon$$ using $$N$$ data points with $$p$$ regressors and one regressand, with $$\hat{y}_i$$ the value predicted by the model, we have the RSS (residual sum of squares) defined as $$RSS=\sum_i (y_i-\hat{y}_i)^2$$ and the ESS (explained sum of squares) defined as $$ESS = \sum_i (\hat{y}_i - \bar{y})^2$$, and the total sum of squares is $$TSS=ESS+RSS=\sum_i(y_i-\bar{y})^2$$. The items in the first section of the summary are:</p> <ul> <li>No.
Observations: The number of data points $$N$$</li> <li>Df Model: Number of regressors in the model, $$p$$ <ul> <li>statsmodels can take string-typed categorical variables in regression. In that case, one-hot encoding would be used and the number of parameters will be expanded by the number of categories in such variables</li> </ul> </li> <li>Df Residuals: Degrees of freedom of the residuals, equal to $$N-p-1$$</li> <li>R-squared: $$R^2 = 1-\dfrac{RSS}{TSS} = 1-\dfrac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2}$$, the coefficient of determination</li> <li>adjusted R-squared: $$\bar{R}^2 = 1-\dfrac{RSS/df_e}{TSS/df_t}=1-(1-R^2)\dfrac{N-1}{N-p-1}$$ where $$df_t=N-1$$ is the degrees of freedom of the estimate of the population variance of the dependent variable, and $$df_e = N-p-1$$ is the degrees of freedom of the estimate of the underlying population error variance</li> <li>Log-Likelihood: $$\log p(X|\mu,\Sigma)=\sum_{i=1}^N\log\mathcal{N}(e_i|\mu_i,\Sigma_i)$$. Assuming the model is correct, the log of the probability that the set of data is produced by the model</li> <li>AIC: Akaike Information Criterion, $$-2\log L + kp$$ with $$k=2$$. It depends on the log-likelihood $$\log L$$ and estimates the relative distance between the unknown true likelihood and the fitted likelihood. A lower AIC means closer to the truth</li> <li>BIC: Bayesian Information Criterion, $$-2\log L + kp$$ with $$k=\log(N)$$. Based on a Bayesian setup; it measures the posterior probability of a model being true.
A lower BIC means closer to the truth <ul> <li>BIC penalizes model complexity more heavily (usually $$\log N&gt;2$$) than AIC, hence AIC may prefer a bigger model compared to BIC</li> <li>AIC is better in situations where a false negative is more misleading than a false positive; BIC is better in situations where a false positive is more misleading than a false negative</li> </ul> </li> <li>F-statistic and Prob (F-statistic): The F test with the null hypothesis that all the coefficients of the regressors are zero; a low p-value means the model as a whole is significant</li> <li>Scale: The scale factor of the covariance matrix, $$\dfrac{RSS}{N-p}$$</li> </ul> <p>The second section: Coefficients determined by the regression.</p> <ul> <li>Coef: Coefficient determined by OLS regression; it is solved analytically with $$\beta=(X^TX)^{-1}X^Ty$$</li> <li>Std Err: Estimate of the standard deviation of the coefficient, $$\hat\sigma^2_j = \hat\sigma^2[Q_{xx}^{-1}]_{jj}$$ with $$Q_{xx}=X^TX$$ and $$\hat\sigma^2=\dfrac{\epsilon^T\epsilon}{N}$$</li> <li>t: Coef divided by Std Err, i.e., the t statistic, with the null hypothesis that this particular coefficient is zero. It is used as a measurement of whether the coefficient is significant. A coefficient is significant if its magnitude is large with a small standard error <ul> <li>the t statistic with the null hypothesis that the coefficient $$\beta$$ equals $$k$$ is $$t=(\beta-k)/SE$$; here we took $$k=0$$</li> </ul> </li> <li>P&gt;|t|: the p-value of the t test, i.e., the probability of observing such a t statistic if the null hypothesis is true and the variable has no effect on the dependent variable <ul> <li>the degrees of freedom for the t test is $$n-2$$ for $$n$$ the number of observations</li> </ul> </li> <li>0.025 and 0.975: The two boundaries of the 95% confidence interval of the coefficient, approximately the mean value of the coefficient ±2 standard errors</li> </ul> <p>The third section: Normality of the residuals.
Linear regression is built on the assumption that $$\epsilon$$ is normally distributed with zero mean.</p> <ul> <li>Omnibus: D’Agostino’s $$K^2$$ test, based on skew and kurtosis. Perfect normality will produce 0</li> <li>Prob(Omnibus): Probability that the residuals are normally distributed according to the omnibus statistic</li> <li>Skew: Skewness (asymmetry) of the residual, 0 if perfectly symmetric</li> <li>Kurtosis: Peakiness of the residual (concentration around 0); higher kurtosis means heavier tails, i.e., more outliers. A normal distribution gives 3 here</li> <li>Durbin-Watson: Test for autocorrelation in the residuals, or homoscedasticity, i.e., whether the errors are independent of each other and even throughout the data <ul> <li>if the relative error is higher when the data points are higher, then the error is not even</li> <li>the ideal value is around 2, meaning no autocorrelation</li> </ul> </li> <li>Jarque-Bera (JB) and Prob(JB): also a normality test using skewness and kurtosis, an alternative to the omnibus statistic <ul> <li>we expect JB and Omnibus to confirm each other</li> </ul> </li> <li>Condition no.: Measures the sensitivity of the model output to changes in the data <ul> <li>multicollinearity (i.e., two independent variables being linearly related) produces a high condition number</li> </ul> </li> </ul> <p>Knowing what each of these elements measures, we can see how well the model fits. 
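<p>As a sanity check on the AIC and BIC formulas above, we can reproduce them by hand from a reported log-likelihood. The numbers below are read off the three-regressor summary shown later in this post ($$\log L=-280.20$$, $$p=4$$ parameters counting the constant, $$N=200$$); small rounding differences against the printed values are expected since the log-likelihood is rounded:</p>

```python
import math

# Values read off the three-regressor summary shown later in this post
logL = -280.20  # Log-Likelihood
p = 4           # parameters: 3 regressors plus the constant
N = 200         # No. Observations

aic = -2 * logL + 2 * p            # ~568.40, matching AIC: 568.4052
bic = -2 * logL + math.log(N) * p  # ~581.59, matching BIC: 581.5985
print(aic, bic)
```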
Here we try to change the code to give a different summary:</p> <p>If we use fewer regressors in the input, we should see lower AIC and BIC because the omitted regressors were not really involved:</p> <pre><code class="language-python">model = sm.OLS(df["Y"], sm.add_constant(df[["X1","X2","X3"]]), missing="drop").fit() print(model.summary2()) </code></pre> <p>The result is as follows; the AIC and BIC are lowered a bit due to the lower df model (a simpler model), but the $$R^2$$ has not changed:</p> <pre><code class="language-text"> Results: Ordinary least squares ================================================================= Model: OLS Adj. R-squared: 0.801 Dependent Variable: Y AIC: 568.4052 Date: 2021-07-16 11:51 BIC: 581.5985 No. Observations: 200 Log-Likelihood: -280.20 Df Model: 3 F-statistic: 267.3 Df Residuals: 196 Prob (F-statistic): 5.35e-69 R-squared: 0.804 Scale: 0.98447 ------------------------------------------------------------------- Coef. Std.Err. t P&gt;|t| [0.025 0.975] ------------------------------------------------------------------- const -2.3391 0.2294 -10.1962 0.0000 -2.7915 -1.8867 X1 0.6385 0.0836 7.6355 0.0000 0.4735 0.8034 X2 -0.9812 0.0591 -16.6130 0.0000 -1.0977 -0.8647 X3 0.8921 0.0443 20.1416 0.0000 0.8048 0.9795 ----------------------------------------------------------------- Omnibus: 0.378 Durbin-Watson: 1.826 Prob(Omnibus): 0.828 Jarque-Bera (JB): 0.526 Skew: 0.029 Prob(JB): 0.769 Kurtosis: 2.755 Condition No.: 14 ================================================================= </code></pre> <p>Indeed, if we check the p-values of the t tests in the previous output, we can see that they are high for X4 and X5 and the null hypothesis is not rejected, hinting that these two regressors should not be included in the model.</p> <p>If we skew the error by taking its absolute value, the error distribution is no longer normal:</p> <pre><code class="language-python">df = pd.DataFrame(dict(X1=X1, X2=X2, X3=X3, X4=X4, X5=X5, Y=Y0+np.abs(err))) 
model = sm.OLS(df["Y"], sm.add_constant(df[["X1","X2","X3","X4","X5"]]), missing="drop").fit() print(model.summary2()) </code></pre> <p>The result is as follows. We see that the $$R^2$$ is higher (because the range of the error is smaller now), but the normality tests on the residual give low p-values in both the omnibus test and the Jarque-Bera statistic. Hence we conclude that the residual is not normal. This is why the coefficients found deviate from the truth.</p> <pre><code class="language-text"> Results: Ordinary least squares ================================================================== Model: OLS Adj. R-squared: 0.922 Dependent Variable: Y AIC: 359.9204 Date: 2021-07-16 11:52 BIC: 379.7103 No. Observations: 200 Log-Likelihood: -173.96 Df Model: 5 F-statistic: 474.7 Df Residuals: 194 Prob (F-statistic): 1.02e-106 R-squared: 0.924 Scale: 0.34376 -------------------------------------------------------------------- Coef. Std.Err. t P&gt;|t| [0.025 0.975] -------------------------------------------------------------------- const -1.2735 0.1700 -7.4931 0.0000 -1.6087 -0.9383 X1 0.4774 0.0499 9.5733 0.0000 0.3790 0.5757 X2 -1.0152 0.0350 -28.9883 0.0000 -1.0843 -0.9461 X3 0.9284 0.0262 35.4709 0.0000 0.8768 0.9801 X4 -0.0195 0.0542 -0.3606 0.7188 -0.1264 0.0873 X5 0.0183 0.0274 0.6691 0.5042 -0.0357 0.0723 ------------------------------------------------------------------ Omnibus: 21.305 Durbin-Watson: 2.091 Prob(Omnibus): 0.000 Jarque-Bera (JB): 24.991 Skew: 0.854 Prob(JB): 0.000 Kurtosis: 3.291 Condition No.: 22 ================================================================== </code></pre> <p>If we introduce multicollinearity, statsmodels will produce a vastly large condition number and warn us about the result:</p> <pre><code class="language-python">df = pd.DataFrame(dict(X1=X1, X2=X2, X3=X3, X4=X2-2*X3, X5=X1+0.5*X2, Y=Y0+(Y0**2)*err)) model = sm.OLS(df["Y"], sm.add_constant(df[["X1","X2","X3","X4","X5"]]), missing="drop").fit() print(model.summary2()) </code></pre> 
<p>With the result as follows, we can see that all coefficients are significant according to the p-values of the t tests, but indeed only the first 3 regressors are independent. The condition number suggests that this set of coefficients is not stable.</p> <pre><code class="language-text"> Results: Ordinary least squares ================================================================= Model: OLS Adj. R-squared: 0.801 Dependent Variable: Y AIC: 568.4052 Date: 2021-07-16 13:07 BIC: 581.5985 No. Observations: 200 Log-Likelihood: -280.20 Df Model: 3 F-statistic: 267.3 Df Residuals: 196 Prob (F-statistic): 5.35e-69 R-squared: 0.804 Scale: 0.98447 ------------------------------------------------------------------- Coef. Std.Err. t P&gt;|t| [0.025 0.975] ------------------------------------------------------------------- const -2.3391 0.2294 -10.1962 0.0000 -2.7915 -1.8867 X1 0.4671 0.0473 9.8842 0.0000 0.3739 0.5603 X2 -0.5917 0.0498 -11.8909 0.0000 -0.6898 -0.4935 X3 -0.0582 0.0243 -2.3936 0.0176 -0.1062 -0.0103 X4 -0.4752 0.0172 -27.6363 0.0000 -0.5091 -0.4413 X5 0.1713 0.0396 4.3213 0.0000 0.0931 0.2495 ----------------------------------------------------------------- Omnibus: 0.378 Durbin-Watson: 1.826 Prob(Omnibus): 0.828 Jarque-Bera (JB): 0.526 Skew: 0.029 Prob(JB): 0.769 Kurtosis: 2.755 Condition No.: 24475138936904036 ================================================================= * The condition number is large (2e+16). This might indicate strong multicollinearity or other numerical problems. 
</code></pre> <p>We can also create heteroscedasticity by making the residual larger when the regressand is small:</p> <pre><code class="language-python">df = pd.DataFrame(dict(X1=X1, X2=X2, X3=X3, X4=X4, X5=X5, Y=Y0+err/Y0)) model = sm.OLS(df["Y"], sm.add_constant(df[["X1","X2","X3","X4","X5"]]), missing="drop").fit() print(model.summary2()) </code></pre> <p>The result is as follows. We can see the Durbin-Watson statistic is larger than 2, and the residual is not normally distributed either:</p> <pre><code class="language-text"> Results: Ordinary least squares ================================================================== Model: OLS Adj. R-squared: 0.074 Dependent Variable: Y AIC: 1330.7666 Date: 2021-07-16 13:16 BIC: 1350.5565 No. Observations: 200 Log-Likelihood: -659.38 Df Model: 5 F-statistic: 4.177 Df Residuals: 194 Prob (F-statistic): 0.00126 R-squared: 0.097 Scale: 44.098 -------------------------------------------------------------------- Coef. Std.Err. t P&gt;|t| [0.025 0.975] -------------------------------------------------------------------- const -1.5268 1.9250 -0.7932 0.4287 -5.3235 2.2698 X1 1.2981 0.5648 2.2983 0.0226 0.1841 2.4120 X2 -1.0072 0.3967 -2.5393 0.0119 -1.7896 -0.2249 X3 0.7941 0.2965 2.6786 0.0080 0.2094 1.3788 X4 -0.3668 0.6134 -0.5979 0.5506 -1.5766 0.8431 X5 -0.2874 0.3100 -0.9271 0.3550 -0.8987 0.3240 ------------------------------------------------------------------ Omnibus: 147.586 Durbin-Watson: 2.232 Prob(Omnibus): 0.000 Jarque-Bera (JB): 9060.224 Skew: 2.033 Prob(JB): 0.000 Kurtosis: 35.721 Condition No.: 22 ================================================================== </code></pre> <p>We can also do a nonlinear model:</p> <pre><code class="language-python">Y0 = 0.58*X1 - 0.97*X2 + 0.93*X3**2 - 2.3 df = pd.DataFrame(dict(X1=X1, X2=X2, X3=X3, X4=X4, X5=X5, Y=Y0+err)) model = sm.OLS(df["Y"], sm.add_constant(df[["X1","X2","X3","X4","X5"]]), missing="drop").fit() print(model.summary2()) </code></pre> <p>which 
is the model with X3 squared. The result is as follows. Because of the nonlinear model, the residual is no longer normally distributed. The $$R^2$$ here is larger than before; hence we should be cautious not to select a model merely based on the coefficient of determination.</p> <pre><code class="language-text"> Results: Ordinary least squares ================================================================== Model: OLS Adj. R-squared: 0.930 Dependent Variable: Y AIC: 926.7164 Date: 2021-07-16 13:31 BIC: 946.5063 No. Observations: 200 Log-Likelihood: -457.36 Df Model: 5 F-statistic: 532.4 Df Residuals: 194 Prob (F-statistic): 3.37e-111 R-squared: 0.932 Scale: 5.8484 -------------------------------------------------------------------- Coef. Std.Err. t P&gt;|t| [0.025 0.975] -------------------------------------------------------------------- const -7.9247 0.7010 -11.3043 0.0000 -9.3074 -6.5421 X1 0.5560 0.2057 2.7031 0.0075 0.1503 0.9616 X2 -1.0398 0.1445 -7.1978 0.0000 -1.3247 -0.7549 X3 5.4317 0.1080 50.3107 0.0000 5.2187 5.6446 X4 0.2395 0.2234 1.0720 0.2850 -0.2011 0.6801 X5 -0.0700 0.1129 -0.6198 0.5361 -0.2926 0.1527 ------------------------------------------------------------------ Omnibus: 12.714 Durbin-Watson: 1.895 Prob(Omnibus): 0.002 Jarque-Bera (JB): 13.907 Skew: 0.631 Prob(JB): 0.001 Kurtosis: 2.727 Condition No.: 22 ================================================================== </code></pre> <p>Finally, we can try to use the error as the regressand and see that the F statistic becomes low (or its p-value becomes high):</p> <pre><code class="language-python">df = pd.DataFrame(dict(X1=X1, X2=X2, X3=X3, X4=X4, X5=X5, Y=err)) model = sm.OLS(df["Y"], sm.add_constant(df[["X1","X2","X3","X4","X5"]]), missing="drop").fit() print(model.summary2()) </code></pre> <p>result:</p> <pre><code class="language-text"> Results: Ordinary least squares ================================================================= Model: OLS Adj. 
R-squared: -0.018 Dependent Variable: Y AIC: 572.1603 Date: 2021-07-16 13:36 BIC: 591.9502 No. Observations: 200 Log-Likelihood: -280.08 Df Model: 5 F-statistic: 0.2807 Df Residuals: 194 Prob (F-statistic): 0.923 R-squared: 0.007 Scale: 0.99341 ------------------------------------------------------------------- Coef. Std.Err. t P&gt;|t| [0.025 0.975] ------------------------------------------------------------------- const 0.0410 0.2889 0.1419 0.8873 -0.5288 0.6108 X1 0.0640 0.0848 0.7547 0.4513 -0.1032 0.2312 X2 -0.0134 0.0595 -0.2257 0.8217 -0.1309 0.1040 X3 -0.0380 0.0445 -0.8531 0.3947 -0.1257 0.0498 X4 -0.0200 0.0921 -0.2167 0.8287 -0.2015 0.1616 X5 -0.0209 0.0465 -0.4486 0.6542 -0.1126 0.0709 ----------------------------------------------------------------- Omnibus: 0.319 Durbin-Watson: 1.825 Prob(Omnibus): 0.853 Jarque-Bera (JB): 0.471 Skew: 0.030 Prob(JB): 0.790 Kurtosis: 2.770 Condition No.: 22 ================================================================= </code></pre>Adrian S. Tamrighthandabacus@users.github.comThe python package statsmodels has OLS functions to fit a linear regression problem. How well the linear regression is fitted, or whether the data fits a linear model, is often a question to be asked. The way to tell is to use some statistics, which by default the OLS module produces a few in its summary.Bokeh, interactive widgets, and jupyterlab2021-07-13T21:37:24-04:002021-07-13T21:37:24-04:00https://www.adrian.idv.hk/jupyter<p>Jupyter notebooks and visualization are a natural marriage. It is more fun if we can skew this or that a bit by turning a knob or selecting something from a drop down. This is where so-called <em>interactive widgets</em> come into play. There are a lot of examples on how to set up a widget and control the matplotlib chart interactively. 
Doing so in jupyterlab, however, is not so straightforward.</p> <h2 id="matplotlib-and-the-widgets">matplotlib and the widgets</h2> <p>Jupyter notebook widgets are just some control elements for user interaction. They receive user input and trigger events, which then can invoke some function. To use widgets to control matplotlib graphics, we have to understand what the matplotlib backends are.</p> <pre><code class="language-text">%matplotlib --list </code></pre> <p>This, if run in jupyter, will list out all backends. In my case,</p> <pre><code class="language-text">Available matplotlib backends: ['tk', 'gtk', 'gtk3', 'wx', 'qt4', 'qt5', 'qt', 'osx', 'nbagg', 'notebook', 'agg', 'svg', 'pdf', 'ps', 'inline', 'ipympl', 'widget'] </code></pre> <p>Amongst them, the <code>inline</code> backend is the dumbest: it just renders the plot and makes it read-only. Therefore, no update is allowed on the chart, but you can always clear it and redraw. The <code>notebook</code> backend makes the matplotlib output aware of the Jupyter environment and the charts can be updated. The <code>widget</code> and <code>ipympl</code> backends are similar to <code>notebook</code>, but fancier. They turn the matplotlib output into a widget that you can pan or zoom. Using matplotlib with different backends requires the interactive widgets to be configured differently.</p> <p>The widgets on jupyter come from the module <a href="https://ipywidgets.readthedocs.io/en/latest/">ipywidgets</a>. The simplest example (without graphics!) 
is as follows:</p> <pre><code class="language-python">from ipywidgets import interact def f(x): return x**2 interact(f, x=10.0); </code></pre> <p>Running this in a jupyter notebook will give you a slider and a row of text (for printing the output of the function):</p> <p><img src="/img/jupyter-01.png" alt="" /></p> <p>An equivalent way of doing the above is the following snippet, which uses <code>interact()</code> as a decorator:</p> <pre><code class="language-python">from ipywidgets import interact, widgets @interact(x=widgets.FloatSlider(min=-10, max=30, step=0.1, value=10)) def f(x): return x**2 </code></pre> <p>The <code>interact()</code> decorator accepts keyword arguments that match those of the function. You may create the widget explicitly and assign it to the keyword argument, or in a short form, you can also simply provide a value and let <code>interact()</code> infer the widget. If the argument is:</p> <ul> <li>a boolean (<code>True</code> or <code>False</code>): a checkbox widget is provided (<code>widgets.Checkbox</code>)</li> <li>an integer or a float: a slider widget is provided (<code>widgets.IntSlider</code> or <code>widgets.FloatSlider</code>)</li> <li>a string: a textbox widget is provided (<code>widgets.Text</code>)</li> <li>a list of strings: a dropdown widget is provided (<code>widgets.Dropdown</code>)</li> </ul> <p>The full list of available widgets and their configuration can be found in the <a href="https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20List.html">ipywidget documentation</a>.</p> <p>The way to connect the ipywidgets to matplotlib is as follows.</p> <p>Let us try to plot a sine curve with different angular frequency, phase, and amplitude. 
If we use the <code>inline</code> backend, this is the code:</p> <pre><code class="language-python">import numpy as np import matplotlib.pyplot as plt from ipywidgets import interact %matplotlib inline x = np.linspace(0, 5*np.pi, 500) @interact(w=(0, 10, .1), amp=(-4, 4, .1), phi=(0, 2*np.pi, 0.1)) def plot(w=1.0, amp=1, phi=0): y = amp*np.sin(w*x-phi) plt.plot(x,y) plt.ylim([-4,4]) plt.show() </code></pre> <p><img src="/img/jupyter-02.png" alt="" /></p> <p>The function <code>plot()</code> uses keyword arguments that match those of the <code>interact()</code> call. The tuple notations are just another way to specify a slider (in terms of min, max, and step). In the function, it simply replots the figure every time using the values provided by the sliders. This works in the <code>inline</code> backend because all pictures are static.</p> <p>If we use the <code>widget</code> backend or <code>ipympl</code> backend instead, we do this:</p> <pre><code class="language-python">%matplotlib widget fig, ax = plt.subplots(figsize=(6, 4)) ax.set_ylim([-4, 4]) ax.grid(True) # fix x values x = np.linspace(0, 5*np.pi, 500) ax.scatter(x[::20], np.cos(x)[::20], color='r', alpha=0.5) @interact(w=(0, 10, .1), amp=(-4, 4, .1), phi=(0, 2*np.pi, 0.1)) def update(w=1.0, amp=1, phi=0): """Remove old lines from plot and plot new one""" for l in list(ax.lines): l.remove() ax.plot(x, amp*np.sin(w*x-phi), color='C0') </code></pre> <p><img src="/img/jupyter-03.png" alt="" /></p> <p>This is different from the previous in the sense that we do not call <code>plt.show()</code> but simply remove the plot line and draw a new one. Note that we iterate over a copy of <code>ax.lines</code>, since removing elements mutates the list. Also note that when we do the removal, the scatter plot is not removed because it is not part of <code>ax.lines</code>, and we do not need to redraw other elements of the plot. 
The figure above also shows the icons on the left, which are part of the <code>widget</code> or <code>ipympl</code> backends that allow us to pan and zoom.</p> <p>There is yet another way to do the same, in which the line is not even removed but simply updated:</p> <pre><code class="language-python">%matplotlib notebook fig, ax = plt.subplots(figsize=(6, 4)) ax.set_ylim([-4, 4]) ax.grid(True) # fix x values, and create line plot object x = np.linspace(0, 5*np.pi, 500) line, = ax.plot(x,np.sin(x)) @interact(w=(0, 10, .1), amp=(-4, 4, .1), phi=(0, 2*np.pi, 0.1)) def plot(w=1.0, amp=1, phi=0): line.set_ydata(amp*np.sin(w*x-phi)) #fig.canvas.draw() </code></pre> <p><img src="/img/jupyter-04.png" alt="" /></p> <p>This code works only in jupyter notebook but not jupyterlab, since the <code>notebook</code> backend is used. What it does is create a line as an object from <code>ax.plot()</code>; then, when the widgets are updated, the data of the line are updated using <code>line.set_ydata()</code>. Examples elsewhere usually invoke <code>fig.canvas.draw()</code> after the <code>set_ydata()</code> call so the changes are applied, but I found that unnecessary.</p> <p>If we use seaborn, the code is mostly the same since it is just a wrapper to matplotlib. 
The exception is the line object in the <code>notebook</code> backend example above, as <code>seaborn.lineplot()</code> returns the axes object, not the line object.</p> <h2 id="bokeh-and-ipywidget">Bokeh and ipywidget</h2> <p>Here is a similar example in Bokeh:</p> <pre><code class="language-python">from bokeh.io import output_notebook, push_notebook output_notebook() from bokeh.layouts import column, row from bokeh.models import Slider, Span, Range1d from bokeh.plotting import figure, show from bokeh.palettes import cividis from ipywidgets import interact, interactive, widgets plot = figure(plot_width=800, plot_height=400) x = np.linspace(0, 5*np.pi, 500) color = cividis(5) sine = plot.line(x, np.sin(x), line_width=1, alpha=0.8, line_color=color[0], legend_label="sin") cosine = plot.line(x, np.cos(x), line_width=1, alpha=0.8, line_color=color[1], legend_label="cos") vline = Span(location=0, dimension="height", line_color=color[2], line_width=3, line_alpha=0.5) hline = Span(location=0, dimension="width", line_color=color[3], line_width=3, line_alpha=0.5) plot.add_layout(vline) plot.add_layout(hline) plot.title.text = "Sine and cosine" plot.legend.click_policy = "hide" plot.legend.location = "top_left" plot.xaxis.axis_label = "x" plot.yaxis.axis_label = "y" plot.y_range = Range1d(-4, 4) handle = show(plot, notebook_handle=True) # Slider: Using ipython widgets slider instead of Bokeh slider @interact(w=widgets.FloatSlider(min=-10, max=10, value=1), amp=widgets.FloatSlider(min=-5, max=5, value=1), phi=widgets.FloatSlider(min=-4, max=4, value=0)) def update(w=1.0, amp=1, phi=0): sine.data_source.data["y"] = amp*np.sin(w*x-phi) cosine.data_source.data["y"] = amp*np.cos(w*x-phi) vline.location = phi hline.location = amp*np.sin(-phi) push_notebook(handle=handle) </code></pre> <p><img src="/img/jupyter-05.png" alt="" /></p> <p>Note that <code>cividis(5)</code> returns a palette of five colors, so each glyph is given one color from it. The logic is similar to the case of the <code>notebook</code> backend for matplotlib, but this works in both jupyter notebook and jupyterlab. 
Bokeh allows changing the data of the data source, but the x and y dimensions must be consistent. If we change the curve entirely, we can either use <code>data_source.data.update(x=x, y=y)</code> to do the update in one shot, or reassign the data with <code>data_source.data = newdata</code>, where <code>newdata</code> can be a python dictionary. What is necessary for using Bokeh interactively is:</p> <ul> <li>after we set up the figure, we show it with <code>show(plot, notebook_handle=True)</code> and remember the handle</li> <li>in the update function, after we update the data, we need to invoke <code>push_notebook(handle=handle)</code> to refresh the figure pointed to by the handle</li> </ul> <p>The handle is not necessarily for one figure. Other widgets or multiple figures can be shown using the same notebook handle. The <code>push_notebook()</code> call makes the handle refresh itself as some underlying data is known to have changed.</p> <p>Bokeh indeed comes with its own slider widget, but it will not work in the notebook because it is purely JavaScript. Unless we can do the interactive update in JavaScript (e.g., all data are loaded, and the updated value can be computed using JavaScript), it will not get the job done. The other use of Bokeh widgets is when we have a Bokeh server, where the widgets get the data updated via a web request. 
If we use the Bokeh slider anyway, we will get a warning message:</p> <pre><code class="language-python">def update(w=1.0, amp=1, phi=0): sine.data_source.data["y"] = amp*np.sin(w*x-phi) cosine.data_source.data["y"] = amp*np.cos(w*x-phi) vline.location = phi hline.location = amp*np.sin(-phi) push_notebook(handle=handle) # Bokeh sliders slider_w = Slider(start=-10, end=10, value=1, step=0.1, title="frequency") slider_amp = Slider(start=-5, end=5, value=1, step=0.1, title="amplitude") slider_phi = Slider(start=-4, end=4, value=0, step=0.1, title="phase") def slider_change(attr, old, new): update(slider_w.value, slider_amp.value, slider_phi.value) slider_w.on_change('value', slider_change) slider_amp.on_change('value', slider_change) slider_phi.on_change('value', slider_change) handle = show(column(plot, slider_w, slider_amp, slider_phi), notebook_handle=True) </code></pre> <p>This will be shown in the notebook, but the sliders will do nothing:</p> <pre><code class="language-text">WARNING:bokeh.embed.util: You are generating standalone HTML/JS output, but trying to use real Python callbacks (i.e. with on_change or on_event). This combination cannot work. Only JavaScript callbacks may be used with standalone output. For more information on JavaScript callbacks with Bokeh, see: https://docs.bokeh.org/en/latest/docs/user_guide/interaction/callbacks.html Alternatively, to use real Python callbacks, a Bokeh server application may be used. For more information on building and running Bokeh applications, see: https://docs.bokeh.org/en/latest/docs/user_guide/server.html </code></pre> <h2 id="jupyterlab">Jupyterlab</h2> <p>Because of the different design, jupyter notebook is way easier than jupyterlab to set up for interactive widgets. Your installation should include <code>ipywidgets</code> and <code>widgetsnbextension</code> (the latter should be automatically installed by the former). 
To get the ipywidgets working in jupyterlab, after these python modules are installed, you still need to install node.js (<code>brew install nodejs</code>) and then run the following command:</p> <pre><code>jupyter labextension install @jupyter-widgets/jupyterlab-manager </code></pre> <p>After this, a restart of jupyterlab will make it work.</p>Adrian S. Tamrighthandabacus@users.github.comJupyter notebooks and visualization are natural marriage. It is more fun if we can skew this or that a bit by turning a knob or selecting something from a drop down. This is where so called interactive widgets come to play. There are a lot of examples on how to set up a widget and control the matplotlib chart interactively. Doing so in jupyterlab, however, is not so straightforward.Lagrangians and Portfolio Optimization2021-06-22T12:04:14-04:002021-06-22T12:04:14-04:00https://www.adrian.idv.hk/kkt<p>A portfolio optimization problem in Markowitz style looks like the following</p> \begin{aligned} \min &amp;&amp; f(w) &amp;= \frac12 w^T\Sigma w\\ \textrm{subject to} &amp;&amp; w^Tr &amp;= R \\ &amp;&amp; w^T e &amp;= 1 \\ &amp;&amp; w &amp; \succeq b_L \\ &amp;&amp; w &amp; \preceq b_U \end{aligned} <p>where the last two constraints bound the weight of each asset in the portfolio. This is a nicely formulated optimization problem, and one way to analytically solve it is to use Lagrange multipliers.</p> <h2 id="shadow-price">Shadow price</h2> <p>Assuming we do not have the last two inequality constraints, the Lagrangian for the above problem would be</p> $L(w,\lambda) = \frac12w^T\Sigma w - \lambda_1(w^Tr-R) - \lambda_2(w^Te-1)$ <p>The Lagrangian has the property that for the optimal solution $$w^*$$ to the original problem, $$L(w^*,\lambda) = f(w^*)$$, namely the Lagrangian function attains the same value as the objective function. 
This is trivial as we know that the optimizer must satisfy the equality constraints, and hence the two extra terms in the Lagrangian always reduce to zero.</p> <p>Mathematically, we could also define the Lagrangian with the two Lagrange multiplier terms added to the objective function instead of subtracted. Doing so merely flips the signs of the solved $$\lambda_1$$ and $$\lambda_2$$ above. But if we consider that</p> $\frac{\partial L(w^*,\lambda)}{\partial R} = \lambda_1$ <p>we see that the subtraction convention bears a physical meaning, i.e., $$\lambda_1$$ indicates how much the objective function changes if we marginally increase the boundary value $$R$$ on the constraint. In this particular equality constraint, we are pushing the expected portfolio return $$R$$ to a higher level and $$\lambda_1$$ is the amount of variance increased. Hence the Lagrange multiplier $$\lambda_1$$ is called the <em>shadow price</em> for the return $$R$$.</p> <h2 id="inequality-constraints-and-activeness">Inequality constraints and activeness</h2> <p>A similar Lagrangian can be created when there are inequality constraints, but their Lagrange multipliers are no longer arbitrary in sign:</p> $L(w, \lambda, \theta, \phi) =\frac12 w^T\Sigma w -\lambda_1(w^Tr-R) - \lambda_2(w^Te-1) -\theta^T(w-b_L) + \phi^T(w-b_U)$ <p>The way to think about what sign a Lagrange multiplier should carry is to consider the dual. As we are doing a minimization here, the dual is a maximization problem, namely,</p> $g(\lambda,\theta,\phi) = \inf_w L(w,\lambda,\theta,\phi)$ <p>and according to the max-min inequality we have the weak-duality property</p> $\sup_{\lambda,\theta,\phi}\inf_w L(w,\lambda,\theta,\phi) \le \inf_w \sup_{\lambda,\theta,\phi}L(w,\lambda,\theta,\phi)$ <p>and the equality holds if we have strong duality. The RHS is the solution to the optimization problem and the LHS is the dual problem. 
Therefore the dual must be no greater than the optimal solution of the original problem</p> $g(\lambda,\theta,\phi) \le \inf_w\sup_{\lambda,\theta,\phi} L(w,\lambda,\theta,\phi)$ <p>If we consider the Lagrange multipliers associated with inequality constraints $$\theta$$ and $$\phi$$ to be positive (there is no restriction for equality constraints), we must augment the objective function into $$L(w,\lambda,\theta,\phi)$$ with negative values. Hence for $$w-b_L\succeq 0$$, we augment it with $$-\theta^T(w-b_L)$$, and for $$w-b_U\preceq 0$$, we augment it with $$+\phi^T(w-b_U)$$.</p> <p>Why do we need it this way? Let us denote the feasible domain as $$\mathcal{D}$$ and the optimal solution to the problem as $$w^*\in\mathcal{D}$$. An inequality constraint at $$w^*$$ either holds with equality (we say the constraint is <em>active</em>) or not (<em>inactive</em>). A constraint is inactive iff its removal does not change the optimal solution. The boundary of $$\mathcal{D}$$ is defined by the constraints as if they are active (the equality constraints can be assumed always active).</p> <p>The solution $$w^*$$ is a point on this boundary. As we are studying a minimization problem, $$f(w^*)$$ is increasing into $$\mathcal{D}$$ and decreasing away from $$\mathcal{D}$$. Similarly, if $$w-b_L\succeq 0$$ is active, $$w^*-b_L=0$$ and it is increasing into $$\mathcal{D}$$ and decreasing away from it (and similarly for $$w-b_U\preceq 0$$). 
In summary, we have</p> <table> <thead> <tr> <th> </th> <th>$$w^*+\delta\in\mathcal{D}$$</th> <th>$$w^*+\delta \notin\mathcal{D}$$</th> </tr> </thead> <tbody> <tr> <td>$$f(w^*)$$</td> <td>$$f(w^*+\delta)\ge f(w^*)$$</td> <td>$$f(w^*+\delta)\le f(w^*)$$</td> </tr> <tr> <td>$$w^*-b_L = 0$$</td> <td>$$w^*+\delta-b_L\succeq 0$$</td> <td>$$w^*+\delta-b_L\preceq 0$$</td> </tr> <tr> <td>$$w^*-b_U = 0$$</td> <td>$$w^*+\delta-b_U\preceq 0$$</td> <td>$$w^*+\delta-b_U\succeq 0$$</td> </tr> </tbody> </table> <p>and we need to make $$L(w^*+\delta,\lambda,\theta,\phi)\ge L(w^*,\lambda,\theta,\phi)$$ for $$w^*+\delta\in\mathcal{D}$$ so that we can find the optimal solution $$w^* = \arg\min L(w,\lambda,\theta,\phi)$$ as a saddle point.</p> <h2 id="karush-kuhn-tucker-conditions">Karush-Kuhn-Tucker conditions</h2> <p>The KKT conditions state that</p> <ol> <li>$$\nabla L(w^*,\lambda,\theta,\phi)=0$$ at the optimal solution $$w^*$$</li> <li>Primal constraints are satisfied for $$w^*$$</li> <li>Dual constraints $$\theta\ge 0$$ and $$\phi\ge 0$$ are satisfied, i.e. the Lagrange multipliers for inequality constraints are non-negative</li> <li>Complementary slackness: $$\theta\odot(w^*-b_L)=0$$ and $$\phi\odot(w^*-b_U)=0$$, i.e., the Lagrange multiplier will be zero if the corresponding inequality constraint is inactive</li> </ol> <h2 id="solution">Solution</h2> <p>We can use the KKT conditions to solve for the above optimization problem. 
Since</p> $L(w, \lambda, \theta, \phi) =\frac12 w^T\Sigma w -\lambda_1(w^Tr-R) - \lambda_2(w^Te-1) -\theta^T(w-b_L) + \phi^T(w-b_U)$ <p>The first condition states that</p> $\nabla_w L(w, \lambda, \theta, \phi) = \Sigma w -\lambda_1r - \lambda_2 e -\theta + \phi = 0$ <p>the second condition states that</p> \begin{aligned} w^Tr - R = -\nabla_{\lambda_1} L(w, \lambda, \theta, \phi) &amp;=0 \\ w^Te - 1 = -\nabla_{\lambda_2} L(w, \lambda, \theta, \phi) &amp;=0 \\ w - b_L &amp; \succeq 0 \\ w - b_U &amp; \preceq 0 \end{aligned} <p>the third condition states that</p> $\theta \ge 0;\qquad\phi \ge 0$ <p>and the fourth condition states that</p> $\theta\odot(w-b_L)=0;\qquad\phi\odot(w-b_U)=0.$ <p>Assuming $$w$$ is a vector of $$n$$ elements, we have $$3n+2$$ unknowns ($$w,\theta,\phi$$ are $$n$$-vectors and $$\lambda$$ is a 2-vector), $$n+2+0+2n=3n+2$$ equalities from the four conditions, and $$0+2n+2n+0=4n$$ inequalities. This should be sufficient to provide a solution, but note that the equations from the fourth condition are nonlinear as they include $$\theta\odot w$$ and $$\phi\odot w$$ terms. To make it a system of linear equations, we can consider various combinations of activeness of the inequality constraints to simplify it. 
It would be tremendously easier if none of the inequality constraints are active (e.g., when $$b_L=-\infty$$ and $$b_U=\infty$$, in which case $$\theta=\phi=\mathbf{0}$$ for sure by complementary slackness); in this case we have</p> \begin{aligned} \Sigma w - \lambda_1r-\lambda_2e &amp;=0 \\ w &amp;= \Sigma^{-1}(\lambda_1r+\lambda_2e) \\ &amp;= \lambda_1\Sigma^{-1}r+\lambda_2\Sigma^{-1}e \end{aligned} <p>Substituting into the second condition:</p> \begin{aligned} w^Tr - R &amp;= \lambda_1 r^T\Sigma^{-1}r + \lambda_2 e^T\Sigma^{-1}r - R = 0 \\ w^Te - 1 &amp;= \lambda_1r^T\Sigma^{-1}e+\lambda_2e^T\Sigma^{-1}e - 1 = 0 \end{aligned} <p>therefore</p> \begin{aligned} \begin{bmatrix}r^T\Sigma^{-1}r &amp; e^T\Sigma^{-1}r\\ r^T\Sigma^{-1}e &amp; e^T\Sigma^{-1}e\end{bmatrix} \begin{bmatrix}\lambda_1\\ \lambda_2\end{bmatrix} &amp;= \begin{bmatrix}R\\ 1\end{bmatrix} \\ \begin{bmatrix}\lambda_1\\ \lambda_2\end{bmatrix} &amp;= \begin{bmatrix}r^T\Sigma^{-1}r &amp; e^T\Sigma^{-1}r\\ r^T\Sigma^{-1}e &amp; e^T\Sigma^{-1}e\end{bmatrix}^{-1}\begin{bmatrix}R\\ 1\end{bmatrix} \end{aligned} <p>and substitute back into the above for $$w^*$$. But the solution under this condition must not violate the second condition, namely $$b_L \preceq w^* \preceq b_U$$. In fact we can also solve for both $$w$$ and $$\lambda$$ together in a matrix equation:</p> \begin{aligned} \Sigma w -\lambda_1 r - \lambda_2 e &amp;= 0 \\ w^Tr - R &amp;=0 \\ w^Te - 1 &amp;=0 \\ \implies\quad \begin{bmatrix}\Sigma &amp; r &amp; e\\ r^T &amp; 0 &amp; 0\\ e^T &amp; 0 &amp; 0\end{bmatrix} \begin{bmatrix}w\\ -\lambda_1\\ -\lambda_2\end{bmatrix} &amp;= \begin{bmatrix}0\\ R\\ 1\end{bmatrix} \\ \therefore\quad \begin{bmatrix}w\\ -\lambda_1\\ -\lambda_2\end{bmatrix} &amp;= \begin{bmatrix}\Sigma &amp; r &amp; e\\ r^T &amp; 0 &amp; 0\\ e^T &amp; 0 &amp; 0\end{bmatrix}^{-1}\begin{bmatrix}0\\ R\\ 1\end{bmatrix}.
\end{aligned} <p>But the essence of using the Karush-Kuhn-Tucker conditions to solve an optimization problem with inequality constraints is to make it combinatorial. Assume $$b_L\prec b_U$$ and both are in some reasonable finite range (e.g. $$b_L=\mathbf{0}$$ and $$b_U=e$$); to solve this we need to test all combinations of activeness of the inequality constraints. Above, we have $$2n$$ inequalities from the second KKT condition, hence $$2^{2n}$$ combinations of activeness. When an inequality is active, its equality holds and the corresponding Lagrange multiplier can be non-zero. Hence a new system of linear equations is created and we can solve for $$w$$ and the other Lagrange multipliers, but we need to validate that the solution does not violate the KKT conditions, especially $$b_L \preceq w \preceq b_U$$, and check the objective function. For example, if all inequality constraints were active (possible only when $$b_L=b_U$$, since with $$b_L\prec b_U$$ the lowerbound and upperbound on the same weight can never be active simultaneously, but it illustrates the full system), the optimization problem has its solution presented as</p> \begin{aligned} \begin{bmatrix}\Sigma &amp; r &amp; e &amp; I &amp; I\\ r^T &amp; 0 &amp; 0 &amp; 0 &amp; 0\\ e^T &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\I &amp; 0 &amp; 0 &amp; 0 &amp; 0\\ I &amp; 0 &amp; 0 &amp; 0 &amp; 0\end{bmatrix} \begin{bmatrix}w\\ -\lambda_1\\ -\lambda_2\\ -\theta\\ \phi\end{bmatrix} &amp;= \begin{bmatrix}0\\ R\\ 1\\ b_L\\ b_U\end{bmatrix} \\ \therefore\quad \begin{bmatrix}w\\ -\lambda_1\\ -\lambda_2\\ -\theta\\ \phi\end{bmatrix} &amp;= \begin{bmatrix}\Sigma &amp; r &amp; e &amp; I &amp; I\\ r^T &amp; 0 &amp; 0 &amp; 0 &amp; 0\\ e^T &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\I &amp; 0 &amp; 0 &amp; 0 &amp; 0\\ I &amp; 0 &amp; 0 &amp; 0 &amp; 0\end{bmatrix}^{-1} \begin{bmatrix}0\\ R\\ 1\\ b_L\\ b_U\end{bmatrix} \end{aligned} <p>and if some constraints are inactive, the corresponding rows and columns above shall be removed.
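</p>

<p>As an independent cross-check (not part of the original derivation), the same bounded quadratic program can be handed to a generic solver, assuming SciPy is available; the sketch below uses made-up toy numbers and SciPy's SLSQP method:</p>

```python
import numpy as np
from scipy.optimize import minimize

# Assumed toy data: 2 assets, target return R, weights bounded in [0, 1]
Sigma = np.array([[0.10, 0.02],
                  [0.02, 0.08]])
r = np.array([0.12, 0.07])
R = 0.10

# Minimize (1/2) w^T Sigma w subject to w^T r = R, w^T e = 1, and weight bounds
res = minimize(
    lambda w: 0.5 * w @ Sigma @ w,
    x0=np.full(2, 0.5),
    method="SLSQP",
    bounds=[(0.0, 1.0)] * 2,
    constraints=[{"type": "eq", "fun": lambda w: w @ r - R},
                 {"type": "eq", "fun": lambda w: w.sum() - 1.0}],
)
assert res.success
w = res.x
assert np.isclose(w @ r, R, atol=1e-6)                  # return constraint holds
assert np.isclose(w.sum(), 1.0, atol=1e-6)              # budget constraint holds
assert np.allclose(np.clip(w, 0.0, 1.0), w, atol=1e-8)  # bounds hold
```

<p>Such a solver does not enumerate activeness combinations explicitly, but its result should agree with the combinatorial KKT approach on the same inputs, which makes it a convenient sanity check.</p>

<p>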
After checking all combinations of activeness, the best solution based on the objective function is selected.</p> <h2 id="implementation">Implementation</h2> <p>The function below shows how the above optimization can be solved numerically. It tries out all combinations of activeness and finds the solution using the matrix equation described above. The best solution is then returned.</p> <pre><code class="language-python">import numpy as np

def markowitz(ret, cov, r, lb=np.nan, ub=np.nan):
    """Markowitz minimizer with bounds constraints for a specified portfolio return

    Args:
        ret: A vector of N asset returns
        cov: NxN matrix of covariance of asset returns
        r (float): portfolio return to achieve
        lb, ub (float or vector): lowerbound and upperbound for the portfolio
            weights; if float, all weights are subject to the same bound

    Returns:
        A (3N+2) vector of the portfolio weights and the Lagrange multipliers,
        or None if no solution can be found
    """
    # Sanitation
    ret = np.array(ret).squeeze()
    cov = np.array(cov).squeeze()
    r = float(r)
    N = len(ret)
    if ret.shape != (N,):
        raise ValueError("Asset returns ret should be a vector")
    if cov.shape != (N, N):
        raise ValueError("Covariance matrix cov should be in shape ({},{}) to match the return vector".format(N, N))
    if isinstance(lb, (float, int)):
        lb = np.ones(N) * lb
    if isinstance(ub, (float, int)):
        ub = np.ones(N) * ub
    lb = lb.squeeze()
    ub = ub.squeeze()
    if lb.shape != (N,):
        raise ValueError("Lowerbound lb should be in shape (%d,) to match the return vector" % N)
    if ub.shape != (N,):
        raise ValueError("Upperbound ub should be in shape (%d,) to match the return vector" % N)
    if (lb &gt; ub).any():
        raise ValueError("Lowerbound must be no greater than upperbound")
    # Construct matrices as templates for the equation AX=B
    A = np.zeros((N+2+N+N, N+2+N+N))
    A[:N, :N] = cov
    A[:N, N] = A[N, :N] = ret
    A[:N, N+1] = A[N+1, :N] = np.ones(N)
    A[:N, N+2:N+N+2] = A[N+2:N+N+2, :N] = A[:N, N+N+2:] = A[N+N+2:, :N] = np.eye(N)
    b = np.zeros((N+2+N+N, 1))
    b[N:N+2, 0] = [r, 1]
    b[N+2:N+N+2, 0] = lb
    b[N+N+2:, 0] = ub
    # Try all activeness combinations and track the best result to minimize objective
    bitmaps = 2**(2*N)
    best_obj = np.inf
    best_vector = None
    for bitmap in range(bitmaps):
        # constraints 0 to N-1 are for lowerbound and N to 2N-1 are for upperbound;
        # row/column N+2+i corresponds to constraint i; a set bit marks it inactive
        inactive = [N+2+i for i in range(2*N) if bitmap &amp; (2**i)]
        active = [N+2+i for i in range(2*N) if not bitmap &amp; (2**i)]
        # verify no conflicting active constraints, i.e., the lowerbound and the
        # upperbound of the same weight must not be active at the same time
        if any(N+i in active for i in active):
            continue  # conflicting activeness found, skip this one
        # Delete some rows and columns from the template for this activeness combination
        A_ = np.delete(np.delete(A, inactive, axis=0), inactive, axis=1)
        b_ = np.delete(b, inactive, axis=0)
        # Solve and check using matrix algebra
        try:
            x_ = (np.linalg.inv(A_) @ b_).squeeze()
            w = x_[:N]
            if (w &lt; lb).any() or (w &gt; ub).any():
                continue  # solution not in feasible domain, try next one
            obj_val = w @ cov @ w  # portfolio variance, i.e., twice the objective function
            if obj_val &lt; best_obj:
                # Lower variance found, save the solution vector
                best_obj = obj_val
                x = np.zeros(N+2+N+N)
                x[:N+2] = x_[:N+2]    # w and negative lambda
                x[active] = x_[N+2:]  # negative theta and phi
                x[N:N+2+N] *= -1      # lambda and theta are negated
                best_vector = x
        except np.linalg.LinAlgError:
            pass  # no solution found for this combination
    return best_vector
</code></pre>Adrian S. Tamrighthandabacus@users.github.comA portfolio optimization problem in Markowitz style looks like the followinghtop cheatsheet2021-05-25T00:00:00-04:002021-05-25T00:00:00-04:00https://www.adrian.idv.hk/htop<p><code>htop</code> is useful and brings forth very rich information on one screen. Here is the cheatsheet to understand it:</p> <p><img src="/img/htop.png" alt="htop cheatsheet" /></p> <p>The source <a href="/img/htop.key">Keynote file is available</a>.
Of course, there is a <a href="https://peteris.rocks/blog/htop/">more detailed explanation</a>, as well as pressing <code>h</code> for the help screen.</p>Adrian S. Tamrighthandabacus@users.github.comhtop is useful and brings forth very rich information on one screen. Here is the cheatsheet to understand it: