Think about it for a short while, anyone can see that there is no definitive way to define what is a percentile. What is surprising is how many different ways we can define it.

In a 1996 paper, there is a generic model proposed for what a percentile function can be and nine different variations enumerated.

# Terms and definition

Percentile, or any quantile, is a rank statistic. Assume we have a set of measurements \(X_1,\cdots,X_n\) and the version in ascending order of magnitude \(\mathbf{X}=\{X_{(1)}, X_{(2)}, \cdots, X_{(n)}\}\). Roughly speaking, the quantile function \(Q(p)\), defined over \(p\in[0,1]\), gives some value \(X_{(k)}\in\mathbf{X}\) such that the number of measurements is fraction \(p\) of \(|\mathbf{X}|\):

\[\frac{1}{n}|\{X_i: X_i\in\mathbf{X}, X_i\le X_{(k)} \}| \approx p.\]We call this a percentile function if \(p\) is defined in percentage.

There are several characteristics that the quantile function \(Q(p)\) may satisfy. As mentioned in Hyndman & Fan (1996),

- Surjective from \([0,1]\) to \(\mathbf{X}\): \(Q(p)\) is continuous on \(p\)
- Approximation of distribution function: \(pn\) gives the lower bound of the
number of observations \(X_i\) in \(\mathbf{X}\) less than \(Q(p)\),

\(|\{X_i: X_i\in\mathbf{X}, X_i\le Q(p)\}| \ge pn\) - Quantile are based on samples arranged evenly and symmetrically on \([0,1]\):

\(\begin{gather} |\{X_i: X_i\in\mathbf{X}, X_i\le Q(p)\}| = |\{X_i: X_i\in\mathbf{X}, X_i\ge Q(1-p)\}| \\ Q^{-1}(X_{(k)}) + Q^{-1}(X_{(n-k+1)}) = 1 \end{gather}\) - Min and max in \(\mathbf{X}\) are in \([0,1]\):

\(Q^{-1}(X_{(1)}) \ge 0, Q^{-1}(X_{(n)}) \le 1\) - \(Q(0.5)\) is the median

# Classes of quantile functions

Quantile function can look like a step function, or a piecewise linear continuous function.

Hyndman & Fan proposed a generic form for quantile function:

\[Q_i(p) = (1-\gamma)X_{(j)} + \gamma X_{(j+1)}\]such that

\[\begin{gather} \frac{j-m}{n} \le p < \frac{j-m+1}{n} \\ \gamma \in [0,1] \\ j = \lfloor pn + m \rfloor \\ g = pn + m - j \end{gather}\]## Step functions

Wikipedia covers one step function
variant, the *nearest rank method*, which defined \(Q(p)=X_{(j)}\) such that
\(j=\lceil pn \rceil\). This is similar to the *1st definition* of Hyndman & Fan,
which has the properties that

- step function with jumps on \(p=j/n\) for integral \(j\).
- caglad: each step is a horizontal line segment with closed end on left and open end on right, i.e. \(Q(j/n) = X_{(j+1)}\) instead of \(X_{(j)}\) (Wikipedia article’s definition use \(Q(j/n)=X_{(j)}\)).

We can have variation on how to handle the end points of steps to this function.
The *2nd definition* of Hyndman & Fan is to use midpoint, i.e., the step
function still have jumps on \(p=j/n\) but has the particular cases defined
\(Q(j/n)=\frac{1}{2}(X_{(j+1)} + X_{(j)})\).

Formally, the 1st definition has \(m=0\) (so jumps on \(p=j/n\)), \(\gamma=1\) if \(g>0\) or 0 otherwise. The 2nd definition has \(m=0\), \(\gamma=\frac{1}{2}\) for \(g=0\) and \(\gamma=1\) for \(g>0\).

The *3rd definition* of the paper gives an alternative way to jump: The step
function has jumps at the midpoint of \(p=j/n\) and \(p=(j+1)/n\) instead. And each
step is a line segment that is closed on both end if it is a step on \(j/n\) for
\(j\) an even number (even steps) or the line segment is open on both end
otherwise (odd steps). Formally, \(m=-\frac{1}{2}\), \(\gamma=0\) if \(g=0\) and \(j\)
is even, or \(\gamma=1\) otherwise. This is the implementation in SAS.

## Interpolated functions

This is not a step function but a monotonically increasing continuous function. Instead of jumps, the function is a connected line segments. Wikipedia gives the following generic form: \(Q(p)=X_{(j)}\) such that

\[j = f(p) = (n+c_1)p+c_2 = (n+1-2c)p+c\]where \(c\in[0,1]\) and for \(j\notin\mathbb{N}\). \(X_{(j)}\) is interpolated from adjacent measurements \(X_{(\lfloor j \rfloor)}\) and \(X_{(\lceil j \rceil)}\). Hence this is not a step function.

This is used in Excel. The function `PERCENTILE.INC(array,k)`

has \(c=1\) and
\(p=\frac{j-1}{n-1}\) and `PERCENTILE.EXC(array,k)`

has \(c=0\) and
\(p=\frac{j}{n+1}\). The former, also as the *7th definition* in Hyndman & Fan,
defined the 0th and 100th percentile as the min and max value of \(\mathbb{X}\)
respectively. The latter, also known as the *6th definition* in Hyndman & Fan,
however, has the min above 0th percentile and the max below 100th percentile
(assuming \(\mathbf{X}\) is a sample and the sample min/max is not the population
min/max).

Matlab has \(c=\frac{1}{2}\) or \(p=\frac{j-1/2}{n}\). Same as
`PERCENTILE.EXC(array,k)`

, the min and max are not on 0th and 100th percentile.
This is also the *5th definition* in Hyndman & Fan.

These three variation can be understood from a histogram setting: Assume we have to layout the \(n\) data points as histogram on \([0,1]\):

- The \(c=1\) case: We can split \([0,1]\) into \(n-1\) equal segments and put the \(n\) data points at the \(n\) segment boundaries (including 0 and 1)
- The \(c=\frac{1}{2}\) case: We can split \([0,1]\) into \(n\) equal segments and put the \(n\) data points at the middle of each segment
- The \(c=0\) case: We can split \([0,1]\) into \(n+1\) equal segments and put the
\(n\) data points at the \(n\) segment boundaries (
*not*including 0 and 1)

After the data points are placed, \(Q(p)\) is the function that connected them with straight line segments.

Hyndman & Fan has three more variation of the interpolation function:

*4th definition*: \(p=\frac{j}{n}\), max is the 100th percentile. Histogram representation is to split \([0,1]\) into \(n\) equal segments like the \(c=\frac{1}{2}\) way above but locate the data point at right end of each segment*8th definition*: \(p=\frac{k-1/3}{n+1/3}\). Histogram representation is to split \([0,1]\) into \(n+\frac{1}{3}\) equal segments, and from the \(\frac{2}{3}\)-th segment onward up to \((n-\frac{1}{3})\)-th segment, there are \(n-1\) segments apart. We place the \(n\) data points evenly in between, including the ends on the \(\frac{2}{3}\)-th and \((n-\frac{1}{3})\)-th. This is the optimal estimator for median, if \(\mathbf{X}\) is a sample.*9th definition*: similar to above but defines \(p=\frac{k-3/8}{n+1/4}\). This is a better estimator for mean for samples drawn from normal distribution.

# Inverse quantile function

In excel, the inverse of percentile function is called percentile rank
(`PERCENTILERANK.INC(array,x)`

and `PERCENTILERANK.EXC(array,x)`

). The inverse
function is similarly confusing to define and additionally imposed another
problem.

In case of duplicated values in the order statistics \(\mathbf{X}\), e.g., \(X_{(j)}=X_{(j+1)}\), what should be \(Q^{-1}(X_{(j)})\)?

Given the inverse property, we knows that \(p_{\min} \le Q^{-1}(X_{(j)}) \le p_{\max}\) for \(p_{\min} = \min(p: Q(p)=X_{(j)})\) and \(p_{\max}\) defined similarly.

Excel uses \(Q^{-1}(X_{(j)}) = p_{\min}\). But it is equally reasonable for \(Q^{-1}(X_{(j)}) = p_{\max}\) and \(Q^{-1}(X_{(j)}) = \frac{1}{2}(p_{\max}+p_{\min})\).

## Bibliographic data

```
@article{
title = "Sample Quantiles in Statistical Packages",
author = "Rob J. Hyndman and Yanan Fan",
year = "1996",
journal = "The American Statistician",
volume = "50",
number = "4",
month = "Nov",
pages = "361-365",
doi = "http://dx.doi.org/10.1080/00031305.1996.10473566",
}
```