Think about it for a short while and anyone can see that there is no single definitive way to define a percentile. What is surprising is how many different ways there are to define it.

A 1996 paper by Hyndman & Fan proposes a generic model for what a percentile function can be and enumerates nine different variations of it.

# Terms and definitions

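Percentile, or any quantile, is a rank statistic. Assume we have a set of $n$ measurements $X = \{x_1, \dots, x_n\}$ and the version in ascending order of magnitude $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$. Roughly speaking, the quantile function $Q(p)$, defined over $p \in [0,1]$, gives some value $x$ such that the number of measurements $x_i \le x$ is fraction $p$ of $n$:

$$Q(p) = x \quad \text{such that} \quad \frac{\left|\{i : x_i \le x\}\right|}{n} \approx p$$

We call this a percentile function if $p$ is expressed as a percentage.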

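There are several characteristics that the quantile function may satisfy. As mentioned in Hyndman & Fan (1996),

- Surjective from $[0,1]$ to $[x_{(1)}, x_{(n)}]$: $Q(p)$ is continuous on $p \in [0,1]$
- Approximation of distribution function: $p$ gives the lower bound of the fraction of observations in $X$ no greater than $Q(p)$, i.e. $\frac{1}{n}\left|\{i : x_i \le Q(p)\}\right| \ge p$
- Quantiles are based on samples arranged evenly and symmetrically on $[0,1]$: $Q^{-1}(x_{(k)}) + Q^{-1}(x_{(n-k+1)}) = 1$
- Min and max in $X$ map into $[0,1]$: $Q^{-1}(x_{(1)}) \ge 0$ and $Q^{-1}(x_{(n)}) \le 1$
- $Q(0.5)$ is the median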

# Classes of quantile functions

A quantile function can be a step function or a piecewise linear continuous function.

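Hyndman & Fan proposed a generic form for the quantile function:

$$Q(p) = (1 - \gamma)\, x_{(j)} + \gamma\, x_{(j+1)}$$

such that $\frac{j-m}{n} \le p < \frac{j-m+1}{n}$ for some $m \in \mathbb{R}$ and $0 \le \gamma \le 1$, where $\gamma$ is a function of $j = \lfloor pn + m \rfloor$ and $g = pn + m - j$.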
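The generic form can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name `hf_quantile`, the callables `m` and `gamma`, and the clamping of out-of-range indices to the first and last order statistics are choices made here, not part of the paper.

```python
import math

def hf_quantile(xs, p, m, gamma):
    """Generic form Q(p) = (1 - gamma) * x_(j) + gamma * x_(j+1).

    m     -- callable p -> m, the shift (it may depend on p)
    gamma -- callable (j, g) -> weight in [0, 1]
    """
    x = sorted(xs)
    n = len(x)
    h = p * n + m(p)
    j = math.floor(h)   # integer part of pn + m
    g = h - j           # fractional part of pn + m
    # x_(0) and x_(n+1) do not exist; clamping to the ends is an
    # assumption of this sketch.
    lo = x[min(max(j, 1), n) - 1]
    hi = x[min(max(j + 1, 1), n) - 1]
    w = gamma(j, g)
    return (1 - w) * lo + w * hi

data = [1, 2, 3, 4]
# m = 0 with gamma = 1 for g > 0, else 0: a step function
print(hf_quantile(data, 0.5, lambda p: 0, lambda j, g: 1 if g > 0 else 0))  # 2
# m = 1 - p with gamma = g: linear interpolation
print(hf_quantile(data, 0.5, lambda p: 1 - p, lambda j, g: g))  # 2.5
```

Passing `m` and `gamma` as callables keeps one function for every variant; each definition then differs only in those two arguments.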

## Step functions

Wikipedia covers one step function variant, the *nearest rank method*, which defines $Q(p) = x_{(k)}$ such that $k = \lceil pn \rceil$. This is similar to the *1st definition* of Hyndman & Fan, which has the properties that

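- it is a step function with jumps at $p = k/n$ for integral $k$.
- it is càglàd (left-continuous): each step is a horizontal line segment with an open end on the left and a closed end on the right, i.e. $\left(\frac{k-1}{n}, \frac{k}{n}\right]$ instead of $\left[\frac{k-1}{n}, \frac{k}{n}\right)$ (the Wikipedia article's definition uses $\lceil pn \rceil$, which gives the same half-open intervals).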

We can vary how this function handles the end points of the steps. The *2nd definition* of Hyndman & Fan uses the midpoint, i.e., the step function still has jumps at $p = k/n$ but at those particular points it is defined as $Q(k/n) = \frac{1}{2}\left(x_{(k)} + x_{(k+1)}\right)$.

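Formally, with $g$ the fractional part of $pn + m$, the 1st definition has $m = 0$ (so jumps at $p = k/n$), $\gamma = 1$ if $g > 0$ or $\gamma = 0$ otherwise. The 2nd definition has $m = 0$, $\gamma = 1$ for $g > 0$ and $\gamma = \frac{1}{2}$ for $g = 0$.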

The *3rd definition* of the paper gives an alternative way to jump: the step function has jumps at the midpoint of $k/n$ and $(k+1)/n$ instead. Each step is a line segment that is closed on both ends if it is the step on $x_{(k)}$ for an even number $k$ (even steps), or open on both ends otherwise (odd steps). Formally, $m = -\frac{1}{2}$, $\gamma = 0$ if $g = 0$ and $j$ is even, or $\gamma = 1$ otherwise. This is the implementation in SAS.
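The three step definitions can be compared side by side with a short sketch. The function names `q1`, `q2`, `q3` and the helper `_pick` are hypothetical, and clamping the index at the two ends is an assumption made here. `Fraction` is used so that the test "is $pn$ an integer" is exact for the given `p`.

```python
import math
from fractions import Fraction

def _pick(x, k):
    """Return x_(k), with k clamped to 1..n (clamping is an assumption)."""
    return x[min(max(k, 1), len(x)) - 1]

def q1(xs, p):
    """1st definition: m = 0, take x_(ceil(pn))."""
    x = sorted(xs)
    return _pick(x, math.ceil(Fraction(p) * len(x)))

def q2(xs, p):
    """2nd definition: as q1, but average the neighbours at a jump."""
    x = sorted(xs)
    h = Fraction(p) * len(x)
    if h.denominator == 1:          # pn is an integer, i.e. g = 0
        k = int(h)
        return (_pick(x, k) + _pick(x, k + 1)) / 2
    return _pick(x, math.ceil(h))

def q3(xs, p):
    """3rd definition: m = -1/2, a jump point goes to the even index."""
    x = sorted(xs)
    h = Fraction(p) * len(x) - Fraction(1, 2)
    j = math.floor(h)
    if h == j and j % 2 == 0:       # g = 0 and j even: gamma = 0
        return _pick(x, j)
    return _pick(x, j + 1)          # otherwise gamma = 1

data = [10, 20, 30, 40]
print(q1(data, 0.5), q2(data, 0.5), q3(data, 0.5))  # 20 25.0 20
```

With four data points, `q3` returns 20 at both `p = 0.375` and `p = 0.625`, which shows the even step being closed on both ends.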

## Interpolated functions

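This is not a step function but a monotonically increasing continuous function. Instead of jumps, the function is made of connected line segments. Wikipedia gives the following generic form: $Q(p)$ such that

$$Q(p) = x_{(k)} + \frac{p - p_k}{p_{k+1} - p_k}\left(x_{(k+1)} - x_{(k)}\right)$$

where $p_k \le p \le p_{k+1}$ and $p_k$ is the percentile assigned to $x_{(k)}$ for $k = 1, \dots, n$. $Q(p)$ is interpolated from the adjacent measurements $x_{(k)}$ and $x_{(k+1)}$.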

This is used in Excel. The function `PERCENTILE.INC(array,k)` has $p_k = \frac{k-1}{n-1}$, so $p_1 = 0$ and $p_n = 1$, and `PERCENTILE.EXC(array,k)` has $p_k = \frac{k}{n+1}$, so $p_1 = \frac{1}{n+1}$ and $p_n = \frac{n}{n+1}$. The former, also the *7th definition* in Hyndman & Fan, defines the 0th and 100th percentiles as the min and max value of $X$ respectively. The latter, also known as the *6th definition* in Hyndman & Fan, however, has the min above the 0th percentile and the max below the 100th percentile (assuming $X$ is a sample and the sample min/max is not the population min/max).

Matlab has $p_k = \frac{k - 0.5}{n}$, or equivalently $p_k = \frac{2k-1}{2n}$. As with `PERCENTILE.EXC(array,k)`, the min and max are not on the 0th and 100th percentile. This is also the *5th definition* in Hyndman & Fan.

These three variations can be understood from a histogram setting: assume we have to lay out the $n$ data points as a histogram on $[0,1]$:

- The $p_k = \frac{k-1}{n-1}$ case: we can split $[0,1]$ into $n-1$ equal segments and put the data points at the segment boundaries (including 0 and 1)
- The $p_k = \frac{k-0.5}{n}$ case: we can split $[0,1]$ into $n$ equal segments and put the data points at the middle of each segment
- The $p_k = \frac{k}{n+1}$ case: we can split $[0,1]$ into $n+1$ equal segments and put the data points at the segment boundaries (*not* including 0 and 1)

After the data points are placed, $Q(p)$ is the function that connects them with straight line segments.
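The histogram picture translates directly into code: place each sorted data point at its position on $[0,1]$, then join the points with straight lines. The sketch below is an illustration only. The names `interp_quantile`, `inclusive`, `midpoint` and `exclusive` are hypothetical, at least two data points are assumed, and values of `p` outside the outermost positions are clamped to the min or max, which is a simplification and not necessarily what any particular package does.

```python
def interp_quantile(xs, p, position):
    """Linearly interpolate between the points (p_k, x_(k)).

    position -- callable (k, n) -> p_k for k = 1..n
    Outside [p_1, p_n] the result is clamped to the min/max
    (an assumption of this sketch).
    """
    x = sorted(xs)
    n = len(x)
    pk = [position(k, n) for k in range(1, n + 1)]
    if p <= pk[0]:
        return x[0]
    if p >= pk[-1]:
        return x[-1]
    for i in range(n - 1):
        if pk[i] <= p <= pk[i + 1]:
            t = (p - pk[i]) / (pk[i + 1] - pk[i])
            return x[i] + t * (x[i + 1] - x[i])

def inclusive(k, n):   # boundaries of n-1 segments, including 0 and 1
    return (k - 1) / (n - 1)

def midpoint(k, n):    # middle of each of n segments
    return (k - 0.5) / n

def exclusive(k, n):   # interior boundaries of n+1 segments
    return k / (n + 1)

data = [10, 20, 30, 40]
for pos in (inclusive, midpoint, exclusive):
    print(pos.__name__, round(interp_quantile(data, 0.25, pos), 6))
# inclusive 17.5
# midpoint 15.0
# exclusive 12.5
```

The same data and the same `p` give three different 25th percentiles, which is the whole point of the comparison; at `p = 0.5` all three agree on 25.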

Hyndman & Fan have three more variations of the interpolation function:

- *4th definition*: $p_k = \frac{k}{n}$, so the max is the 100th percentile. The histogram representation is to split $[0,1]$ into $n$ equal segments as above, but locate each data point at the right end of its segment.
- *8th definition*: $p_k = \frac{k - 1/3}{n + 1/3}$. The histogram representation is to split $[0,1]$ into $3n+1$ equal segments; from the end of the 2nd segment up to the end of the $(3n-1)$-th segment there are $3(n-1)$ segments. We place the data points evenly in between, including the two ends, so adjacent points are 3 segments apart. The paper describes the resulting quantile estimates as approximately median-unbiased regardless of the distribution of $X$.
- *9th definition*: similar to the above but defines $p_k = \frac{k - 3/8}{n + 1/4}$. The paper describes the resulting estimates as approximately unbiased for the expected order statistics when the sample is drawn from a normal distribution.

# Inverse quantile function

In Excel, the inverse of the percentile function is called percent rank (`PERCENTRANK.INC(array,x)` and `PERCENTRANK.EXC(array,x)`). The inverse function is similarly confusing to define, and it poses an additional problem.

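In case of duplicated values in the order statistics $x_{(k)}$, e.g., $x_{(k)} = x_{(k+1)} = x$, what should $Q^{-1}(x)$ be?

Given the inverse property, we know that $Q^{-1}(x_{(k-1)}) = p_{k-1}$ for $x_{(k-1)} < x_{(k)}$, and $Q^{-1}(x_{(k+2)}) = p_{k+2}$ is defined similarly.

Excel uses $Q^{-1}(x) = p_k$. But it is equally reasonable to use $p_{k+1}$ or $\frac{1}{2}(p_k + p_{k+1})$.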
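The ambiguity with duplicated values can be made concrete with a small sketch. It assumes the inclusive positions $p_k = \frac{k-1}{n-1}$; the function name `percent_rank` and its `tie` parameter are hypothetical, and it only handles values that actually occur in the data (no interpolation between data points).

```python
def percent_rank(xs, x, tie="min"):
    """Inverse of the inclusive quantile, using p_k = (k - 1) / (n - 1).

    tie -- which rank to use when x occurs more than once:
           "min" (first occurrence), "max" (last occurrence) or "mean".
    Returns None if x is not a data point or there are fewer than 2 points.
    """
    s = sorted(xs)
    n = len(s)
    ranks = [k for k, v in enumerate(s, start=1) if v == x]
    if not ranks or n < 2:
        return None
    k = {
        "min": min(ranks),
        "max": max(ranks),
        "mean": sum(ranks) / len(ranks),
    }[tie]
    return (k - 1) / (n - 1)

data = [1, 2, 2, 3]
for tie in ("min", "max", "mean"):
    print(tie, percent_rank(data, 2, tie))
```

For the duplicated value 2, the three rules return one third, two thirds and one half respectively, and each is a defensible inverse.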

## Bibliographic data

```
@article{hyndman1996sample,
title = "Sample Quantiles in Statistical Packages",
author = "Rob J. Hyndman and Yanan Fan",
year = "1996",
journal = "The American Statistician",
volume = "50",
number = "4",
month = "Nov",
pages = "361--365",
doi = "10.1080/00031305.1996.10473566",
}
```