QQ-plot and PP-plot

Both QQ-plot and PP-plot are called the probability plot, but they are different. These plots are intended to compare two distributions, usually at least one of them is empirical. It is to graphically tell how good the two distributions fit.

Assume the two distributions have the cumulative distribution functions \(F(x)\) and \(G(x)\), the PP-plot is to show \(G(x)\) against \(F(x)\) for varying \(x\). Hence the domain and range of the plot is always from 0 to 1, as we are plotting only the range of the cumulative distribution functions.

QQ-plot, however, is to plot the inverse cumulative distribution function \(G^{-1}(x)\) against \(F^{-1}(x)\) for varying \(x\in[0,1]\). Therefore the domain and range of the plot are the support of the cumulative distribution functions \(F(x)\) and \(G(x)\). If we consider the data are empirical, we can see this as the plot of the order statistic of \(G(x)\) against that of \(F(x)\).

Tools

In Python, it is a surprise that matplotlib does not support making PP-plot nor QQ-plot out of the box. However, it should not be difficult to see that we can make use of the order statistics to do the QQ-plot:

import numpy as np
import matplotlib.pyplot as plt

N = 1000 # number of samples
rv_norm = np.random.randn(N) * 2 + 1  # normal with mean 1 s.d. 2
rv_uni = np.random.rand(N) * 8 - 4    # uniform in [-4,4]

plt.scatter(np.sort(rv_uni). np.sort(rv_norm), alpha=0.2, s=2)

QQplot

PP-plot is a bit more complicated. We need to use interpolation function to achieve that. The idea is that, the CDF of an empirical distribution can be constructed using np.sort(rv_norm) and np.linspace(0,1,N). Then with the other distribution, we can look up the value of the CDF by interpolation using np.interp(x0, x, y), which is to return \(y_0 = f(x_0)\) from the provided curve \(y=f(x)\):

plt.scatter(np.linspace(0,1,N), np.interp(np.sort(rv_uni), np.sort(rv_norm), np.linspace(0,1,N)), alpha=0.2, s=2)

PPplot

However, there is a fancier tool in Python to do this. statsmodels has functions qqplot() and qqplot_2samples() for doing QQ-plot of one empirical against theoretical normal distribution, and between two empirical distributions, respectively. But it is just a wrapper for the more generic ProbPlot object. For example, this is how we can do the same as above:

import statsmodels.api as sm

_ = sm.ProbPlot(rv_uni) \
      .qqplot(other=sm.ProbPlot(rv_norm), line="r", alpha=0.2, ms=2, lw=1)

plt.show()
_ = sm.ProbPlot(rv_uni) \
      .ppplot(other=sm.ProbPlot(rv_norm), line="r", alpha=0.2, ms=2, lw=1)
plt.show()

its output will go with the regression line (line="r"):

QQplot from statsmodels

PPplot from statsmodels

QQ-plot and PP-plot as EDA tool

When we get a table of data the first time, we would like to get some insight from it before further processing it. This is what the exploratory data analysis is about. For a multidimensional data set, my favorite is to run a correlogram to see how the data looks like, visually:

import seaborn as sns
sns.pairplot(df_data)

correlogram

This graph is generated using Seaborn, a wrapper for matplotlib. We can make the graph prettier, for example, draw the regression line between each pair of data and show the CDF empirically found by KDE instead of histogram:

sns.pairplot(df_data, diag_kind="kde", kind="reg",
             plot_kws={'line_kws':{'color':'red', 'alpha':0.2}, 'scatter_kws': {'alpha': 0.2, 's':4}},
            )
plt.show()

KDE correlogram

We call the same Seaborn function pairplot() with the kind="reg" (regression) for off-diagonal charts and diag_kind="kde" for on-diagonal charts. This tells you how correlated are any two series and the distribution of each sample. In this graph, we do not see any particularly strong correlation. So how about the series are independent but in the similar distribution? This can be answered by a PP-plot of each pair. Unfortunately, PP-plot and QQ-plot are not supported by Seaborn. Nevertheless we an add this. Here is the patch file, we need only to modify axisgrid.py and regression.py:

diff --git a/seaborn/axisgrid.py b/seaborn/axisgrid.py
index ba70553..7a9d836 100644
--- a/seaborn/axisgrid.py
+++ b/seaborn/axisgrid.py
@@ -1959,6 +1959,7 @@ def pairplot(
     """
     # Avoid circular import
     from .distributions import histplot, kdeplot
+    from .regression import qqplot, ppplot  # Avoid circular import
 
     # Handle deprecations
     if size is not None:
@@ -1992,7 +1993,7 @@ def pairplot(
     # Add the markers here as PairGrid has figured out how many levels of the
     # hue variable are needed and we don't want to duplicate that process
     if markers is not None:
-        if kind == "reg":
+        if kind in ["reg", "pp", "qq"]:
             # Needed until regplot supports style
             if grid.hue_names is None:
                 n_markers = 1
@@ -2020,6 +2021,10 @@ def pairplot(
         diag_kws.setdefault("fill", True)
         diag_kws.setdefault("warn_singular", False)
         grid.map_diag(kdeplot, **diag_kws)
+    elif diag_kind == "pp":
+        grid.map_diag(ppplot, **diag_kws)
+    elif diag_kind == "qq":
+        grid.map_diag(qqplot, **diag_kws)
 
     # Maybe plot on the off-diagonals
     if diag_kind is not None:
@@ -2030,6 +2035,10 @@ def pairplot(
     if kind == "scatter":
         from .relational import scatterplot  # Avoid circular import
         plotter(scatterplot, **plot_kws)
+    elif kind == "qq":
+        plotter(qqplot, **plot_kws)
+    elif kind == "pp":
+        plotter(ppplot, **plot_kws)
     elif kind == "reg":
         from .regression import regplot  # Avoid circular import
         plotter(regplot, **plot_kws)
diff --git a/seaborn/regression.py b/seaborn/regression.py
index ce21927..cc366d1 100644
--- a/seaborn/regression.py
+++ b/seaborn/regression.py
@@ -20,7 +20,7 @@ from .axisgrid import FacetGrid, _facet_docs
 from ._decorators import _deprecate_positional_args
 
 
-__all__ = ["lmplot", "regplot", "residplot"]
+__all__ = ["lmplot", "regplot", "residplot", "ppplot", "qqplot"]
 
 
 class _LinearPlotter(object):
@@ -833,6 +833,91 @@ lmplot.__doc__ = dedent("""\
     """).format(**_regression_docs)
 
 
+@_deprecate_positional_args
+def qqplot(
+    *,
+    x=None, y=None,
+    data=None,
+    x_estimator=None, x_bins=None, x_ci="ci",
+    scatter=True, fit_reg=True, ci=95, n_boot=1000, units=None,
+    seed=None, order=1, logistic=False, lowess=False, robust=False,
+    logx=False, x_partial=None, y_partial=None,
+    truncate=True, dropna=True, x_jitter=None, y_jitter=None,
+    label=None, color=None, marker="o",
+    scatter_kws=None, line_kws=None, ax=None,
+    legend=None
+):
+
+    plotter = _RegressionPlotter(x, y, data, x_estimator, x_bins, x_ci,
+                                 scatter, fit_reg, ci, n_boot, units, seed,
+                                 order, logistic, lowess, robust, logx,
+                                 x_partial, y_partial, truncate, dropna,
+                                 x_jitter, y_jitter, color, label)
+
+    # Manipulate input data for plotting
+    if plotter.x is None:
+        err = "missing x or y in plot data"
+        raise ValueError(err)
+    if plotter.y is None:
+        # set it to normal distribution scaled according to x
+        from scipy.stats import norm
+        plotter.y = norm.ppf(np.linspace(0,1,len(plotter.x)+2)[1:-1])
+        plotter.y = plotter.y * plotter.x.std() + plotter.x.mean()
+    plotter.x = np.sort(plotter.x)
+    plotter.y = np.sort(plotter.y)
+
+    if ax is None:
+        ax = plt.gca()
+
+    scatter_kws = {} if scatter_kws is None else copy.copy(scatter_kws)
+    scatter_kws["marker"] = marker
+    line_kws = {} if line_kws is None else copy.copy(line_kws)
+    plotter.plot(ax, scatter_kws, line_kws)
+    return ax
+
+
+@_deprecate_positional_args
+def ppplot(
+    *,
+    x=None, y=None,
+    data=None,
+    x_estimator=None, x_bins=None, x_ci="ci",
+    scatter=True, fit_reg=True, ci=95, n_boot=1000, units=None,
+    seed=None, order=1, logistic=False, lowess=False, robust=False,
+    logx=False, x_partial=None, y_partial=None,
+    truncate=True, dropna=True, x_jitter=None, y_jitter=None,
+    label=None, color=None, marker="o",
+    scatter_kws=None, line_kws=None, ax=None,
+    legend=None
+):
+
+    plotter = _RegressionPlotter(x, y, data, x_estimator, x_bins, x_ci,
+                                 scatter, fit_reg, ci, n_boot, units, seed,
+                                 order, logistic, lowess, robust, logx,
+                                 x_partial, y_partial, truncate, dropna,
+                                 x_jitter, y_jitter, color, label)
+
+    # Manipulate input data for plotting
+    if plotter.x is None:
+        err = "missing x in plot data"
+        raise ValueError(err)
+    if plotter.y is None:
+        # set it to normal distribution
+        from scipy.stats import norm
+        plotter.y = norm.ppf(np.linspace(0,1,len(plotter.x)+2)[1:-1])
+    linspace = np.linspace(0,1,len(plotter.x))
+    plotter.y = np.interp(np.sort(plotter.x), np.sort(plotter.y), linspace)
+    plotter.x = linspace
+    if plotter.fit_reg:
+        plotter.x_range = (0, 1)
+
+    if ax is None:
+        ax = plt.gca()
+
+    scatter_kws = {} if scatter_kws is None else copy.copy(scatter_kws)
+    scatter_kws["marker"] = marker
+    line_kws = {} if line_kws is None else copy.copy(line_kws)
+    plotter.plot(ax, scatter_kws, line_kws)
+    return ax
+
+
 @_deprecate_positional_args
 def regplot(
     *,

The key changes are the functions sns.ppplot() and sns.qqplot() defined in regression.py, which are modified from the function regplot(). The function regplot() is to do a scatter plot, then make a regression line on top of it. As we saw, PP-plot and QQ-plot are only the modified scatter plot. Therefore we manipulate the data in the plotter using np.sort() and np.interp() before we invoked its plot() function. At this point, these two functions can compare two empirical distributions. However, in the pairplot(), the diagonal charts are handled differently – be it a KDE plot or histogram plot. We can indeed make the PP-plot and QQ-plot a single distribution plot by plotting it against a theoretical normal distribution. The way we do it is to generate one if there is not the second distribution (y): Using the inverse normal CDF function norm.ppf() from scipy, we look for the evenly distributed values from 0 to 1 (clipped the two ends as we know they will be infinite). In case of QQ-plot, the data should be scaled according to the input data to match the mean and standard deviation. The plot will be as follows:

sns.pairplot(df_data, diag_kind="pp", kind="pp",
             plot_kws={'line_kws':{'color':'red', 'alpha':0.2}, 'scatter_kws': {'alpha': 0.2, 's':4}},
             diag_kws={'line_kws':{'color':'red', 'alpha':0.2}, 'scatter_kws': {'alpha': 0.2, 's':4}})
plt.show()

PPplot from seaborn

sns.pairplot(df_data, diag_kind="qq", kind="qq",
             plot_kws={'line_kws':{'color':'red', 'alpha':0.2}, 'scatter_kws': {'alpha': 0.2, 's':4}},
             diag_kws={'line_kws':{'color':'red', 'alpha':0.2}, 'scatter_kws': {'alpha': 0.2, 's':4}})
plt.show()

QQplot from seaborn

Of course, if you just want one PP-plot, you can use sns.ppplot() directly.

Better solution by PairGrid

The pairplot() function in seaborn is built with the PairGrid object. We can use it directly to make the pair plots but it would not be a single function to achieve that. However, there would be more flexibility. This is an example from the seaborn documentation that allows you to show different chart above and below the diagonal:

# Fancier correlogram
g = sns.PairGrid(df)
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot)
plt.show()

pairgrid from seaborn

There can be PairGrid.map() for all charts in the grid, map_diag() and map_offdiag() for on- and off-diagonal charts, or map_upper() and map_lower() for the upper- and lower-triangular charts of the grid. We can also provide a custom function to the family of PairGrid.map(). The exception of map_diag() is to expect the custom function to take the \(x\) and \(y\) data as positional arguments, while map_diag() takes only \(x\) data. With this, we can define a few functions to be used:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm

def qqplot(x, y=None, **kwargs):
    """Create Q-Q plot, for use with PairGrid.map()
    Args:
        x, y: Passed as positional arguments, will be pandas series
        kwargs: Can be "color" (tuple of 3 floats) and "label" (str), or other
                keyword arguments passed by user
    """
    # Find scatter locations
    _, xs = stats.probplot(x, fit=False)
    if y is None:
        # set it to normal distribution scaled according to x
        ys = norm.ppf(np.linspace(0,1,len(xs)+2)[1:-1])
        ys = ys * xs.std() + xs.mean()
    else:
        _, ys = stats.probplot(y, fit=False)
    # Find regression line
    p = np.polyfit(xs, ys, 1)
    xr = np.linspace(np.min(xs), np.max(xs), 2)
    yr = np.polyval(p, xr)
    # Plot options
    line_kws = kwargs.pop('line_kws', dict(c="r", ls="--", alpha=0.5))
    scatter_kws = kwargs.pop('scatter_kws', {})
    scatter_kws.update(kwargs) # all remaining keyword arg assumed for scatter
    # Plot
    plt.axline(xy1=(xr[0],yr[0]), xy2=(xr[1],yr[1]), **line_kws)
    sns.scatterplot(x=xs, y=ys, **scatter_kws)

def ppplot(x, y=None, **kwargs):
    """Create P-P plot, for use with PairGrid.map()
    Args:
        x, y: Passed as positional arguments, will be pandas series
        kwargs: Can be "color" (tuple of 3 floats) and "label" (str), or other
                keyword arguments passed by user
    """
    # Find scatter locations
    xs = np.linspace(0, 1, len(x))
    if y is None:
        # set it to normal distribution scaled according to x
        y = norm.ppf(np.linspace(0,1,len(xs)+2)[1:-1])
    ys = np.interp(np.sort(x), np.sort(y), xs)
    # Find regression line
    p = np.polyfit(xs, ys, 1)
    xr = np.linspace(np.min(xs), np.max(xs), 2)
    yr = np.polyval(p, xr)
    # Plot options
    line_kws = kwargs.pop('line_kws', dict(c="r", ls="--", alpha=0.5))
    scatter_kws = kwargs.pop('scatter_kws', {})
    scatter_kws.update(kwargs) # all remaining keyword arg assumed for scatter
    # Plot
    plt.axline(xy1=(xr[0],yr[0]), xy2=(xr[1],yr[1]), **line_kws)
    sns.scatterplot(x=xs, y=ys, **scatter_kws)

def tukeyplot(x, y, **kwargs):
    """Create Tukey mean-difference plot, for use with PairGrid.map()
    Args:
        x, y: Passed as positional arguments, will be pandas series
        kwargs: Can be "color" (tuple of 3 floats) and "label" (str), or other
                keyword arguments passed by user
    """
    # Find locations
    xs = (x+y)/2
    ys = x-y
    mean = np.mean(ys)
    sd = np.std(ys)
    # Plot options
    line_kws = kwargs.pop('line_kws', dict(ls="--", alpha=0.5))
    scatter_kws = kwargs.pop('scatter_kws', {})
    scatter_kws.update(kwargs) # all remaining keyword arg assumed for scatter
    # Plot the mean and 95% ci lines
    xl = np.linspace(np.min(xs), np.max(xs), 4)
    plt.axhline(y=mean, c="b", **line_kws)
    plt.axhline(y=mean+1.96*sd, color="r", **line_kws)
    plt.axhline(y=mean-1.96*sd, color="r", **line_kws)
    sns.scatterplot(x=xs, y=ys, **scatter_kws)

and we can produce a PP-plot with them:

g = sns.PairGrid(df)
g.map_diag(ppplot, line_kws=dict(c="r", alpha=0.3, ls="--"), scatter_kws=dict(alpha=0.5, s=5))
g.map_offdiag(ppplot, scatter_kws=dict(alpha=0.5, s=5))
g.tight_layout()
plt.show()

PPplot using PairGrid

and a QQ-plot:

g = sns.PairGrid(df)
g.map_diag(qqplot, scatter_kws=dict(alpha=0.5, s=5))
g.map_offdiag(qqplot, scatter_kws=dict(alpha=0.5, s=5))
g.tight_layout()
plt.show()

QQplot using PairGrid

and also Tukey mean-difference plot with PP-plot on diagonal, but we need to detach the shared axis for the diagonal plots due to different scale:

from itertools import product

g = sns.PairGrid(df)
g.map_offdiag(tukeyplot, scatter_kws=dict(alpha=0.5, s=5))
# Remove shared axis only for diagonal
for i,j,k in product(range(len(df.columns)), repeat=3):
    g.axes[i, i].get_shared_x_axes().remove(g.axes[j, k])
    g.axes[i, i].get_shared_y_axes().remove(g.axes[j, k])
g.map_diag(ppplot, line_kws=dict(alpha=0.3))
g.tight_layout()
plt.show()

Tukey mean-difference plot using PairGrid