Linear regression assumes a set of coordinates $(x_i, y_i)$ fits the relation
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
in which the error terms $\epsilon_i$ should be i.i.d. normally distributed with zero mean.
In general, with one dependent variable $y$ and several independent variables $x_1, \dots, x_k$, we can define the linear regression model as
$$y = X\beta + \epsilon$$
or further make this into a general linear model (GLM) by considering an $m$-column matrix $Y$ instead of the column vector $y$, in which case $\beta$ and $\epsilon$ are also matrices of coefficients and error terms respectively.
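The solution to the linear model $y = X\beta + \epsilon$ using ordinary least squares (OLS) is
$$\hat\beta = \arg\min_\beta \lVert y - X\beta \rVert^2 = (X^T X)^{-1} X^T y$$
which is obtained by setting the derivative of the above squared norm with respect to $\beta$ to zero.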
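As a minimal sketch of the OLS solution, the following solves the normal equations $(X^T X)\beta = X^T y$ directly and compares the result with Julia's built-in least-squares solve. The data are made up for illustration (an exact linear relation without noise), and the variable names are hypothetical.

```julia
using LinearAlgebra

# Made-up data for illustration: y = 1 + 2*x1 - 3*x2 exactly, no noise
x1 = [0.0, 1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0]
y  = 1.0 .+ 2.0 .* x1 .- 3.0 .* x2

# Design matrix with a leading column of ones for the intercept
X = hcat(ones(length(y)), x1, x2)

# Normal equations: solve (X'X) beta = X'y
beta_normal = (X' * X) \ (X' * y)

# Least-squares solve on the rectangular system, which avoids forming X'X
beta_ls = X \ y

println(beta_normal)
println(beta_ls)
```

Both vectors should agree up to rounding. In practice the `X \ y` form is usually preferred because forming $X^T X$ squares the condition number of the problem.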
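The variance of the estimator is proportional to the variance $\sigma^2$ of $\epsilon$:
$$\mathrm{Var}(\hat\beta) = \sigma^2 (X^T X)^{-1}$$
This is called the variance-covariance matrix, and the standard errors (used in the t-tests below) are the square roots of its diagonal entries.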
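The variance-covariance matrix and the standard errors can be sketched numerically as follows. The data are made up, and $\sigma^2$ is replaced by its usual unbiased estimate (residual sum of squares divided by $n - p$), which is an assumption on top of the text above.

```julia
using LinearAlgebra

# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

X = hcat(ones(length(x)), x)            # n-by-p design matrix, here p = 2
n, p = size(X)

beta_hat = X \ y                        # OLS coefficients
resid = y - X * beta_hat                # residuals
sigma2_hat = sum(abs2, resid) / (n - p) # estimate of Var(epsilon)

vcov_beta = sigma2_hat * inv(X' * X)    # variance-covariance matrix of beta_hat
se = sqrt.(diag(vcov_beta))             # standard errors of the coefficients

println(se)
```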
Although we have multiple independent variables $x_1, \dots, x_k$ in the above, in practice they can be correlated with each other. For example, by setting $x_j = x^j$ for $j = 1, \dots, k$, we convert a linear regression model into a $k$-order polynomial regression model.
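A short sketch of the polynomial case: the design matrix simply holds powers of a single input, and the same least-squares solve applies. The data below are made up from an exact quadratic so that the recovered coefficients are easy to check.

```julia
# Made-up data from an exact quadratic: y = 1 - 2x + 0.5x^2
x = collect(-2.0:1.0:3.0)
y = 1.0 .- 2.0 .* x .+ 0.5 .* x .^ 2

k = 2  # polynomial order

# Columns are x^0, x^1, ..., x^k
X = [xi^j for xi in x, j in 0:k]

beta_hat = X \ y
println(beta_hat)
```

The model is still linear in the coefficients, which is why ordinary least squares applies unchanged.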
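The simplest form of the linear regression model is $y = \beta_0 + \beta_1 x + \epsilon$, in which we only have the coefficients $\beta_0, \beta_1$ and each data point is an ordered pair $(x_i, y_i)$. The solution would be
$$\hat\beta_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$$
Such a solution is optimal in the sense that it provides the BLUE (best linear unbiased estimator) of the form $\hat y = \hat\beta_0 + \hat\beta_1 x$. We consider only this type of linear regression model below, with the notation
$$\bar x = \frac{1}{n}\sum_i x_i, \qquad \bar y = \frac{1}{n}\sum_i y_i, \qquad \hat y_i = \hat\beta_0 + \hat\beta_1 x_i$$
where $n$ is the number of data points.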
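For the one-variable model, the least-squares slope is the ratio of the centered cross-product sum to the centered sum of squares of the input, and the intercept follows from the means. A minimal sketch on made-up data, cross-checked against the matrix solve:

```julia
using Statistics

# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

xbar, ybar = mean(x), mean(y)

# Closed-form least-squares estimates for y = intercept + slope * x
slope = sum((x .- xbar) .* (y .- ybar)) / sum((x .- xbar) .^ 2)
intercept = ybar - slope * xbar

# Cross-check with the general matrix solution
beta_matrix = hcat(ones(length(x)), x) \ y

println((intercept, slope))
println(beta_matrix)
```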
Now the question is how good this estimator is, given a particular set of inputs. It turns out there are different ways to answer this.
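(1) Is there a linear relationship between $x$ and $y$?
If so, it should be that $\beta_1 \ne 0$. In general we can ask if $\beta_1 = b$ for some known $b$ derived from some hypothesis. In this case the t test is handy, which uses the test statistic [1]
$$t = \frac{\hat\beta_1 - b}{SE(\hat\beta_1)}, \qquad SE(\hat\beta_1) = \sqrt{\frac{\frac{1}{n-2}\sum_i (y_i - \hat y_i)^2}{\sum_i (x_i - \bar x)^2}}$$
which should follow the t-distribution with $n-2$ df under the null hypothesis (i.e. $\beta_1 = b$). At significance level $\alpha$, we can test whether $t$ falls between $\pm t_{\alpha/2,\,n-2}$.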
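The slope test can be sketched as below on made-up data, testing the null hypothesis that the slope is zero. To stay within the standard library, the two-sided 5% critical value for 3 df (3.182) is hard-coded from a t table; that choice, like the data, is an assumption for illustration only.

```julia
using Statistics

# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = length(x)

xbar, ybar = mean(x), mean(y)
sxx = sum((x .- xbar) .^ 2)
slope = sum((x .- xbar) .* (y .- ybar)) / sxx
intercept = ybar - slope * xbar

# Residual variance estimate with n - 2 degrees of freedom
resid = y .- (intercept .+ slope .* x)
sigma2_hat = sum(abs2, resid) / (n - 2)

# t statistic for the null hypothesis slope == b, here with b = 0
b = 0.0
se_slope = sqrt(sigma2_hat / sxx)
t_stat = (slope - b) / se_slope

# Two-sided 5% critical value for n - 2 = 3 df, copied from a t table
t_crit = 3.182
reject_null = abs(t_stat) > t_crit

println("t = ", t_stat, ", reject null: ", reject_null)
```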
(2) Is the intercept significant?
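Similarly, we can use the t test with the statistic [2]
$$t = \frac{\hat\beta_0}{SE(\hat\beta_0)}, \qquad SE(\hat\beta_0) = \sqrt{\frac{\sum_i (y_i - \hat y_i)^2}{n-2}\left(\frac{1}{n} + \frac{\bar x^2}{\sum_i (x_i - \bar x)^2}\right)}$$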
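A side note: given the confidence level $1-\alpha$ and the SE of $\hat\beta_0$ and $\hat\beta_1$, we can use the t-distribution to derive the confidence interval of each coefficient and of the model output:
$$\hat\beta_j \pm t_{\alpha/2,\,n-2}\, SE(\hat\beta_j), \qquad \hat y \pm t_{\alpha/2,\,n-2}\, SE(\hat y)$$
Because the SE decreases as $n$ increases, we can apply this in reverse to check whether the sample size $n$ is large enough for a given error margin of $\hat\beta_1$ and confidence level $1-\alpha$ (though this is not easy to derive, because the SE also depends on the variation of the inputs $x_i$).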
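A sketch of the coefficient confidence intervals on made-up data, at the 95% level. As an assumption for illustration, the critical value 3.182 (two-sided 5%, 3 df) is copied from a t table rather than computed.

```julia
using Statistics

# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = length(x)

xbar, ybar = mean(x), mean(y)
sxx = sum((x .- xbar) .^ 2)
slope = sum((x .- xbar) .* (y .- ybar)) / sxx
intercept = ybar - slope * xbar

resid = y .- (intercept .+ slope .* x)
sigma2_hat = sum(abs2, resid) / (n - 2)

# Standard errors of the two coefficients
se_slope = sqrt(sigma2_hat / sxx)
se_intercept = sqrt(sigma2_hat * (1 / n + xbar^2 / sxx))

# Two-sided 5% critical value for 3 df, copied from a t table
t_crit = 3.182

ci_slope = (slope - t_crit * se_slope, slope + t_crit * se_slope)
ci_intercept = (intercept - t_crit * se_intercept, intercept + t_crit * se_intercept)

println("slope CI: ", ci_slope)
println("intercept CI: ", ci_intercept)
```

The half-width `t_crit * se_slope` is the error margin; it shrinks as more points are added or as the inputs spread out.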
(3) How accurately does the regression fit the data?
This is asking if the distribution of the data is really what the regression describes. To check if a model fits the data, we use ANOVA. We define
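- Total sum of squares: $TSS = \sum_i (y_i - \bar y)^2$
- Regression sum of squares: $RSS = \sum_i (\hat y_i - \bar y)^2$
- Error sum of squares: $ESS = \sum_i (y_i - \hat y_i)^2$
and then $TSS = RSS + ESS$. The RSS, with df of 1, explains the variability of $y$ due to the variability of the input $x$. The ESS, with df of $n-2$, is the remaining variability not explained by the regression model; divided by its df, it is the estimator for the variance of $\epsilon$. The total variability TSS has a df of $n-1$.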
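The decomposition of the total variability into a regression part and an error part can be checked numerically. A minimal sketch on made-up data, using this section's naming (RSS for the regression sum of squares, ESS for the error sum of squares):

```julia
using Statistics

# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

xbar, ybar = mean(x), mean(y)
slope = sum((x .- xbar) .* (y .- ybar)) / sum((x .- xbar) .^ 2)
intercept = ybar - slope * xbar
yhat = intercept .+ slope .* x

tss = sum((y .- ybar) .^ 2)     # total sum of squares
rss = sum((yhat .- ybar) .^ 2)  # regression sum of squares
ess = sum((y .- yhat) .^ 2)     # error sum of squares

println((tss, rss + ess))
```

The identity holds only for the least-squares fit with an intercept; for an arbitrary line the cross term does not vanish.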
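We can then use the F-test [3] with statistic
$$F = \frac{RSS/1}{ESS/(n-2)}$$
against the F-distribution with df $(1, n-2)$. At significance level $\alpha$, the linear model is a good fit if $F > F_{\alpha;\,1,\,n-2}$, or equivalently if the $p$-value of the linear model, given by
$$p = \Pr(F_{1,\,n-2} > F)$$
is below $\alpha$.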
A related measure is the coefficient of determination $R^2$, defined as
$$R^2 = \frac{RSS}{TSS} = 1 - \frac{ESS}{TSS}$$
which accounts for the proportion of variability in the data explained by the regression model. The fit is good if $R^2 \approx 1$.
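The F statistic and $R^2$ are both simple ratios of the sums of squares. The sketch below, on made-up data, computes them and also checks two identities that hold for the one-variable model: $F$ equals the square of the slope's t statistic, and $R^2 = F/(F + n - 2)$.

```julia
using Statistics

# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = length(x)

xbar, ybar = mean(x), mean(y)
sxx = sum((x .- xbar) .^ 2)
slope = sum((x .- xbar) .* (y .- ybar)) / sxx
intercept = ybar - slope * xbar
yhat = intercept .+ slope .* x

tss = sum((y .- ybar) .^ 2)     # total sum of squares
rss = sum((yhat .- ybar) .^ 2)  # regression sum of squares
ess = sum((y .- yhat) .^ 2)     # error sum of squares

f_stat = (rss / 1) / (ess / (n - 2))  # F statistic with df (1, n - 2)
r2 = rss / tss                        # coefficient of determination

# t statistic of the slope for the null hypothesis slope == 0
t_stat = slope / sqrt((ess / (n - 2)) / sxx)

println("F = ", f_stat, ", t^2 = ", t_stat^2)
println("R^2 = ", r2, ", F/(F+n-2) = ", f_stat / (f_stat + n - 2))
```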
Interpreting Julia’s GLM package
Most statistical packages report the metrics mentioned above. As an example, we look at the output of the GLM package in Julia.
Consider the Phillips curve, namely, the empirical model that unemployment and inflation are correlated (Milton Friedman’s interpretation). To show how accurate this hypothesis is, we check the linear regression model between the inflation rate (based on CPI, YoY) and the unemployment rate. Trading Economics has these data since 1981.
After downloading the data in CSV format, we can read them into data frames, join them and plot:
```julia
using CSV
using DataFrames
using Gadfly

# Read the two series, supplying column names and skipping the header and footer rows
cpi = CSV.read("hkcpiy.csv", header=["Date","Inf"], footerskip=2, datarow=2, allowmissing=:none)
ue = CSV.read("hkuerate.csv", header=["Date","UE"], footerskip=2, datarow=2, allowmissing=:none)

# Keep only the dates present in both series
phillips = join(cpi, ue, on=:Date, kind=:inner)

plot(phillips, x=:UE, y=:Inf, Geom.point,
     Guide.XLabel("Unemployment"), Guide.YLabel("Inflation"),
     Guide.Title("Inflation vs Unemployment from " * string(minimum(cpi[:Date])) * " to " * string(maximum(cpi[:Date]))))
```
Perform linear regression (ordinary least squares method) against the data, and plot the regression result:
```julia
using GLM

# Fit Inf = intercept + slope * UE by ordinary least squares
ols = lm(@formula(Inf ~ UE), phillips)
intercept = coef(ols)[1]
slope = coef(ols)[2]

# Evaluate the fitted line on a grid of unemployment rates
xx = collect(0:0.1:10)
yy = intercept .+ xx .* slope

plot(layer(phillips, x=:UE, y=:Inf, Geom.point),
     layer(x=xx, y=yy, Geom.line),
     Guide.XLabel("Unemployment"), Guide.YLabel("Inflation"),
     Guide.Title("Inflation vs Unemployment"))
```
The function call `lm()` will print the following into the console:
```
Formula: Inf ~ 1 + UE

Coefficients:
             Estimate  Std.Error  t value  Pr(>|t|)
(Intercept)    11.961   0.332921  35.9274    <1e-99
UE           -2.05903  0.0822117 -25.0454    <1e-85
```
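The estimate column above gives the coefficients of the constant term and the unemployment rate variable (i.e., $\hat\beta_0$ and $\hat\beta_1$). It suggests that inflation and unemployment are related as $\text{Inf} = 11.961 - 2.059\,\text{UE}$. The std error column gives the numerical values of $SE(\hat\beta_0)$ and $SE(\hat\beta_1)$. The t value column is for the null hypothesis $\beta_j = 0$ and represents the test statistic $t = \hat\beta_j / SE(\hat\beta_j)$. Finally, the probability column shows the $p$-value corresponding to the t value, i.e. the probability of seeing a $|t|$ at least this large if the coefficient were zero. Such small values are very strong evidence of a linear relationship between the two.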
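As a quick sanity check on how the columns relate, the t values can be recomputed from the estimates and standard errors. The numbers below are copied from the table printed by `lm()`; small differences are expected because the printed values are rounded.

```julia
# Values copied from the coefficient table printed by lm()
estimate  = [11.961, -2.05903]
std_error = [0.332921, 0.0822117]
t_printed = [35.9274, -25.0454]

# t value = estimate / standard error, for the null hypothesis coefficient == 0
t_recomputed = estimate ./ std_error

println(t_recomputed)
println(maximum(abs.(t_recomputed .- t_printed)))
```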
$R^2$ is given by `r2(ols)`, which gives the result 0.5893907142373523. From this we see there is quite significant noise that is not covered by the model: about 41% of the variability in inflation is left unexplained.
In this exercise, we are using 439 data points. If we limit the data to, say, the last 10 years, the GLM result will give a much weaker $R^2$, a greater std error, a smaller t value, and a larger $p$-value, even though the correlation is still established.
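The reported numbers are consistent with each other. For a one-variable regression the F statistic is the square of the slope's t value, and since $R^2 = RSS/(RSS + ESS)$ with $F = (n-2)\,RSS/ESS$, we get $R^2 = F/(F + n - 2)$. Assuming all $n = 439$ points entered the fit, the slope's t value of $-25.0454$ gives

$$F = t^2 \approx 627.27, \qquad R^2 = \frac{F}{F + n - 2} \approx \frac{627.27}{627.27 + 437} \approx 0.5894$$

which agrees with the value returned by `r2(ols)`.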