Regression analysis
In statistics, regression analysis is used to model the relationship between random variables: one or more response variables (also called dependent variables, explained variables, predicted variables, or regressands), usually named <math>Y</math>, and the predictors (also called independent variables, explanatory variables, control variables, or regressors), usually named <math>X_1, \cdots, X_p</math>. If there is more than one response variable, we speak of multivariate regression, which is not covered in this article.
Regression analysis is most commonly associated with fitting a curve (function) to a set of measurement data (curve fitting), but it can have several other objectives:
- Prediction of future observations, as by curve fitting
- Determining how closely the response can be predicted by the predictor
- Assessing the relationship between the predictors
In an experiment, the variables controlled by the experimenter are generally the predictors, while the measurements they yield are the response variables. The name "dependent variable" reflects the fact that the response variable depends on the predictors, which are then called independent variables. However, the predictors can very well be statistically dependent (for example, if one takes <math>X</math> and <math>X^2</math>). The terminology "dependent" and "independent" can therefore be confusing and is best avoided.
Note that a random variable is a function rather than a variable in the usual sense.
Types of regression
The simplest type of regression models the relationship between a quantitative response variable and a quantitative predictor. The relationship between the two random variables is modelled as a linear equation. This technique is known as linear regression and can be used with a single predictor variable or with multiple predictor variables. Linear regression assumes the best estimate of the response is a linear function of some parameters (though not necessarily linear in the predictors). If the relationship is not linear in the parameters, a number of nonlinear regression techniques may be used to obtain a more accurate regression.
If the response variable can take only discrete values (for example, a Boolean or Yes/No variable), logistic regression is preferred. The outcome of this type of regression is a function which describes how the probability of a given event (e.g. probability of getting "yes") varies with the predictors.
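As a minimal sketch of this idea in Python using scikit-learn, the example below fits a logistic regression to made-up data (hours of study versus a pass/fail outcome) and shows how the fitted model returns the probability of "yes" as a function of the predictor:

```python
# Minimal logistic-regression sketch using scikit-learn.
# The study-hours data below are made up purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])  # predictor
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])                                 # Yes/No response

model = LogisticRegression().fit(hours, passed)

# The fitted model describes how the probability of "yes" varies with the predictor.
print(model.predict_proba([[2.2]])[:, 1])
```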
Predictor variables may be quantitative or qualitative (categorical). Categorical predictors are sometimes called factors. Depending on the nature of the predictors, a different regression technique is used:
- If the predictors are all quantitative, we speak of multiple regression.
- If the predictors are all qualitative, one performs analysis of variance.
- If some predictors are quantitative and some qualitative, one performs an analysis of covariance.
Although these three types are the most common, there also exist Poisson regression, supervised learning, and unit-weighted regression.
Formulation
Definitions
Let <math>(\Omega,\mathcal{A},P)</math> be a probability space and <math>(\Gamma_1, S_1),\cdots,(\Gamma_p,S_p)</math> be measure spaces. <math>\Theta\subseteq\mathbb{R}^p</math> will denote a p-dimensional parameter space. Then:
- <math>Y:(\Omega,\mathcal{A})\rightarrow(\mathbb{R},\mathcal{B}(\mathbb{R}))</math>
- <math>\forall j\in \{1,\cdots,p\}, X_j:(\Omega,\mathcal{A})\rightarrow(\Gamma_j, S_j)</math>.
The relationship between the response and the predictors is represented mathematically by a function <math>\eta</math>:
<math>\eta:\left\{ \begin{matrix} (\Gamma_1\times\cdots\Gamma_p)\times\Theta&\rightarrow&\mathbb{R}\\ (X_1,\cdots,X_p;\theta)&\mapsto&\eta(X_1,\cdots,X_p,\theta) \end{matrix} \right.</math>
We define the error <math>\varepsilon:=Y-\eta(X_1,\cdots,X_p;\theta)</math>, which means that <math>Y=\eta(X_1,\cdots,X_p;\theta)+\varepsilon</math>, or more concisely:
- <math>Y=\eta(X;\theta)+\varepsilon</math>
where <math>X:=(X_1,\cdots,X_p)</math>.
We suppose that there exists a true parameter <math>\overline{\theta}\in\Theta</math> such that <math>\mathbb{E}[Y|X]=\eta(X;\overline{\theta})</math>, which means we suppose we have chosen the model <math>\eta</math> accurately because the best prediction we can make of Y given X is <math>\eta(X;\overline{\theta})</math>. The form of the function <math>\eta</math> is known, but the true parameter <math>\overline{\theta}</math> is unknown and we will estimate it with the data at hand.
The error models the variability in the experiment. Indeed, under exactly the same conditions, the output Y of the experiment might differ slightly from experiment to experiment, because we cannot know or control all the parameters that have an influence on Y. This is why the response variable is represented mathematically by a random variable, which is essentially an unknown function. The error therefore represents the uncertainty in the modelling, i.e. the part of Y not explained by the model <math>\eta</math>.
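As a small illustration of this decomposition, the following Python sketch simulates data from <math>Y=\eta(X;\overline{\theta})+\varepsilon</math> with an assumed (made-up) true parameter and Gaussian errors; repeated experiments at the same <math>X</math> differ only through <math>\varepsilon</math>:

```python
# Simulation sketch of Y = eta(X; theta_bar) + epsilon.
# The "true" parameter theta_bar and the noise level are assumed values for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(0, 10, size=n)           # a single quantitative predictor
theta_bar = (2.0, 0.5)                   # assumed true intercept and slope
eps = rng.normal(0.0, 1.0, size=n)       # error: the part of Y not explained by the model

Y = theta_bar[0] + theta_bar[1] * X + eps
```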
Justification
Let <math>\mathcal{C}</math> be the σ-algebra generated by <math>X=(X_1,\cdots,X_p)</math>. Then <math>\mathbb{E}[Y|X]</math> is the only <math>\mathcal{C}</math>-measurable random variable <math>Y_0\in L^2(P)</math> for which <math>\mathbb{E}[(Y-Y_0)^2]</math> is minimal. Moreover, by the factorization lemma, there exists a measurable function <math>\eta:(\Gamma_1\times\cdots\times\Gamma_p)\rightarrow\mathbb{R}</math> such that <math>\mathbb{E}[Y|X]=\eta(X)</math>. In regression analysis, what we are in fact doing is supposing that we already know the form of the function <math>\eta</math> and only looking for the right coefficients. In other words, we are looking for the function <math>\eta</math>, but we already know that it lies in a certain space.
Linear regression
Linear regression is the most common case in practice because it is the easiest to compute and gives good results. Indeed, by restricting the variations of the factors to a "small enough" domain, the response variable can be approximated locally by a linear function. Note that by "linear", we mean "linear in <math>\theta</math>", not "linear in X". When we do a linear regression, we are implicitly supposing that given a set of factors <math>X=(X_1,\cdots,X_p)</math>, the best approximation of the response variable <math>Y</math> we can find is a linear combination of these factors <math>X_1,\cdots,X_p</math>. Therefore, we are also supposing that <math>\forall j\in[\![1,p]\!], (\Gamma_j,S_j)=(\mathbb{R},\mathcal{B}(\mathbb{R}))</math>. The aim of linear regression is to find a good estimator of the true coefficients <math>\overline{\theta}</math> of this linear combination.
We choose <math>\eta</math> the following way:
- <math>\eta(X,\theta)=\sum_{j=1}^p \theta^j X_j</math>
We now suppose that for each factor <math>X_j, j\in\{1,\cdots,p\}</math>, we have a sample of size <math>n\in\mathbb{N}^*</math>: <math>(X^1_j, \cdots,X^n_j)</math>, and that we have the corresponding sample of Y: <math>\vec{Y}=(Y_1,\cdots,Y_n)</math>. Then we can build a matrix <math>\mathbf{X}</math>, where each row represents an experiment:
- <math>\mathbf{X}=\left[\begin{matrix} X^1_1&\cdots&X^1_p\\ \vdots& &\vdots\\ X^n_1&\cdots&X^n_p \end{matrix}\right]</math>
Based on the sample <math>\vec{Y}=(Y_1,\cdots,Y_n)</math> and on the design matrix <math>\mathbf{X}</math>, we would like to estimate the unknown parameters <math>\overline{\theta}=(\theta^1,\cdots,\theta^p)</math> (one per factor).
Under assumptions which are met relatively often, there exists an optimal solution to the linear regression problem. These assumptions are called Gauss-Markov assumptions. See also Gauss-Markov theorem.
The Gauss-Markov assumptions
We make the following assumptions about the error vector <math>\vec{\varepsilon}=(\varepsilon_1,\cdots,\varepsilon_n)</math>:
- <math>\mathbb{E}\vec{\varepsilon}=\vec{0}</math>
- <math>\mathbb{V}\vec{\varepsilon}=\sigma^2 \mathbf{I}_n</math> (uncorrelated, but not necessarily independent) where <math>\sigma^2<+\infty</math> and <math>\mathbf{I}_n</math> is the <math>n\times n</math> identity matrix.
where <math>\mathbb{V}\vec{\varepsilon}</math> is the variance (covariance matrix) of <math>\vec{\varepsilon}</math> and <math>\mathbb{E}\vec{\varepsilon}</math> is its expectation.
Least-squares estimation of the coefficients
The linear regression problem is equivalent to an orthogonal projection: we project the response variable Y onto the subspace of linear functions generated by <math>(X_1,\cdots,X_p)</math>. Supposing the matrix <math>\mathbf{X}</math> is of full rank, it can be shown (for a proof, see least-squares estimation of linear regression coefficients) that a good estimator of the parameters <math>\overline{\theta}=(\theta^1,\cdots,\theta^p)</math> is the least-squares estimator <math>\widehat{\theta}_{LS}</math>:
<math>\widehat{\theta}_{LS}=</math> <math>(\mathbf{X}^t \mathbf{X})^{-1}\mathbf{X}^t \vec{Y}</math>
and <math>\eta(\mathbf{X};\widehat{\theta}_{LS}) = \mathbf{X}\widehat{\theta}_{LS}</math>
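As an illustrative sketch (in Python with NumPy, assuming <math>\mathbf{X}</math> has full column rank), the estimator can be computed directly from the formula above; in practice a numerically stabler routine such as numpy.linalg.lstsq or a QR decomposition is preferred:

```python
# Least-squares estimator theta_hat = (X^t X)^{-1} X^t Y (normal-equations form).
import numpy as np

def least_squares(X, Y):
    # Direct transcription of the formula; assumes X has full column rank.
    return np.linalg.inv(X.T @ X) @ X.T @ Y

# Small synthetic check with made-up coefficients:
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                                   # n = 50 experiments, p = 3 factors
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
print(least_squares(X, Y))                                     # close to (1.0, -2.0, 0.5)

# Numerically preferable alternative:
# theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
```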
Limitations and alternatives to least-squares
The least-squares estimator is extremely efficient: in fact, the Gauss-Markov theorem states that under the Gauss-Markov assumptions, of all unbiased estimators of the linear regression coefficients depending linearly on <math>\vec{Y}</math>, the least-squares estimator is the most efficient (it is the best linear unbiased estimator, or BLUE). Unfortunately, the Gauss-Markov assumptions are often not met in practice (for example, in the study of time series), and departures from these assumptions can corrupt the results quite significantly. A rather naïve illustration: consider a data set in which all points lie on a straight line except one. That single outlying observation can pull the entire regression line away from the others: the method is said to be non-robust.
Several methods exist to address this problem, the simplest of which is to assign weights to each observation (see weighted least squares). Indeed, if we know that the i-th observation is likely to be unreliable, we can downweight it. This supposes that we know which observations are flawed, which is often optimistic. Another approach is iteratively reweighted least squares, where the weights are computed iteratively. The disadvantage of this method is that this kind of estimator cannot be computed explicitly (only iteratively) and that it is much more difficult to ensure convergence, let alone accuracy. The study of such estimators has led to a branch of statistics now called robust statistics.
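A rough sketch of one possible iteratively reweighted least-squares scheme (in Python with NumPy) is shown below; the weight function used here, <math>w_i=1/\max(|r_i|,\delta)</math>, is just one simple choice among many:

```python
# Iteratively reweighted least squares (IRLS) sketch: weights are recomputed
# from the residuals at each step, so observations with large residuals
# (potential outliers) are progressively downweighted.
import numpy as np

def irls(X, Y, n_iter=20, delta=1e-4):
    theta = np.linalg.lstsq(X, Y, rcond=None)[0]          # start from ordinary least squares
    for _ in range(n_iter):
        r = Y - X @ theta                                 # current residuals
        w = 1.0 / np.maximum(np.abs(r), delta)            # downweight large residuals
        W = np.diag(w)
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y) # weighted least-squares step
    return theta
```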
Because robust estimators are more involved to compute, practitioners often overlook the Gauss-Markov assumptions and use least squares even in situations where it is ill-suited. Several alternatives exist:
- If the error term is not normal but belongs to an exponential family, one can use generalized linear models. Other techniques include the use of weighted least squares or transforming the dependent variable using the Box-Cox transformation.
- If outliers are present, the normal distribution can be replaced by a t-distribution or, alternatively, robust regression methods may be used.
- If the relationship between the response and the predictors is not linear, nonparametric regression, semiparametric regression, or nonlinear regression may be used.
The optimization problem in regression is typically solved by algorithms such as gradient descent, the Gauss-Newton algorithm, and the Levenberg-Marquardt algorithm. Probabilistic algorithms such as RANSAC can be used to find a good fit for a sample set, given a parametrized model of the curve function. For more complex, non-linear regression problems, artificial neural networks are commonly used.
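As a minimal sketch of the first of these, gradient descent can be applied to the least-squares objective (Python with NumPy); the step size below is an arbitrary choice that assumes roughly standardized predictors:

```python
# Gradient descent on the mean-squared-error objective (1/n) * ||Y - X theta||^2.
# The learning rate is an assumed value; it must be small enough for convergence.
import numpy as np

def gradient_descent(X, Y, lr=0.1, n_iter=5000):
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = -(2.0 / n) * X.T @ (Y - X @ theta)   # gradient of the objective
        theta -= lr * grad
    return theta
```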
Inference
Until now, we have not assumed any distribution for the errors. However, if we want to construct confidence intervals or perform hypothesis tests, we have to suppose normality, homoscedasticity, and uncorrelatedness, i.e.:
- <math>\vec{\varepsilon}\sim\mathcal{N}(\vec{0},\sigma^2 \mathbf{I}_n)</math>
Naturally, these assumptions imply the Gauss-Markov ones.
Confidence intervals
How much confidence can we have in the values of <math>\widehat{\theta}_{LS}</math> we estimated from the data? To answer, we first notice that the previous assumptions imply:
- <math>\vec{Y}\sim\mathcal{N}(\mathbf{X}\overline{\theta},\sigma^2 \mathbf{I}_n)</math>
Then we can derive the distribution of the least-squares estimator of the parameters.
From <math>\eta(\mathbf{X};\hat{\theta}_{LS})=\mathbf{X}\widehat{\theta}_{LS}</math> and <math>\widehat{\sigma}^2:=\frac{1}{n-p}\|\vec{Y}-\eta(\mathbf{X};\hat{\theta}_{LS})\|^2</math> (with <math>\|u\|^2=u^t u</math>), we get:
- <math>\widehat{\theta}_{LS}\sim\mathcal{N}(\overline{\theta},\sigma^2(\mathbf{X}^t \mathbf{X})^{-1}),</math>
- <math>\frac{n-p}{\sigma^2}\widehat{\sigma}^2\sim\chi^2_{n-p},</math>
- and <math>\frac{1}{\sigma^2}\|\eta(\mathbf{X};\hat{\theta}_{LS})-\mathbf{X}\overline{\theta}\|^2\sim\chi_{p}^2.</math>
For <math>1\leq j\leq p</math>, if we name <math>s_j</math> the <math>j</math>-th diagonal element of the matrix <math>(\mathbf{X}^t\mathbf{X})^{-1} </math>, a <math>1-\alpha</math> confidence interval for each <math>\theta_j</math> is therefore:
- <math>[\widehat{\theta_j}-\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}};\widehat{\theta_j}+\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}}].</math>
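A sketch of this computation in Python (using NumPy and SciPy's Student-t quantile function) might look as follows; it simply transcribes the quantities defined above:

```python
# Confidence intervals theta_j_hat +/- sigma_hat * sqrt(s_j) * t_{n-p; 1-alpha/2}.
import numpy as np
from scipy.stats import t

def confidence_intervals(X, Y, alpha=0.05):
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    theta_hat = XtX_inv @ X.T @ Y                      # least-squares estimate
    resid = Y - X @ theta_hat
    sigma_hat = np.sqrt(resid @ resid / (n - p))       # sqrt of (1/(n-p)) ||Y - X theta_hat||^2
    s = np.diag(XtX_inv)                               # s_j: j-th diagonal element of (X^t X)^{-1}
    half = sigma_hat * np.sqrt(s) * t.ppf(1 - alpha / 2, n - p)
    return np.column_stack((theta_hat - half, theta_hat + half))
```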
Hypothesis testing
In regression, we usually test the null hypothesis that one or more of the parameters are zero against the alternative that they are non-zero.
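For instance, a t-test of <math>H_0:\theta^j=0</math> uses the statistic <math>t_j=\widehat{\theta_j}/(\widehat{\sigma}\sqrt{s_j})</math> with <math>n-p</math> degrees of freedom; a short Python sketch, reusing the same quantities as in the confidence-interval computation above:

```python
# Two-sided t-tests of H0: theta_j = 0 for each coefficient.
import numpy as np
from scipy.stats import t

def t_tests(X, Y):
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    theta_hat = XtX_inv @ X.T @ Y
    resid = Y - X @ theta_hat
    sigma_hat = np.sqrt(resid @ resid / (n - p))
    t_stat = theta_hat / (sigma_hat * np.sqrt(np.diag(XtX_inv)))
    p_values = 2 * t.sf(np.abs(t_stat), n - p)         # two-sided p-values
    return t_stat, p_values
```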
Regression and Bayesian statistics
Maximum likelihood is one method of estimating the parameters of a regression model; it behaves well for large samples. However, for small amounts of data, the estimates can have high variance or bias. Bayesian methods can also be used to estimate regression models. A prior is placed over the parameters, which incorporates everything known about them. (For example, if one parameter is known to be non-negative, a non-negative distribution can be assigned to it.) A posterior distribution is then obtained for the parameter vector. Bayesian methods have the advantage of using all the available information; they are exact rather than asymptotic, and thus work well for small data sets if some contextual information is available to be used in the prior. Some practitioners use maximum a posteriori (MAP) methods, which are simpler than a full Bayesian analysis: the parameters that maximize the posterior are chosen. MAP methods are related to Occam's razor: there is a preference for simplicity among a family of regression models (curves) just as there is a preference for simplicity among competing theories.
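As one concrete, deliberately simplified instance: under a zero-mean Gaussian prior on the parameters and Gaussian errors, the MAP estimate reduces to a penalized least-squares (ridge-type) solution. The sketch below treats the ratio of noise variance to prior variance as a single assumed constant lambda:

```python
# MAP estimate under a zero-mean Gaussian prior on theta and Gaussian errors:
# theta_MAP = (X^t X + lambda * I)^{-1} X^t Y  (the ridge-regression form).
# lambda is an assumed constant standing in for the noise-to-prior variance ratio.
import numpy as np

def map_estimate(X, Y, lam=1.0):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
```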
Examples
To illustrate the goals of regression, we give the following example.
Prediction of future observations
The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).
Height (in) | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 |
Weight (lbs) | 115 | 117 | 120 | 123 | 126 | 129 | 132 | 135 | 139 | 142 | 146 | 150 | 154 | 159 | 164 |
<math>\vec{X}</math> will denote the vector containing all the measured heights (<math>\vec{X}=(58,59,60,\cdots)</math>) and <math>\vec{Y}=(115,117,120,\cdots)</math> the vector containing all the measured weights. We can suppose the errors on the measured weights are independent of each other and have constant variance, which means the Gauss-Markov assumptions hold. We can therefore use the least-squares estimator, i.e. we are looking for coefficients <math>\theta^0, \theta^1</math> and <math>\theta^2</math> satisfying as well as possible (in the sense of the least-squares estimator) the equation:
- <math>\vec{Y}=\theta^0 + \theta^1 \vec{X} + \theta^2 \vec{X}^3+\vec{\varepsilon}</math>
Geometrically, what we will be doing is an orthogonal projection of Y onto the subspace generated by the variables <math>1, X</math> and <math>X^3</math>. The matrix <math>\mathbf{X}</math> is constructed simply by putting a first column of 1's (the constant term in the model), a second column with the original values (the <math>X</math> in the model), and a third column with these values cubed (<math>X^3</math>). The realization of this matrix (i.e. for the data at hand) can be written:
<math>1</math> | <math>x</math> | <math>x^3</math> |
1 | 58 | 195112 |
1 | 59 | 205379 |
1 | 60 | 216000 |
1 | 61 | 226981 |
1 | 62 | 238328 |
1 | 63 | 250047 |
1 | 64 | 262144 |
1 | 65 | 274625 |
1 | 66 | 287496 |
1 | 67 | 300763 |
1 | 68 | 314432 |
1 | 69 | 328509 |
1 | 70 | 343000 |
1 | 71 | 357911 |
1 | 72 | 373248 |
The matrix <math>(\mathbf{X}^t \mathbf{X})^{-1}</math> (sometimes called the dispersion matrix; <math>\mathbf{X}^t \mathbf{X}</math> itself is sometimes called the information matrix) is:
<math> \left[\begin{matrix} 1.9\cdot10^3&-45&3.5\cdot 10^{-3}\\ -45&1.0&-8.1\cdot 10^{-5}\\ 3.5\cdot 10^{-3}&-8.1\cdot 10^{-5}&6.4\cdot 10^{-9} \end{matrix}\right]</math>
The vector <math>\widehat{\theta}_{LS}</math> is therefore:
<math>\widehat{\theta}_{LS}=(\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}^{t}\vec{Y}= (147,\ -2.0,\ 4.3\cdot 10^{-4})</math>
hence <math>\eta(X) = 147 - 2.0 X + 4.3\cdot 10^{-4} X^3</math>
A plot of this function shows that it lies quite close to the data set.
The confidence intervals are computed using:
- <math>[\widehat{\theta_j}-\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}};\widehat{\theta_j}+\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}}]</math>
with:
- <math>\widehat{\sigma}=0.52</math>
- <math>s_1=1.9\cdot 10^3, s_2=1.0, s_3=6.4\cdot 10^{-9}\;</math>
- <math>\alpha=5\%</math>
- <math>t_{n-p;1-\frac{\alpha}{2}}=2.2</math>
Therefore, with 95% confidence,
- <math>\theta^0\in[112 , 181]</math>
- <math>\theta^1\in[-2.8 , -1.2]</math>
- <math>\theta^2\in[3.6\cdot 10^{-4} , 4.9\cdot 10^{-4}]</math>
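The whole example can be checked numerically; the following Python sketch refits the model and recomputes the 95% confidence intervals (small differences from the rounded values quoted above are to be expected):

```python
# Refit of the height/weight example: Y = theta0 + theta1*X + theta2*X^3,
# followed by the 95% confidence intervals for the coefficients.
import numpy as np
from scipy.stats import t

heights = np.arange(58, 73, dtype=float)
weights = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                    139, 142, 146, 150, 154, 159, 164], dtype=float)

X = np.column_stack((np.ones_like(heights), heights, heights ** 3))
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
theta_hat = XtX_inv @ X.T @ weights
resid = weights - X @ theta_hat
sigma_hat = np.sqrt(resid @ resid / (n - p))
half = sigma_hat * np.sqrt(np.diag(XtX_inv)) * t.ppf(0.975, n - p)

print(theta_hat)                                   # approximately (147, -2.0, 4.3e-4)
print(np.column_stack((theta_hat - half, theta_hat + half)))
```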
See also
- Confidence interval
- Extrapolation
- Kriging
- Prediction
- Prediction interval
- Statistics
- Trend estimation
- Multivariate normal distribution
- Important publications in regression analysis
External links
- SixSigmaFirst - Intro to regression analysis, and linear regression example
- Curve Expert - Shareware to fit a curve to your data, by selecting an appropriate regression model
- Zunzun.com - Online curve and surface fitting
- Curvefit - Online ten-point demo
- Curvefit: A complete guide to nonlinear regression - Online textbook
- The R Project - Free software for statistics, including regression and graphics
- TableCurve2D and TableCurve3D by Systat - Automated regression software
- Mazoo's Learning Blog - Example of linear regression. Shows how to find the linear regression equation, variances, standard errors, coefficients of correlation and determination, and confidence interval.
- Regression of Weakly Correlated Data - How linear regression mistakes can appear when the Y-range is much smaller than the X-range