Regression analysis

In statistics, regression analysis is used to model the relationship between random variables: one or more response variables (also called dependent variables, explained variables, predicted variables, or regressands), usually named <math>Y</math>, and the predictors (also called independent variables, explanatory variables, control variables, or regressors), usually named <math>X_1, ..., X_p</math>. If there is more than one response variable, we speak of multivariate regression, which is not covered in this article.

Regression analysis is most commonly associated with fitting a curve (a function) to a set of measurement data (curve fitting), but it can have several other objectives:

  • Prediction of future observations, as by curve fitting
  • Determining how closely the response can be predicted by the predictor
  • Assessing the relationship between the predictors

In an experiment, the variables controlled by the experimenter are generally the predictors, while the measurements they yield are the response variables. The name dependent variable arose because the response variable depends on the predictors, which are then called independent variables. However, the predictors can very well be statistically dependent (for example, if one takes X and <math>X^2</math>). Therefore, the terminology "dependent" and "independent" can be confusing and should be avoided.

Note that a random variable is a function rather than a variable in the usual sense.

Types of regression

The simplest type of regression models the relationship between a quantitative response variable and a quantitative predictor. The relationship between the two random variables is modelled by a linear equation. This technique is known as linear regression and can be used with a single predictor variable or with multiple predictor variables. Linear regression assumes the best estimate of the response is a linear function of some parameters (though not necessarily linear in the predictors). If the relationship is not linear in the parameters, a number of nonlinear regression techniques may be used to obtain a more accurate fit.
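
For example, a model such as <math>\eta(x;\theta)=\theta^1+\theta^2 x+\theta^3 x^3</math> is nonlinear in the predictor x but linear in the parameters, so it is still fitted by linear regression. A minimal sketch in Python with NumPy (the toy data below are made up purely for illustration):

    import numpy as np

    # A model that is nonlinear in the predictor x but linear in the
    # parameters theta = (theta1, theta2, theta3):
    #     eta(x; theta) = theta1 + theta2 * x + theta3 * x**3
    def eta(x, theta):
        return theta[0] + theta[1] * x + theta[2] * x**3

    # Fitting it is still a *linear* regression: build a design matrix whose
    # columns are 1, x and x**3, and solve a linear least-squares problem.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 9.2, 20.5, 40.1])      # toy data, for illustration only
    X = np.column_stack([np.ones_like(x), x, x**3])
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)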

If the response variable can take only discrete values (for example, a Boolean or Yes/No variable), logistic regression is preferred. The outcome of this type of regression is a function which describes how the probability of a given event (e.g. probability of getting "yes") varies with the predictors.
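
A minimal sketch of the kind of function logistic regression produces, in Python with NumPy (the coefficients below are hypothetical, chosen only for illustration):

    import numpy as np

    # Logistic regression models the probability of the event "yes" as a
    # logistic (sigmoid) function of a linear combination of the predictors.
    def prob_yes(x, theta):
        linear_part = theta[0] + np.dot(theta[1:], x)
        return 1.0 / (1.0 + np.exp(-linear_part))

    # Hypothetical coefficients for a model with two predictors.
    theta = np.array([-1.0, 0.8, 0.3])
    print(prob_yes(np.array([2.0, 1.0]), theta))   # a probability between 0 and 1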

Predictor variables may be quantitative or qualitative (categorical). Categorical predictors are sometimes called factors. Depending on the nature of the predictors, different regression techniques are used.

Although these are the most common types, there also exist Poisson regression, supervised learning, and unit-weighted regression.

Formulation

Definitions

Let <math>(\Omega,\mathcal{A},P)</math> be a probability space and <math>(\Gamma_1, S_1),\cdots,(\Gamma_p,S_p)</math> be measure spaces. <math>\Theta\subseteq\mathbb{R}^p</math> will denote a p-dimensional parameter space. Then:

  • <math>Y:(\Omega,\mathcal{A})\rightarrow(\mathbb{R},\mathcal{B}(\mathbb{R}))</math>
  • <math>\forall j\in \{1,\cdots,p\}, X_j:(\Omega,\mathcal{A})\rightarrow(\Gamma_j, S_j)</math>.

The relationship between the response and the predictors is represented mathematically by a function <math>\eta</math>:

<math>\eta:\left\{ \begin{matrix} (\Gamma_1\times\cdots\times\Gamma_p)\times\Theta&\rightarrow&\mathbb{R}\\ (X_1,\cdots,X_p;\theta)&\mapsto&\eta(X_1,\cdots,X_p;\theta) \end{matrix} \right.</math>

We define the error <math>\varepsilon:=Y-\eta(X_1,\cdots,X_p;\theta)</math>, which means that <math>Y=\eta(X_1,\cdots,X_p;\theta)+\varepsilon</math> or more concisely:

<math>Y=\eta(X;\theta)+\varepsilon</math>

where <math>X:=(X_1,\cdots,X_p)</math>.
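
As a purely illustrative sketch of this decomposition in Python with NumPy (the model, true parameter, and noise level below are all hypothetical choices):

    import numpy as np

    rng = np.random.default_rng(0)

    # Choose a "true" model eta and a true parameter theta_bar (both hypothetical).
    theta_bar = np.array([1.0, 2.0])
    def eta(x, theta):
        return theta[0] + theta[1] * x

    # The observed response is the model value plus an error term epsilon,
    # which collects everything the model does not explain.
    x = rng.uniform(0.0, 10.0, size=100)
    epsilon = rng.normal(0.0, 0.5, size=100)
    y = eta(x, theta_bar) + epsilon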

We suppose that there exists a true parameter <math>\overline{\theta}\in\Theta</math> such that <math>\mathbb{E}[Y|X]=\eta(X;\overline{\theta})</math>, which means we suppose we have chosen the model <math>\eta</math> accurately because the best prediction we can make of Y given X is <math>\eta(X;\overline{\theta})</math>. The form of the function <math>\eta</math> is known, but the true parameter <math>\overline{\theta}</math> is unknown and we will estimate it with the data at hand.

The error models the variability in the experiment. Indeed, under exactly the same conditions, the output Y of the experiment might differ slightly from experiment to experiment, because we cannot know or control all the parameters that have an influence on Y. This is why the response variable is represented mathematically by a random variable, which is essentially an unknown function. The error therefore represents the uncertainty we have in the modelling, i.e. the part of Y not explained by the model <math>\eta</math>.

Justification

Let <math>\mathcal{C}</math> be the σ-algebra generated by <math>X=(X_1,\cdots,X_p)</math>. Then <math>\mathbb{E}[Y|X]</math> is the (almost surely) unique <math>\mathcal{C}</math>-measurable random variable <math>Y_0\in L^2(P)</math> for which <math>\mathbb{E}[(Y-Y_0)^2]</math> is minimal. Moreover, by the factorization lemma, there exists a measurable function <math>\eta:(\Gamma_1\times\cdots\times\Gamma_p)\rightarrow\mathbb{R}</math> such that <math>\mathbb{E}[Y|X]=\eta(X)</math>. In regression analysis, we are in fact supposing that we already know the form of the function <math>\eta</math> and are only looking for the right coefficients. In other words, we are looking for the function <math>\eta</math>, but we already know that it lies in a certain space.

Linear regression

Linear regression is the most common case in practice because it is the easiest to compute and often gives good results. Indeed, by restricting the variations of the factors to a small enough domain, the response variable can be approximated locally by a linear function. Note that by "linear" we mean "linear in <math>\theta</math>", not "linear in X". When we perform a linear regression, we are implicitly supposing that, given a set of factors <math>X=(X_1,\cdots,X_p)</math>, the best approximation of the response variable <math>Y</math> we can find is a linear combination of these factors. Therefore, we are also supposing that <math>\forall j\in[\![1,p]\!], (\Gamma_j,S_j)=(\mathbb{R},\mathcal{B}(\mathbb{R}))</math>. The aim of linear regression is to find a good estimator of the true coefficients <math>\overline{\theta}</math> of this linear combination.

We choose <math>\eta</math> the following way:

<math>\eta(X,\theta)=\sum_{j=1}^p \theta^j X_j</math>

We now suppose that for each factor <math>X_j, j\in\{1,\cdots,p\}</math>, we have a sample of size <math>n\in\mathbb{N}^*</math>: <math>(X^1_j, \cdots,X^n_j)</math>, and that we have the corresponding sample of Y: <math>\vec{Y}=(Y_1,\cdots,Y_n)</math>. Then we can build a matrix <math>\mathbf{X}</math> where each row represents an experiment:

<math>\mathbf{X}=\left[\begin{matrix}X_1^1&\cdots&X^1_p\\\vdots&&\vdots\\X^n_1&\cdots&X^n_p\end{matrix}\right]</math>

This is a matrix of random variables, often called the design matrix (especially for experimental designs). Each column represents a factor and each row a trial. As we have n trials and p factors, it is an <math>n\times p</math> matrix. We also have a corresponding error vector (of size n): <math>\vec{\varepsilon}=\vec{Y}-\eta(\mathbf{X};\overline{\theta})</math>.
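
A small sketch of how such a design matrix can be assembled in Python with NumPy (the factor samples below are hypothetical):

    import numpy as np

    # n trials of p factors: each row of the design matrix is one experiment,
    # each column one factor.  The samples below are hypothetical.
    samples = {
        "X1": [1.0, 2.0, 3.0, 4.0],
        "X2": [0.5, 0.1, 0.9, 0.3],
    }
    X = np.column_stack([samples["X1"], samples["X2"]])   # shape (n, p) = (4, 2)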

Based on the sample <math>\vec{Y}=(Y_1,\cdots,Y_n)</math> and on the design matrix <math>\mathbf{X}</math>, we would like to estimate the unknown parameters <math>\overline{\theta}=(\theta^1,\cdots,\theta^p)</math> (one per factor).

Under assumptions which are met relatively often, there exists an optimal solution to the linear regression problem. These assumptions are called Gauss-Markov assumptions. See also Gauss-Markov theorem.

The Gauss-Markov assumptions

We make the following assumptions:

  • <math>\mathbb{E}\vec{\varepsilon}=\vec{0}</math>
  • <math>\mathbb{V}\vec{\varepsilon}=\sigma^2 \mathbf{I}_n</math> (uncorrelated, but not necessarily independent) where <math>\sigma^2<+\infty</math> and <math>\mathbf{I}_n</math> is the <math>n\times n</math> identity matrix.

Here <math>\mathbb{V}\vec{\varepsilon}</math> denotes the covariance matrix of <math>\vec{\varepsilon}</math> and <math>\mathbb{E}\vec{\varepsilon}</math> its expectation.

Least-squares estimation of the coefficients

The linear regression problem is equivalent to an orthogonal projection: we project the response variable Y onto the subspace of linear functions generated by <math>(X_1,\cdots,X_p)</math>. Supposing the matrix <math>\mathbf{X}</math> has full rank, it can be shown (for a proof, see least-squares estimation of linear regression coefficients) that a good estimator of the parameters <math>\overline{\theta}=(\theta^1,\cdots,\theta^p)</math> is the least-squares estimator <math>\widehat{\theta}_{LS}</math>:

<math>\widehat{\theta}_{LS}=(\mathbf{X}^t \mathbf{X})^{-1}\mathbf{X}^t \vec{Y}</math>

and <math>\eta(\mathbf{X};\widehat{\theta}_{LS}) = \mathbf{X}\widehat{\theta}_{LS}</math>
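
A sketch of this computation in Python with NumPy; in practice one solves the normal equations (or calls a least-squares routine) rather than forming the inverse explicitly:

    import numpy as np

    def least_squares(X, y):
        """Least-squares estimator theta_hat = (X^t X)^{-1} X^t y.

        Solving the normal equations (or calling np.linalg.lstsq) is
        numerically preferable to computing the inverse explicitly.
        """
        return np.linalg.solve(X.T @ X, X.T @ y)

    # Fitted values eta(X; theta_hat) = X @ theta_hat:
    # y_hat = X @ least_squares(X, y)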

Limitations and alternatives to least-squares

The least-squares estimator is very efficient: in fact, the Gauss-Markov theorem states that, under the Gauss-Markov assumptions, among all unbiased estimators of the linear regression coefficients that depend linearly on <math>\vec{Y}</math>, the least-squares estimator is the most efficient (the best linear unbiased estimator, or BLUE). Unfortunately, the Gauss-Markov assumptions are often not met in practice (for example, in the study of time series), and departures from these assumptions can corrupt the results quite significantly. A rather naïve illustration:

Suppose all observations lie on a straight line except one. That single outlying observation pulls the fitted regression line away from the others and distorts the entire fit: the least-squares method is said to be non-robust.

Several methods exist to address this problem, the simplest of which is to assign a weight to each observation (see weighted least squares): if we know that the i-th observation is likely to be unreliable, we downweight it. This supposes that we know in advance which observations are flawed, which is often optimistic. Another approach is iteratively reweighted least squares, where the weights are computed iteratively. The disadvantage is that this kind of estimator cannot be computed explicitly (only iteratively) and that it is much more difficult to ensure convergence, let alone accuracy. The study of such estimators has led to a branch of statistics now called robust statistics.
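
A sketch of weighted least squares in Python with NumPy, assuming the weights are already known (how to choose them is precisely the difficult part):

    import numpy as np

    def weighted_least_squares(X, y, w):
        """Weighted least squares: minimise sum_i w_i * (y_i - x_i . theta)^2.

        Equivalent to ordinary least squares on rows rescaled by sqrt(w_i),
        so unreliable observations (small w_i) count less in the fit.
        """
        sw = np.sqrt(w)
        Xw = X * sw[:, None]
        yw = y * sw
        theta_hat, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
        return theta_hat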

Because robust estimators are more difficult to handle, practitioners often overlook the Gauss-Markov assumptions and use least squares even in situations where it may be ill-suited.

The optimization problem in regression is typically solved by algorithms such as gradient descent, the Gauss-Newton algorithm, and the Levenberg-Marquardt algorithm. Probabilistic algorithms such as RANSAC can be used to find a good fit for a sample set, given a parametrized model of the curve function. For more complex, nonlinear regression problems, artificial neural networks are also commonly used.
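
As an illustration of the iterative approach, a minimal gradient-descent sketch for the least-squares objective in Python with NumPy (the step size and iteration count are arbitrary choices, not tuned values):

    import numpy as np

    def gradient_descent_ls(X, y, step=1e-3, n_iter=10_000):
        """Minimise ||y - X theta||^2 by gradient descent.

        The gradient with respect to theta is -2 X^t (y - X theta).
        For linear least squares this is only an illustration: the
        closed-form solution above is usually preferable.
        """
        theta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            grad = -2.0 * X.T @ (y - X @ theta)
            theta -= step * grad
        return theta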

Inference

Until now, we have not assumed any distribution for the errors. However, if we want to construct confidence intervals or perform hypothesis tests, we have to suppose normality, homoscedasticity, and uncorrelatedness of the errors, i.e.:

<math>\vec{\varepsilon}\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I}_n)</math>

Naturally, these assumptions imply the Gauss-Markov ones.

Confidence intervals

How much confidence can we have in the values of <math>\widehat{\theta}_{LS}</math> we estimated from the data? To answer, we first notice that the previous assumptions imply:

<math>\vec{Y}\sim\mathcal{N}(\mathbf{X}\overline{\theta},\sigma^2 \mathbf{I}_n)</math>

Then we can derive the distribution of the least-squares estimator of the parameters.

From <math>\eta(\mathbf{X};\hat{\theta}_{LS})=\mathbf{X}\widehat{\theta}_{LS}</math> and <math>\widehat{\sigma}^2:=\frac{1}{n-p}\|\vec{Y}-\eta(\mathbf{X};\hat{\theta}_{LS})\|^2</math> (with <math>\|u\|^2=u^t u</math>), we get:

<math>\widehat{\theta}_{LS}\sim\mathcal{N}(\overline{\theta},\sigma^2(\mathbf{X}^t \mathbf{X})^{-1}),</math>
<math>\frac{n-p}{\sigma^2}\widehat{\sigma}^2\sim\chi^2_{n-p},</math>
and <math>\frac{1}{\sigma^2}\|\vec{Y}-\eta(\mathbf{X};\hat{\theta}_{LS})\|^2=\frac{n-p}{\sigma^2}\widehat{\sigma}^2\sim\chi_{n-p}^2,</math> with <math>\widehat{\theta}_{LS}</math> and <math>\widehat{\sigma}^2</math> independent.

For <math>1\leq j\leq p</math>, if we denote by <math>s_j</math> the <math>j</math>-th diagonal element of the matrix <math>(\mathbf{X}^t\mathbf{X})^{-1}</math>, a <math>1-\alpha</math> confidence interval for each <math>\theta_j</math> is therefore:

<math>[\widehat{\theta_j}-\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}};\widehat{\theta_j}+\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}}].</math>
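
A sketch of how these intervals can be computed in Python with NumPy and SciPy (assuming the Gaussian error model above):

    import numpy as np
    from scipy import stats

    def confidence_intervals(X, y, alpha=0.05):
        """1 - alpha confidence intervals for the coefficients,
        assuming Gaussian errors as above."""
        n, p = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        theta_hat = XtX_inv @ X.T @ y
        residuals = y - X @ theta_hat
        sigma2_hat = residuals @ residuals / (n - p)
        s = np.diag(XtX_inv)                        # the s_j of the formula above
        t = stats.t.ppf(1.0 - alpha / 2.0, df=n - p)
        half_width = t * np.sqrt(sigma2_hat * s)
        return theta_hat - half_width, theta_hat + half_width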

Hypothesis testing

In regression, we usually test the null hypothesis that one or more of the parameters are zero against the alternative that they are not.

Regression and Bayesian statistics

Maximum likelihood is one method of estimating the parameters of a regression model; it behaves well for large samples. However, for small amounts of data, the estimates can have high variance or bias. Bayesian methods can also be used to estimate regression models. A prior distribution is placed over the parameters, incorporating everything known about them. (For example, if one parameter is known to be non-negative, a distribution supported on the non-negative reals can be assigned to it.) A posterior distribution is then obtained for the parameter vector. Bayesian methods have the advantage that they use all the information available; they are exact rather than asymptotic, and thus can work well for small data sets when some contextual information is available to be used in the prior. Some practitioners use maximum a posteriori (MAP) methods, which are simpler than a full Bayesian analysis: the parameters are chosen to maximize the posterior. MAP methods are related to Occam's razor: there is a preference for simplicity among a family of regression models (curves) just as there is a preference for simplicity among competing theories.
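
As one concrete instance of the MAP idea: with Gaussian errors and an independent zero-mean Gaussian prior on the coefficients, the MAP estimate coincides with ridge regression. A sketch in Python with NumPy (the noise and prior variances below are hypothetical):

    import numpy as np

    def map_gaussian_prior(X, y, sigma2=1.0, tau2=10.0):
        """MAP estimate under y ~ N(X theta, sigma2 I) and theta ~ N(0, tau2 I).

        Maximising the posterior is equivalent to ridge regression with
        penalty lambda = sigma2 / tau2:
            theta_MAP = (X^t X + lambda I)^{-1} X^t y
        """
        lam = sigma2 / tau2
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)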

Examples

To illustrate the various goals of regression, we will give three examples.

Prediction of future observations

The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).

Height (in) 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
Weight (lbs) 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164

We would like to see how the weight of these women depends on their height. We are therefore looking for a function <math>\eta</math> such that <math>Y=\eta(X)+\varepsilon</math>, where Y is the weight of the women and X their height. Intuitively, if the women's body proportions and density are roughly constant, their weight should depend on the cube of their height. A plot of the data set supports this supposition.

<math>\vec{X}</math> will denote the vector containing all the measured heights (<math>\vec{X}=(58,59,60,\cdots)</math>) and <math>\vec{Y}=(115,117,120,\cdots)</math> the vector containing all the measured weights. We can suppose that the errors are uncorrelated, with zero mean and constant variance, so that the Gauss-Markov assumptions hold. We can therefore use the least-squares estimator, i.e. we are looking for coefficients <math>\theta^0, \theta^1</math> and <math>\theta^2</math> satisfying as well as possible (in the sense of the least-squares criterion) the equation:

<math>\vec{Y}=\theta^0 + \theta^1 \vec{X} + \theta^2 \vec{X}^3+\vec{\varepsilon}</math>

Geometrically, what we will be doing is an orthogonal projection of Y onto the subspace generated by the variables <math>1, X</math> and <math>X^3</math>. The matrix <math>\mathbf{X}</math> is constructed simply by putting a first column of 1's (the constant term in the model), a column with the original values (the X in the model), and a third column with these values cubed (<math>X^3</math>). The realization of this matrix (i.e. for the data at hand) is:

<math>1</math> <math>x</math> <math>x^3</math>
1 58 195112
1 59 205379
1 60 216000
1 61 226981
1 62 238328
1 63 250047
1 64 262144
1 65 274625
1 66 287496
1 67 300763
1 68 314432
1 69 328509
1 70 343000
1 71 357911
1 72 373248

The matrix <math>(\mathbf{X}^t \mathbf{X})^{-1}</math> (sometimes called the dispersion matrix; its inverse <math>\mathbf{X}^t \mathbf{X}</math> is the information matrix) is:

<math> \left[\begin{matrix} 1.9\cdot10^3&-45&3.5\cdot 10^{-3}\\ -45&1.0&-8.1\cdot 10^{-5}\\ 3.5\cdot 10^{-3}&-8.1\cdot 10^{-5}&6.4\cdot 10^{-9} \end{matrix}\right]</math>

Vector <math>\widehat{\theta}_{LS}</math> is therefore:

<math>\widehat{\theta}_{LS}=(\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}^{t}\vec{Y}= (147, -2.0, 4.3\cdot 10^{-4})</math>

hence <math>\eta(X) = 147 - 2.0 X + 4.3\cdot 10^{-4} X^3</math>

A plot of this function shows that it follows the data set quite closely.

The confidence intervals are computed using:

<math>[\widehat{\theta_j}-\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}};\widehat{\theta_j}+\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}}]</math>

with:

<math>\widehat{\sigma}=0.52</math>
<math>s_1=1.\cdot 10^3, s_2=1.0, s_3=6.4\cdot 10^{-9}\;</math>
<math>\alpha=5\%</math>
<math>t_{n-p;1-\frac{\alpha}{2}}=2.2</math>

Therefore, with 95% confidence,

<math>\theta^0\in[112 , 181]</math>
<math>\theta^1\in[-2.8 , -1.2]</math>
<math>\theta^2\in[3.6\cdot 10^{-4} , 4.9\cdot 10^{-4}]</math>
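
The whole computation of this example can be reproduced with a short script in Python with NumPy and SciPy, using the data from the table above; the printed values can be compared with the figures quoted in the text:

    import numpy as np
    from scipy import stats

    heights = np.array([58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72], dtype=float)
    weights = np.array([115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164], dtype=float)

    # Design matrix with columns 1, x and x^3, as in the text.
    X = np.column_stack([np.ones_like(heights), heights, heights**3])
    n, p = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    theta_hat = XtX_inv @ X.T @ weights

    residuals = weights - X @ theta_hat
    sigma_hat = np.sqrt(residuals @ residuals / (n - p))

    alpha = 0.05
    t = stats.t.ppf(1.0 - alpha / 2.0, df=n - p)
    half_width = t * sigma_hat * np.sqrt(np.diag(XtX_inv))

    print("theta_hat:", theta_hat)        # compare with (147, -2.0, 4.3e-4)
    print("sigma_hat:", sigma_hat)        # compare with 0.52
    for j in range(p):
        print(f"theta^{j}: [{theta_hat[j] - half_width[j]:.3g}, {theta_hat[j] + half_width[j]:.3g}]")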
