
Correlation and regression analysis for curve fitting
The techniques described on this page are used to investigate relationships between two variables (x and y). Is a change in one of these variables associated with a change in the other? For example, if we increase the temperature do we increase the growth rate of a culture or the rate of a chemical reaction? Does an increase in DDT content of bird tissues correlate with thinning of the egg shell? Is an increase in slug density in a field plot associated with a decrease in seedling development?
In some cases we use the technique of correlation to test the statistical significance of the association. In other cases we use regression analysis to describe the relationship precisely by means of an equation that has predictive value. We deal separately with these two types of analysis - correlation and regression - because they have different roles.
Correlation
Suppose that we took 7 mice and measured their body weight and their length from nose to tail. We obtained the following results and want to know if there is any relationship between the measured variables. [To keep the calculations simple, we will use small numbers]
Mouse   Units of weight (x)   Units of length (y)
1       1                     2
2       4                     5
3       3                     8
4       4                     12
5       8                     14
6       9                     19
7       8                     22
Procedure
(1) Plot the results on graph paper. This is the essential first step, because only then can we see what the relationship might be - is it linear, logarithmic, sigmoid, etc?
In our case the relationship seems to be linear, so we will continue on that assumption. If it does not seem to be linear we might need to transform the data.
(2) Set out a table as follows and calculate Σx, Σy, Σx², Σy², Σxy, and the means x̄ and ȳ.
         Weight (x)   Length (y)   x²     y²     xy
Mouse 1  1            2            1      4      2
Mouse 2  4            5            16     25     20
Mouse 3  3            8            9      64     24
Mouse 4  4            12           16     144    48
Mouse 5  8            14           64     196    112
Mouse 6  9            19           81     361    171
Mouse 7  8            22           64     484    176
Total    Σx = 37      Σy = 82      Σx² = 251    Σy² = 1278    Σxy = 553
Mean     x̄ = 5.286    ȳ = 11.714
(3) Calculate the sum of squares of x: Σx² − (Σx)²/n = 251 − 37²/7 = 55.429 in our case.
(4) Calculate the sum of squares of y: Σy² − (Σy)²/n = 1278 − 82²/7 = 317.429 in our case.
(5) Calculate the sum of products: Σxy − (Σx)(Σy)/n (this can be positive or negative) = 553 − (37 × 82)/7 = 119.571.
(6) Calculate r (the correlation coefficient):

r = [Σxy − (Σx)(Σy)/n] / √{[Σx² − (Σx)²/n] × [Σy² − (Σy)²/n]} = 119.571 / √(55.429 × 317.429) = 0.9014 in our case.
(7) Look up r in a table of correlation coefficients (ignoring the + or - sign). The number of degrees of freedom is two less than the number of points on the graph (5 df in our example, because we have 7 points). If our calculated r value exceeds the tabulated value at p = 0.05 then the correlation is significant. Our calculated value (0.9014) does exceed the tabulated value (0.754). It also exceeds the tabulated value for p = 0.01 but not for p = 0.001. In other words, if the null hypothesis were true (that there is no relationship between length and weight) we would have obtained a correlation coefficient as high as this fewer than 1 time in 100. So we can be confident that weight and length are positively correlated in our sample of mice.
Important notes:
1. If the calculated r value is positive (as in this case) then the slope will rise from left to right on the graph. As weight increases, so does the length. If the calculated value of r is negative the slope will fall from left to right. This would indicate that length decreases as weight increases.
2. The r value will always lie between -1 and +1. If you have an r value outside of this range you have made an error in the calculations.
3. Remember that a correlation does not necessarily demonstrate a causal relationship. A significant correlation only shows that two factors vary in a related way (positively or negatively). This is obvious in our example because there is no logical reason to think that weight influences the length of the animal (both factors are influenced by age or growth stage). But it can be easy to fall into the "causality trap" when looking at other types of correlation.
What does the correlation coefficient mean?
The equation for r can also be written in terms of the deviations (d) of each x and y value from its mean:

r = Σ(dx dy) / √(Σdx² × Σdy²)

The part above the line in this equation is a measure of the degree to which x and y vary together (using the deviations of each from the mean). The part below the line is a measure of the degree to which x and y vary separately.
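For readers who want to check the arithmetic in code, here is a minimal Python sketch (not part of the original page; it assumes Python with SciPy installed) that reproduces steps 2-6 for the mouse data and cross-checks the result against scipy.stats.pearsonr:

```python
# A sketch of the correlation calculation above (mouse weight vs length).
import math
from scipy import stats

weight = [1, 4, 3, 4, 8, 9, 8]        # x, units of weight
length = [2, 5, 8, 12, 14, 19, 22]    # y, units of length
n = len(weight)

sum_x, sum_y = sum(weight), sum(length)
sum_xy = sum(x * y for x, y in zip(weight, length))
ss_x = sum(x * x for x in weight) - sum_x ** 2 / n    # 55.429
ss_y = sum(y * y for y in length) - sum_y ** 2 / n    # 317.429
sp = sum_xy - sum_x * sum_y / n                       # 119.571

r = sp / math.sqrt(ss_x * ss_y)
print(f"r = {r:.4f}")                                 # 0.9014

# Cross-check; the p-value replaces the table look-up in step 7.
r_check, p = stats.pearsonr(weight, length)
print(f"scipy: r = {r_check:.4f}, p = {p:.4f}")       # p lies between 0.01 and 0.001
```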

Regression analysis: fitting a line to the data
It would be tempting to try to fit a line to the data we have just analysed - producing an equation that shows the relationship, so that we might predict the body weight of mice by measuring their length, or vice-versa. The method for this is called linear regression.
However, this is not strictly valid, because linear regression is based on a number of assumptions. In particular, one of the variables must be "fixed" experimentally and/or precisely measurable. So the simple linear regression methods can be used only when we define some experimental variable (temperature, pH, dosage, etc.) and test the response of another variable to it.
The variable that we fix (or choose deliberately) is termed the independent variable. It is always plotted on the X axis. The other variable is termed the dependent variable and is plotted on the Y axis.
Suppose that we had the following results from an experiment in which we measured the growth of a cell culture (as optical density) at different pH levels.
pH    Optical density
3     0.1
4     0.2
4.5   0.25
5     0.32
5.5   0.33
6     0.35
6.5   0.47
7     0.49
7.5   0.53
When plotted, these results suggest a straight-line relationship. Using the same procedures as for correlation, set out a table as follows and calculate Σx, Σy, Σx², Σy², Σxy, and the means x̄ and ȳ.
        pH (x)    Optical density (y)   x²      y²       xy
        3         0.1                   9       0.01     0.3
        4         0.2                   16      0.04     0.8
        4.5       0.25                  20.25   0.0625   1.125
        5         0.32                  25      0.1024   1.6
        5.5       0.33                  30.25   0.1089   1.815
        6         0.35                  36      0.1225   2.1
        6.5       0.47                  42.25   0.2209   3.055
        7         0.49                  49      0.2401   3.43
        7.5       0.53                  56.25   0.2809   3.975
Total   Σx = 49   Σy = 3.04   Σx² = 284   Σy² = 1.1882   Σxy = 18.2
Mean    x̄ = 5.444   ȳ = 0.3378
Now calculate the sum of squares of x: Σx² − (Σx)²/n = 284 − 49²/9 = 17.22 in our case.
Calculate the sum of squares of y: Σy² − (Σy)²/n = 1.1882 − 3.04²/9 = 0.1614 in our case.
Calculate the sum of products: Σxy − (Σx)(Σy)/n (this can be positive or negative) = 18.2 − (49 × 3.04)/9 = +1.649.
Now we want to use regression analysis to find the line of best fit to the data. We have done nearly all the work for this in the calculations above.
The regression equation for y on x is: y = bx + a where b is the slope and a is the intercept (the point where the line crosses the y axis)
We calculate b as:

b = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n] = 1.649 / 17.22 = 0.0958 in our case
We calculate a as:
a = ȳ − b x̄
From the known values of ȳ (0.3378), x̄ (5.444) and b (0.0958) we thus find a (−0.1837).
So the equation for the line of best fit is: y = 0.096x - 0.184 (to 3 decimal places).
To draw the line through the data points, we substitute in this equation. For example:
when x = 4, y = 0.200, so one point on the line has the x,y coordinates (4, 0.200);
when x = 7, y = 0.488, so another point on the line has the x,y coordinates (7, 0.488).
It is also true that the line of best fit always passes through the point with coordinates (x̄, ȳ), so we actually need only one other calculated point in order to draw a straight line.
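The same slope and intercept can be reproduced in a few lines of code. This is a minimal sketch (not from the original page; plain Python is assumed) of the hand calculation above:

```python
# A sketch of the regression hand-calculation, using the pH (x) and
# optical density (y) data from the table above.
ph = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
od = [0.1, 0.2, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]
n = len(ph)

mean_x = sum(ph) / n                                             # 5.444
mean_y = sum(od) / n                                             # 0.3378
ss_x = sum(x * x for x in ph) - sum(ph) ** 2 / n                 # 17.22
sp = sum(x * y for x, y in zip(ph, od)) - sum(ph) * sum(od) / n  # 1.649

b = sp / ss_x                                # slope, 0.0958
a = mean_y - b * mean_x                      # intercept, -0.1837
print(f"y = {b:.3f}x {a:+.3f}")              # y = 0.096x - 0.184

# Points for drawing the line; it always passes through (mean_x, mean_y).
for x in (4, 7):
    print(f"x = {x}: y = {b * x + a:.3f}")   # 0.200 and 0.487
```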
Regression analysis using Microsoft Excel
Below is a printout of the Regression analysis from Microsoft "Excel". It is obtained simply by entering two columns of data (x and y) and then clicking "Tools - Data analysis - Regression". We see that it gives us the correlation coefficient r (as "Multiple R"), the intercept, and the slope of the line (shown as the coefficient for pH in the last line of the table). It also shows the result of an Analysis of Variance (ANOVA) to calculate the significance of the regression (4.36 × 10⁻⁷).
Regression Statistics
Multiple R          0.989133329
R Square            0.978384742
Adjusted R Square   0.975296848
Standard Error      0.022321488
Observations        9

ANOVA
             df   SS         MS         F          Significance F
Regression   1    0.157868   0.157868   316.8453   4.36E-07
Residual     7    0.003488   0.000498
Total        8    0.161356

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   -0.18348387    0.030215         -6.07269   0.000504   -0.25493    -0.11204
pH          0.095741935    0.005379         17.80015   4.36E-07   0.083023    0.108461
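For those working outside Excel, the same output can be obtained with SciPy. This sketch (an addition to the original page, assuming SciPy is installed) uses scipy.stats.linregress on the same data:

```python
# A sketch reproducing the Excel printout with SciPy's linregress.
from scipy import stats

ph = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
od = [0.1, 0.2, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]

fit = stats.linregress(ph, od)
print(f"slope     = {fit.slope:.6f}")       # 0.095742  (coefficient for pH)
print(f"intercept = {fit.intercept:.6f}")   # -0.183484 (Intercept)
print(f"r         = {fit.rvalue:.6f}")      # 0.989133  (Multiple R)
print(f"R^2       = {fit.rvalue**2:.6f}")   # 0.978385  (R Square)
print(f"p         = {fit.pvalue:.2e}")      # 4.36e-07  (Significance F)
```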
Presenting the results
The final graph should show:
(i) all measured data points;
(ii) the line of best fit;
(iii) the equation for the line;
(iv) the R2 and p values.
Further applications: logarithmic and sigmoid curves
When we plot our initial results on a graph it will usually be clear whether they best fit a linear relationship, a logarithmic relationship, or something else, like a sigmoid curve. We can analyse all these relationships in exactly the same way as above if we transform the x and y values as appropriate, so that the relationship between x and y becomes linear. BEWARE - you MUST look at a scatter plot on graph paper to see what type of relationship you have. If you simply instruct a computer program such as "Excel" to run a regression on untransformed data, it will do this by assuming that the relationship is linear!
(i) For plots of data that suggest exponential (logarithmic) growth, convert all y values to log of y (using either log10 or loge). Then go through the linear regression procedures above, using the log y data instead of the y data (see the sketch after point (ii) below).
(ii) For sigmoid curves (drug dose response curves and UV killing curves are often sigmoid), the y values (proportion of the population responding to the treatment) can be converted using a logistic or probit transformation. Sometimes it is useful to convert the x (dose) data to logarithms; this condenses the x values, removing the long tails of non-responding individuals at the lowest and highest dose levels. A plot of logistic or probit (y) against dose (x) or log of dose (x) should show a straight-line relationship.
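As an illustration of point (i), here is a hedged Python sketch (the growth figures are invented for illustration, and SciPy is assumed) of a log transformation followed by the ordinary regression procedure:

```python
# Hypothetical exponential growth data: log-transforming y gives a line.
import math
from scipy import stats

time = [0, 1, 2, 3, 4, 5]                    # x (e.g. hours)
cells = [100, 210, 395, 820, 1600, 3150]     # y, roughly doubling each step
log_cells = [math.log10(y) for y in cells]   # transform y to log10(y)

fit = stats.linregress(time, log_cells)
print(f"log10(y) = {fit.slope:.3f}x + {fit.intercept:.3f} (r = {fit.rvalue:.4f})")

# Back-transformation recovers a biologically meaningful quantity:
print(f"estimated doubling time = {math.log10(2) / fit.slope:.2f} time units")
```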
Converting between percentage, arcsin, logistic and probits in ‘Excel’
The table below shows part of a page from an 'Excel' worksheet, produced as an exercise to show how transformations are performed. Columns in the worksheet are headed A-G and rows are labelled 1-21, so each cell in the table can be identified (e.g. B2 or F11). Representative proportions were inserted in cells A2-A21, and % values were inserted in cells B2-B21.
Then a formula was entered in cell C2 to convert Proportions to logistic values
The logistic transformation converts y to log(y/(1-y))
The formula (without spaces) entered into cell C2 was: =LOG(A2/(1-A2))
This formula is not seen in the cell, but as soon as we move out of cell C2 it automatically gives the logistic value (in C2) for the proportion in cell A2, seen in the printout below. Copying and then pasting this formula into every other cell of column C produces a corresponding logistic value (e.g. cell C3 contains the logistic value of the proportion in cell A3).
Similarly, a formula was entered in cell D2 to convert Percentage to Probit values.
The formula (without spaces) is: =NORMINV(B2/100,5,1). This was then pasted into all cells of column D.
Next, a formula was entered in cell E2 to convert Probit to Percentage, and pasted into all cells of column E.
The formula is: =NORMDIST(D2,5,1,TRUE)*100
The formula entered in cell F2 converts Percentage to Arcsine.
The formula is: =ASIN(SQRT(B2/100))*180/PI()
The formula in cell G2 converts Arcsine to Percentage.
The formula is: =SIN(F2/180*PI())^2*100
      A           B         C                        D            E            F            G
 1    Proportion  Percent   Proportion to logistic   % to Probit  Probit to %  % to arcsin  arcsin to %
 2    0.001       0.1       -2.99957                 1.91         0.1          1.812        0.1
 3    0.005       0.5       -2.29885                 2.424        0.5          4.055        0.5
 4    0.01        1         -1.99564                 2.674        1            5.739        1
 5    0.02        2         -1.6902                  2.946        2            8.13         2
 6    0.03        3         -1.50965                 3.119        3            9.974        3
 7    0.04        4         -1.38021                 3.249        4            11.54        4
 8    0.05        5         -1.27875                 3.355        5            12.92        5
 9    0.06        6         -1.19498                 3.445        6            14.18        6
10    0.07        7         -1.12338                 3.524        7            15.34        7
11    0.08        8         -1.0607                  3.595        8            16.43        8
12    0.09        9         -1.0048                  3.659        9            17.46        9
13    0.1         10        -0.95424                 3.718        10           18.43        10
14    0.5         50        0                        5            50           45           50
15    0.96        96        1.380211                 6.751        96           78.46        96
16    0.97        97        1.50965                  6.881        97           80.03        97
17    0.98        98        1.690196                 7.054        98           81.87        98
18    0.995       99.5      2.298853                 7.576        99.5         85.95        99.5
19    0.9999      99.99     3.999957                 8.719        99.99        89.43        99.99
20    0.99999     99.999    4.999996                 9.265        99.999       89.82        99.999
21    0.999999    99.9999   6                        9.768        100          89.94        99.9999
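The same transformations can be written outside Excel. Here is a hedged Python sketch (SciPy is assumed; norm.ppf and norm.cdf play the roles of Excel's NORMINV and NORMDIST):

```python
# Python equivalents of the worksheet formulas above.
import math
from scipy.stats import norm

def proportion_to_logistic(p):      # =LOG(A2/(1-A2))
    return math.log10(p / (1 - p))

def percent_to_probit(pct):         # =NORMINV(B2/100,5,1)
    return norm.ppf(pct / 100, loc=5, scale=1)

def probit_to_percent(probit):      # =NORMDIST(D2,5,1,TRUE)*100
    return norm.cdf(probit, loc=5, scale=1) * 100

def percent_to_arcsin(pct):         # =ASIN(SQRT(B2/100))*180/PI()
    return math.degrees(math.asin(math.sqrt(pct / 100)))

def arcsin_to_percent(deg):         # =SIN(F2/180*PI())^2*100
    return math.sin(math.radians(deg)) ** 2 * 100

print(proportion_to_logistic(0.001))   # -2.99957 (worksheet cell C2)
print(percent_to_probit(0.1))          # 1.91     (cell D2)
print(percent_to_arcsin(0.1))          # 1.812    (cell F2)
```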
As an example of the use of transformations, the data from a fictitious dose-response curve (table below) can be plotted in two ways - first without transformation, giving a sigmoid curve, and then after transforming the proportion responding to logistic values, giving a straight line.
Dose   Proportion   Logistic
1      0.01         -1.99564
2      0.015        -1.81734
3      0.02         -1.6902
4      0.04         -1.38021
5      0.045        -1.32679
6      0.05         -1.27875
7      0.07         -1.12338
8      0.1          -0.95424
9      0.19         -0.62973
10     0.25         -0.47712
11     0.34         -0.28807
12     0.44         -0.10474
13     0.53         0.052178
14     0.62         0.212608
15     0.68         0.327359
16     0.74         0.454258
17     0.79         0.575408
18     0.83         0.688629
19     0.85         0.753328
20     0.88         0.865301
21     0.9          0.954243
22     0.92         1.060698
23     0.935        1.157898
24     0.95         1.278754

Paired data, correlation & regression

Paired Sample t-test
Correlation Coefficient
Pearson's Product Moment Correlation Coefficient
Spearman Rank Correlation Coefficient
Least Squares
Regression Equation
Regression Line
Simple Linear Regression
Multiple Regression
Nonlinear Regression
Residual
Multiple Regression Correlation Coefficient
Stepwise Regression
Dummy Variable (in regression)
Transformation to Linearity



Paired Sample t-test A paired sample t-test is used to determine whether there is a significant difference between the average values of the same measurement made under two different conditions. Both measurements are made on each unit in a sample, and the test is based on the paired differences between these two values. The usual null hypothesis is that the difference in the mean values is zero. For example, the yield of two strains of barley is measured in successive years in twenty different plots of agricultural land (the units) to investigate whether one crop gives a significantly greater yield than the other, on average.
The null hypothesis for the paired sample t-test is
H0: µd = µ1 - µ2 = 0
where µd is the mean value of the paired differences.
This null hypothesis is tested against one of the following alternative hypotheses, depending on the question posed:
H1: µd ≠ 0
H1: µd > 0
H1: µd < 0
The paired sample t-test is a more powerful alternative to a two sample procedure, such as the two sample t-test, but can only be used when we have matched samples.
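A minimal sketch of such a test in Python (the yield numbers are invented for illustration; scipy.stats.ttest_rel performs the pairing):

```python
# A minimal paired sample t-test on invented plot-yield data.
from scipy import stats

strain_a = [4.2, 3.9, 4.5, 4.1, 4.8, 3.7, 4.4, 4.0]   # yield of strain A per plot
strain_b = [3.8, 3.6, 4.4, 3.9, 4.5, 3.4, 4.1, 3.9]   # strain B on the same plots

t_stat, p_value = stats.ttest_rel(strain_a, strain_b)  # two-sided by default
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```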


Correlation Coefficient A correlation coefficient is a number between -1 and 1 which measures the degree to which two variables are linearly related. If there is a perfect linear relationship with positive slope between the two variables, we have a correlation coefficient of 1; if there is positive correlation, whenever one variable has a high (low) value, so does the other. If there is a perfect linear relationship with negative slope between the two variables, we have a correlation coefficient of -1; if there is negative correlation, whenever one variable has a high (low) value, the other has a low (high) value. A correlation coefficient of 0 means that there is no linear relationship between the variables.
There are a number of different correlation coefficients that might be appropriate depending on the kinds of variables being studied.
See also Pearson's Product Moment Correlation Coefficient.
See also Spearman Rank Correlation Coefficient.


Pearson's Product Moment Correlation Coefficient Pearson's product moment correlation coefficient, usually denoted by r, is one example of a correlation coefficient. It is a measure of the linear association between two variables that have been measured on interval or ratio scales, such as the relationship between height in inches and weight in pounds. However, it can be misleadingly small when there is a relationship between the variables but it is a non-linear one.
There are procedures, based on r, for making inferences about the population correlation coefficient. However, these make the implicit assumption that the two variables are jointly normally distributed. When this assumption is not justified, a non-parametric measure such as the Spearman Rank Correlation Coefficient might be more appropriate.
See also correlation coefficient.


Spearman Rank Correlation Coefficient The Spearman rank correlation coefficient is one example of a correlation coefficient. It is usually calculated on occasions when it is not convenient, economic, or even possible to give actual values to variables, but only to assign a rank order to instances of each variable. It may also be a better indicator that a relationship exists between two variables when the relationship is non-linear.
Commonly used procedures, based on the Pearson's Product Moment Correlation Coefficient, for making inferences about the population correlation coefficient make the implicit assumption that the two variables are jointly normally distributed. When this assumption is not justified, a non-parametric measure such as the Spearman Rank Correlation Coefficient might be more appropriate.
See also correlation coefficient.
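A small sketch (invented data, assuming SciPy) contrasting the two coefficients on a monotonic but non-linear relationship:

```python
# Spearman's coefficient ranks the data, so a perfect monotonic trend gives 1
# even when the relationship is far from linear.
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 2, 4, 8, 16, 32, 64, 128]        # exponential: monotonic, not linear

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
print(f"Pearson r    = {r:.3f}")         # noticeably below 1
print(f"Spearman rho = {rho:.3f}")       # exactly 1
```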


Least Squares The method of least squares is a criterion for fitting a specified model to observed data. For example, it is the most commonly used method of defining a straight line through a set of points on a scatterplot.
See also regression equation.
See also regression line.


Regression Equation A regression equation allows us to express the relationship between two (or more) variables algebraically. It indicates the nature of the relationship between two (or more) variables. In particular, it indicates the extent to which you can predict some variables by knowing others, or the extent to which some are associated with others.
A linear regression equation is usually written
Y = a + bX + e
where
Y is the dependent variable
a is the intercept
b is the slope or regression coefficient
X is the independent variable (or covariate)
e is the error term
The equation will specify the average magnitude of the expected change in Y given a change in X.
The regression equation is often represented on a scatterplot by a regression line.


Regression Line A regression line is a line drawn through the points on a scatterplot to summarise the relationship between the variables being studied. When it slopes down (from top left to bottom right), this indicates a negative or inverse relationship between the variables; when it slopes up (from bottom left to top right), a positive or direct relationship is indicated.
The regression line often represents the regression equation on a scatterplot.


Simple Linear Regression Simple linear regression aims to find a linear relationship between a response variable and a possible predictor variable by the method of least squares.


Multiple Regression Multiple linear regression aims to find a linear relationship between a response variable and several possible predictor variables.


Nonlinear Regression Nonlinear regression aims to describe the relationship between a response variable and one or more explanatory variables in a non-linear fashion.


Residual Residual (or error) represents unexplained (or residual) variation after fitting a regression model. It is the difference (or left over) between the observed value of the variable and the value suggested by the regression model.


Multiple Regression Correlation Coefficient The multiple regression correlation coefficient, R², is a measure of the proportion of variability explained by, or due to, the regression (linear relationship) in a sample of paired data. It is a number between zero and one; a value close to zero suggests a poor model.
A very high value of R² can arise even though the relationship between the two variables is non-linear. The fit of a model should never simply be judged from the R² value.
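To make that caution concrete, here is a small sketch (invented data, assuming SciPy) in which a straight line fitted to clearly curved data still yields a high R²:

```python
# A straight line fitted to perfectly quadratic data still gives a high
# R-squared; the data are invented to make the point.
from scipy import stats

x = list(range(1, 11))
y = [v ** 2 for v in x]                  # quadratic, not linear

fit = stats.linregress(x, y)
print(f"R^2 = {fit.rvalue ** 2:.3f}")    # about 0.95 despite the wrong model
# Moral: inspect a scatterplot and the residuals, not just the R-squared value.
```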


Stepwise Regression A 'best' regression model is sometimes developed in stages. A list of several potential explanatory variables is available, and this list is repeatedly searched for variables which should be included in the model. The best explanatory variable is used first, then the second best, and so on. This procedure is known as stepwise regression.


Dummy Variable (in regression) In regression analysis we sometimes need to modify the form of non-numeric variables, for example sex, or marital status, to allow their effects to be included in the regression model. This can be done through the creation of dummy variables whose role it is to identify each level of the original variables separately.
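As an illustration (a sketch assuming the pandas library; the variable names and values are invented), pd.get_dummies creates one indicator column per level:

```python
# A sketch of dummy-variable creation with pandas (invented data).
import pandas as pd

df = pd.DataFrame({"sex": ["M", "F", "F", "M"],
                   "yield_t_ha": [4.1, 3.8, 3.9, 4.4]})

# One indicator column per level (0/1, or True/False in recent pandas versions);
# drop_first=True omits one level to avoid collinearity in a regression model.
dummies = pd.get_dummies(df["sex"], prefix="sex", drop_first=True)
print(pd.concat([df, dummies], axis=1))
```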


Transformation to Linearity Transformations allow us to change all the values of a variable by using some mathematical operation; for example, we can change a number, group of numbers, or an equation by multiplying or dividing by a constant or by taking the square root. A transformation to linearity is a transformation of a response variable, or independent variable, or both, which produces an approximate linear relationship between the variables.


