Paired data, correlation & regression
Paired Sample t-test
A paired sample t-test is used to determine whether there is a
significant difference between the average values of the same
measurement made under two different conditions. Both measurements are
made on each unit in a sample, and the test is based on the paired
differences between these two values. The usual null hypothesis is that
the difference in the mean values is zero. For example, the yield of
two strains of barley is measured in successive years in twenty
different plots of agricultural land (the units) to investigate whether
one crop gives a significantly greater yield than the other, on average.
- The null hypothesis for the paired sample t-test is
- H0: d = µ1 - µ2 = 0
- where d is the mean value of the paired differences.
- This null hypothesis is tested against one of the following alternative hypotheses, depending on the question posed:
- H1: d ≠ 0
H1: d > 0
H1: d < 0
The paired sample t-test is a more powerful alternative to a two sample
procedure, such as the two sample t-test, but can only be used when the samples are paired (matched).
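The paired differences and the resulting test statistic can be computed directly. The sketch below uses made-up barley yields for illustration (the figures are assumptions, not data from the example above); the statistic is t = d̄ / (s_d / √n) with n − 1 degrees of freedom.

```python
import math
import statistics

# Hypothetical yields (tonnes/ha) of two barley strains on the same plots.
strain_a = [4.2, 3.9, 4.8, 4.1, 4.5, 3.8, 4.6, 4.3]
strain_b = [3.9, 3.7, 4.5, 4.2, 4.1, 3.6, 4.4, 4.0]

# The test works with the paired differences, one per plot.
diffs = [a - b for a, b in zip(strain_a, strain_b)]
n = len(diffs)
d_bar = statistics.mean(diffs)   # mean of the paired differences
s_d = statistics.stdev(diffs)    # sample standard deviation of the differences

# Test statistic: t = d_bar / (s_d / sqrt(n)), with n - 1 degrees of freedom.
t_stat = d_bar / (s_d / math.sqrt(n))
print(f"mean difference = {d_bar:.3f}, t = {t_stat:.3f}, df = {n - 1}")
```

The computed t would then be compared with the t-distribution on n − 1 degrees of freedom to obtain a p-value.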
Correlation Coefficient
A correlation coefficient is a number between -1 and 1 which measures
the degree to which two variables are linearly related. If there is a
perfect linear relationship with positive slope between the two
variables, we have a correlation coefficient of 1; if there is positive
correlation, whenever one variable has a high (low) value, so does the
other. If there is a perfect linear relationship with negative slope
between the two variables, we have a correlation coefficient of -1; if
there is negative correlation, whenever one variable has a high (low)
value, the other has a low (high) value. A correlation coefficient of 0
means that there is no linear relationship between the variables.
There are a number of different correlation coefficients that might
be appropriate depending on the kinds of variables being studied.
See also Pearson's Product Moment
Correlation Coefficient.
See also Spearman Rank Correlation
Coefficient.
Pearson's Product Moment Correlation Coefficient
Pearson's product moment correlation coefficient, usually denoted by
r, is one example of a correlation coefficient. It is a measure of the
linear association between two variables that have been measured on
interval or ratio scales, such as the relationship between height in
inches and weight in pounds. However, it can be misleadingly small when
there is a relationship between the variables but it is a non-linear
one.
There are procedures, based on r, for making inferences about the
population correlation coefficient. However, these make the implicit
assumption that the two variables are jointly normally distributed. When
this assumption is not justified, a non-parametric measure such as the Spearman Rank Correlation Coefficient might be more appropriate.
See also correlation coefficient.
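Pearson's r can be computed from first principles as the covariance of the two variables divided by the product of their standard deviations. The heights and weights below are made-up illustrative values, not real measurements.

```python
import math

def pearson_r(x, y):
    """Pearson's product moment correlation coefficient of paired samples x, y."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Sum of cross-products of deviations, and the two sums of squares.
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Hypothetical heights (inches) and weights (pounds).
heights = [63, 65, 68, 70, 72, 75]
weights = [127, 135, 150, 158, 170, 185]
print(f"r = {pearson_r(heights, weights):.3f}")
```

For exactly linear data with positive slope the function returns 1; for the roughly linear data above it returns a value close to 1.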
Spearman Rank Correlation Coefficient
The Spearman rank correlation coefficient is one example of a
correlation coefficient. It is usually calculated on occasions when it
is not convenient, economical, or even possible to give actual values to
variables, but only to assign a rank order to instances of each
variable. It may also be a better indicator that a relationship exists
between two variables when the relationship is non-linear.
Commonly used procedures, based on the Pearson's Product Moment Correlation Coefficient,
for making inferences about the population correlation coefficient make
the implicit assumption that the two variables are jointly normally
distributed. When this assumption is not justified, a non-parametric
measure such as the Spearman Rank Correlation Coefficient might be more
appropriate.
See also correlation coefficient.
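One common way to compute the Spearman coefficient is to replace each value by its rank (averaging ranks for ties) and then apply Pearson's formula to the ranks. The data below are an assumed monotone but non-linear example: the rank correlation is exactly 1 even though the raw relationship is curved.

```python
def rank(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    sorted_vals = sorted(values)
    ranks = []
    for v in values:
        first = sorted_vals.index(v)        # 0-based position of first occurrence
        count = sorted_vals.count(v)        # number of tied copies of v
        ranks.append(first + (count + 1) / 2)  # average of the tied rank positions
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r applied to the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Monotone but non-linear relationship (y = x squared on positive x).
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(f"rho = {spearman_rho(x, y):.3f}")
```

Because the ranks increase in step, rho is 1 here, while Pearson's r on the raw values would be somewhat below 1.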
Least Squares
The method of least squares is a criterion for fitting a specified
model to observed data. For example, it is the most commonly used method
of defining a straight line through a set of points on a scatterplot.
See also regression equation.
See also regression line.
Regression Equation
A regression equation allows us to express the relationship between
two (or more) variables algebraically. It indicates the nature of the
relationship between two (or more) variables. In particular, it
indicates the extent to which you can predict some variables by knowing
others, or the extent to which some are associated with others.
- A linear regression equation is usually
written
- Y = a + bX + e
- where
- Y is the dependent variable
a is the intercept
b is the slope or regression coefficient
X is the independent variable (or covariate)
e is the error term
The equation will specify the average magnitude of the expected change in Y given a change in X.
The regression equation is often represented on a scatterplot by a regression line.
Regression Line
A regression line is a line drawn through the points on a scatterplot
to summarise the relationship between the variables being studied. When
it slopes down (from top left to bottom right), this indicates a
negative or inverse relationship between the variables; when it slopes
up (from bottom left to top right), a positive or direct relationship is
indicated.
The regression line often represents the regression equation on a scatterplot.
Simple Linear Regression
Simple linear regression aims to find a linear relationship between a
response variable and a possible predictor variable by the method of
least squares.
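The least squares estimates of the intercept a and slope b have closed forms: b is the sum of cross-products of deviations divided by the sum of squared deviations of x, and a = ȳ − b·x̄. A minimal sketch on made-up data:

```python
def least_squares_fit(x, y):
    """Fit y = a + b*x by least squares; returns (intercept a, slope b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Slope: covariance of x and y divided by variance of x (up to the common 1/n).
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx   # the fitted line passes through the point of means
    return a, b

# Hypothetical response values that lie near the line y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]
a, b = least_squares_fit(x, y)
print(f"fitted line: y = {a:.3f} + {b:.3f}x")
```

The fitted line always passes through the point (x̄, ȳ), which is why the intercept follows directly once the slope is known.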
Multiple Regression
Multiple linear regression aims to find a linear relationship
between a response variable and several possible predictor variables.
Nonlinear Regression
Nonlinear regression aims to describe the relationship between a
response variable and one or more explanatory variables in a non-linear
fashion.
Residual
A residual (or error) represents the unexplained (or residual) variation
remaining after fitting a regression model. It is the difference (or left over)
between the observed value of the variable and the value predicted by
the regression model.
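Residuals are simply observed minus fitted values. In the sketch below the fitted line y = 2x is assumed for illustration rather than estimated from the data:

```python
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

# Assumed fitted line y = a + b*x, here with a = 0 and b = 2.
a, b = 0.0, 2.0

# Residual for each observation: observed value minus fitted value.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(residuals)
```

Plotting these residuals against x (or against the fitted values) is a standard check on the adequacy of the model: a clear pattern in the residuals suggests the model is missing something.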
Multiple Regression Correlation Coefficient
The multiple regression correlation coefficient, R², is a measure of
the proportion of variability explained by, or due to, the regression
(linear relationship) in a sample of paired data. It is a number
between zero and one, and a value close to zero suggests a poor model.
A very high value of R² can arise even though the relationship
between the two variables is non-linear. The fit of a model should never
simply be judged from the R² value.
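R² can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The observed and fitted values below are assumed figures for illustration:

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_residual / SS_total for observed y and fitted y_hat."""
    my = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual variation
    ss_tot = sum((yi - my) ** 2 for yi in y)                  # total variation
    return 1 - ss_res / ss_tot

# Observed values and (assumed) fitted values from some regression model.
y     = [3.0, 5.0, 7.0, 9.0, 11.0]
y_hat = [3.2, 4.9, 7.1, 8.8, 11.0]
print(f"R^2 = {r_squared(y, y_hat):.4f}")
```

A value near 1, as here, means the fitted values account for almost all of the variability in y; as the text warns, this alone does not guarantee the model's form is correct.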
Stepwise Regression
A 'best' regression model is sometimes developed in stages. A list of
several potential explanatory variables is available, and this list is
repeatedly searched for variables which should be included in the model.
The best explanatory variable is used first, then the second best, and
so on. This procedure is known as stepwise regression.
Dummy Variable (in regression)
In regression analysis we sometimes need to modify the form of
non-numeric variables, for example sex, or marital status, to allow
their effects to be included in the regression model. This can be done
through the creation of dummy variables whose role it is to identify
each level of the original variables separately.
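A minimal sketch of creating such dummy variables, using an assumed marital status variable. One level is conventionally left out as the reference category, so that the dummies are not perfectly collinear with the intercept:

```python
# A non-numeric variable whose effect is to be included in a regression.
marital_status = ["single", "married", "divorced", "married", "single"]

# "single" is treated as the reference category and gets no dummy of its own.
levels = ["married", "divorced"]

# One 0/1 indicator column per remaining level.
dummies = [[1 if status == level else 0 for level in levels]
           for status in marital_status]
print(dummies)
```

Each row now has a 1 in the column matching its category, and a "single" observation is represented by zeros in both columns. These columns can then enter the regression model like any numeric predictor.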
Transformation to Linearity
Transformations allow us to change all the values of a variable by
using some mathematical operation, for example, we can change a number,
group of numbers, or an equation by multiplying or dividing by a
constant or taking the square root. A transformation to linearity is a
transformation of a response variable, or independent variable, or both,
which produces an approximate linear relationship between the
variables.
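As a sketch, suppose the response follows an assumed exponential relationship y = 2·exp(0.5x). Taking logarithms of y gives log(y) = log(2) + 0.5x, which is linear in x:

```python
import math

# Data generated from an assumed exponential relationship y = 2 * exp(0.5 * x).
x = [0, 1, 2, 3, 4]
y = [2 * math.exp(0.5 * xi) for xi in x]

# After the log transformation, log(y) is a straight-line function of x.
log_y = [math.log(yi) for yi in y]

# Successive differences of log(y) for unit steps in x all equal the slope 0.5.
slopes = [log_y[i + 1] - log_y[i] for i in range(len(x) - 1)]
print(slopes)
```

After such a transformation, ordinary simple linear regression can be applied to the transformed data.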