§7: Using your calculator to investigate regression
The material in this workshop covers the following Core Assessment Standards:
A S 11.4.1 (b)
Represent bivariate numerical data as a scatter plot and suggest intuitively whether a linear, quadratic or exponential function would best fit the data. (Problems should include issues related to health, social, economic, political and environmental issues.)
A S 12.4.1 (b)
Use available technology to calculate the linear regression function which best fits a given set of bivariate numerical data.
A S 12.4.1 (c)
Use available technology to calculate the correlation coefficient of a set of bivariate numerical data to make relevant deductions.
CORRELATION
Correlation is an indication of possible relationships between variables. We can use the correlation coefficient r to tell us the type of correlation as well as the strength of the correlation.
The value of r ranges between –1 and 1.
· The closer r is to –1 or 1 the stronger the relationship between the variables x and y.
· If r is positive, the slope of the line is positive.
· If r is negative, the slope of the line is negative.
· If r is close to 1, a strong positive linear correlation exists: as x increases, y also increases.
· If r is close to –1, a strong negative linear relationship exists: as x increases, y decreases.
· Perfect correlation occurs when all the points fall on the regression line.
Written by Jackie Scheiber and Meg Dickson
© RADMASTE Centre, University of the Witwatersrand – May 2007
FET Data Handling
PRODUCT MOMENT CORRELATION:
The product moment coefficient (sometimes called Pearson’s correlation coefficient) gives an indication of how closely the points on the scatter plot lie to a straight line.
This coefficient not only shows whether a correlation exists but also attempts to show how close the relationship is between two variables. The coefficient is based on a formula which is the mean product of the deviations from the mean (shown by $\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})$) divided by the product of the standard deviations (shown as $\sigma_x \sigma_y$):
$$r = \frac{\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})}{\sigma_x \sigma_y}$$
Where r = the product moment coefficient
x = one of the variables; $\bar{x}$ = the mean of x
y = the other variable; $\bar{y}$ = the mean of y
n = the number of items
$\sigma_x$ = the standard deviation of variable x
$\sigma_y$ = the standard deviation of variable y
You do not need to know this formula; it is given for interest. You can use a scientific calculator to work out the correlation coefficient.
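For readers who would like to see the formula at work, here is a minimal Python sketch of the description above: the mean product of the deviations divided by the product of the standard deviations. The function name `pearson_r` is made up for this illustration. The first data set is taken from Example 2 later in this section; the second is a hypothetical set of points lying exactly on the line y = 2x + 1.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product moment correlation coefficient: mean product of the deviations
    divided by the product of the (population) standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Mean product of the deviations from the means
    mean_product = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Population standard deviations (divide by n, not n - 1)
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return mean_product / (sd_x * sd_y)

# Data from Example 2 later in this section
temperature = [10, 15, 20, 25, 30]
pressure = [100.3, 100.5, 101.0, 101.1, 101.4]
r_example = pearson_r(temperature, pressure)
print(round(r_example, 4))

# Hypothetical points lying exactly on y = 2x + 1 (perfect positive correlation)
r_perfect = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])
print(round(r_perfect, 4))
```

The first value should agree with the r = 0,9826 that the calculator gives in Example 2, and the second should come out as 1, since all the points fall on one line.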
NONSENSE CORRELATION
If the variables appear to be correlated but common sense dictates that there can be no direct relationship between them, then the correlation is called nonsense correlation (or spurious correlation). For example, the number of live births in South Africa and the number of baobab trees in Limpopo province might appear to be correlated, but there is obviously no direct relationship between these sets of data.
LINES OF BEST FIT
The line of best fit shows the general trend of the points scattered on the graph.
· The position of this line is estimated - there should be a balance of points on either side of the line.
· In order to draw an accurate line of best fit, often the mean of the data set is plotted, and the line is drawn through it. (The mean of the data set is found by finding the mean of the x-coordinates and the mean of the y-coordinates.)
Statisticians usually require a more ‘exact’ line. They therefore need a more systematic method that always ensures that none of the points is ‘too far away’ from the line. The more closely the line fits the points, the better the line is. One procedure to do this is called the method of least squares.
LEAST SQUARES METHOD:
In this method you find the vertical distance of each point from the line. You then need to make these distances, taken together, as small as possible.
You calculate each distance by subtracting the y co-ordinate on the line from the y co-ordinate of the point. Since some of the points are above the line and some below the line, some of these distances will be positive and some negative. Squaring each of the little distances makes all numbers positive.
The equation of the line is found by making the total of the squares of the distances as small as possible.
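The idea can be sketched in a few lines of Python. This is only an illustration: the helper name `sum_of_squares` is invented here, the data come from Example 2 later in this section, and the second line (y = 99,8 + 0,05x) is a hypothetical line "drawn by eye" for comparison with the regression line y = 99,74 + 0,056x found in that example.

```python
def sum_of_squares(a, b, xs, ys):
    """Total of the squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Data from Example 2 later in this section
temperature = [10, 15, 20, 25, 30]
pressure = [100.3, 100.5, 101.0, 101.1, 101.4]

# Regression line from Example 2: y = 99.74 + 0.056x
best = sum_of_squares(99.74, 0.056, temperature, pressure)
# Hypothetical alternative line drawn by eye: y = 99.8 + 0.05x
by_eye = sum_of_squares(99.8, 0.05, temperature, pressure)

print(round(best, 4), round(by_eye, 4))
```

The regression line gives the smaller total (about 0,028 against about 0,055), which is what "making the total of the squares as small as possible" means.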
In mathematics the equation of a straight-line graph is usually given as: y = mx + c, where m is the slope or gradient of the graph and c is the y-intercept.
However, in statistics the equation is usually written as: y = a + bx, where a is the y-intercept and b is the slope.
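There are formulae to work out values of a and b in the linear regression line y = a + bx. They are:
$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
and $a = \bar{y} - b\bar{x}$, where $\bar{x}$ and $\bar{y}$ are the arithmetic means of the two variables.
The regression line equation y = a + bx can also be rewritten as: $y = \bar{y} + b(x - \bar{x})$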
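As a rough illustration, the sketch below works out a and b in Python using one standard form of the least squares formulae: b is the sum of the products of the deviations divided by the sum of the squared x-deviations, and a is the mean of y minus b times the mean of x. The arrangement of the formulae in the original handout may differ, and the function name `regression_coefficients` is invented for this example. The data are those of Example 2 later in this section.

```python
def regression_coefficients(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b = sum of products of deviations / sum of squared x-deviations
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # a = mean of y minus b times mean of x
    a = mean_y - b * mean_x
    return a, b

# Data from Example 2 later in this section
temperature = [10, 15, 20, 25, 30]
pressure = [100.3, 100.5, 101.0, 101.1, 101.4]
a, b = regression_coefficients(temperature, pressure)
print(round(a, 2), round(b, 3))
```

The results should match the calculator values A = 99,74 and B = 0,056 found in Example 2.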
· As you can see the formulae for working out the values of r (correlation coefficient) and a and b are very complicated.
· They are beyond the scope of the NCS, which calls for an intuitive understanding only.
Scientific calculators and computer spreadsheets provide a much simpler way of working with these complex formulae.
· On a scientific calculator you can work out the equation of the regression line and the correlation coefficient.
· On a computer you can draw the scatter plot, work out the correlation coefficient and the equation of the regression line.
NON LINEAR CORRELATION AND REGRESSION
This graph shows a scatter plot for the temperature of a cooling cup of coffee and the time taken to cool.
Notice that the points do not form a straight line. Instead they form a curve – in this case an exponential curve with an equation of:
$y = 76.217\,e^{-0.0096x}$
Just as the method of least squares can be used to find the linear regression line it can also be used to find the equation of a quadratic or exponential regression line and the corresponding correlation coefficients.
The calculations are even more complex than those for the linear regression line, so using a computer or calculator is very useful.
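One common way to fit an exponential curve $y = Ae^{Bx}$ is to take the natural logarithm of each y value and then fit a straight line to the points (x ; ln y), since ln y = ln A + Bx. The sketch below assumes that approach; the handout does not say which method the calculator or spreadsheet actually uses. The readings are hypothetical: they are generated from the curve quoted above, $y = 76.217\,e^{-0.0096x}$, and are not measured data.

```python
from math import exp, log

def fit_exponential(xs, ys):
    """Fit y = A * e^(B*x) by least squares on the points (x, ln y)."""
    n = len(xs)
    log_ys = [log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_ly = sum(log_ys) / n
    # Slope of the straight line through (x, ln y) gives B
    B = (sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(xs, log_ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # Intercept of that line is ln A, so A = e^(intercept)
    A = exp(mean_ly - B * mean_x)
    return A, B

# Hypothetical readings generated from the quoted curve (not measured data)
times = [0, 20, 40, 60, 80, 100]
temps = [76.217 * exp(-0.0096 * t) for t in times]

A, B = fit_exponential(times, temps)
print(round(A, 3), round(B, 4))
```

Because the readings lie exactly on the curve, the fit should recover A and B very close to 76,217 and –0,0096. With real measurements the points scatter around the curve and the fitted values are only estimates.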
USING A SCIENTIFIC CALCULATOR TO WORK OUT LINEAR CORRELATION AND THE EQUATION OF REGRESSION LINES
These instructions are for the CASIO fx-82ES calculator.
This calculator display shows numbers and fractions. It also allows you to enter data in table form. You can enter single variable data (x) and double variable data (x ; y).
Before you begin you need to know how to get into the stats mode on the calculator.
Use the [MODE] key:
[MODE] [2:STAT]
The following screen will appear:
1: 1-VAR / 2: A + BX
3: _ + CX2 / 4: ln X
5: e ^ X / 6: A . B ^ X
7: A . X ^ B / 8: 1/X
Meanings of keys:
Key / Menu Item / Statistical Calculation
1 / 1-VAR / Single variable
2 / A + BX / Linear regression
3 / _ + CX2 / Quadratic regression
4 / ln X / Logarithmic regression
5 / e ^ X / Exponential regression
6 / A . B ^ X / Ab exponential regression
7 / A . X ^ B / Power regression
8 / 1/X / Inverse regression
UNIVARIATE DATA
Press [1] for single variable data and a table like this will appear on the screen
x
1 …….…….
2
3
· The calculator is now ready for you to input values of x.
· To enter a data item, enter the number and then press [=]
Entering data items:
· You input data into the cell where the cursor is located.
· You use the REPLAY ARROWS to move the cursor between the cells.
· To enter 135,2 into cell x1
○ Move cursor to x1
○ Enter numbers
x
1 …….…….
2
3
135,2
○ Press [=]
x
1 135,2
2 …….…….
3
To edit data:
· Move cursor to the cell you want to edit
· Input new data and press [=]
To delete a line:
· Move cursor to the line you want to delete
· Press [DEL]
To insert a line:
· Move cursor to line that will be under the line you want to insert.
· Press [SHIFT] [1] (STAT) [3] (edit)
· Press [1] (ins)
To delete all STAT contents
· Press [SHIFT] [1] (STAT) [3] (edit)
· Press [2] (Del-A)
Meaning of common items in single variable menu
Sum sub menu [SHIFT] [1] [4: SUM]
Key / Menu item / What you want to calculate
1 / Σx² / Sum of squares of sample data
2 / Σx / Sum of the sample data
VAR sub menu [SHIFT] [1] [5: VAR]
Key / Menu item / What you want to calculate
1 / n / Number of items
2 / x̄ / Mean of sample data
3 / xσn / Population standard deviation
4 / xσn–1 / Sample standard deviation
EXAMPLE 1
Enter these temperatures, and then use the calculator to calculate the mean, standard deviation, sum of x and the number of elements.
Temperature (°C), x
10
15
20
25
30
SOLUTION:
STEP / KEYS TO PRESS / DISPLAY
1) Get into the STATS MODE FOR SINGLE VARIABLE DATA / [MODE] [2: STAT] [1: 1-VAR]
2) To ENTER DATA into a single variable table / [10] [=] [15] [=] [20] [=] [25] [=] [30] [=] [AC]
3) To find the MEAN / [SHIFT] [1] [5: VAR] [2: x̄] [=] / x̄ = 20
4) To calculate the STANDARD DEVIATION / [AC] [SHIFT] [1] [5: VAR] [3: xσn] [=] / xσn = 7,071
5) To calculate the SUM OF x / [AC] [SHIFT] [1] [4: SUM] [2: Σx] [=] / Σx = 100
6) To find the NUMBER OF ELEMENTS in the sample / [AC] [SHIFT] [1] [5: VAR] [1: n] [=] / n = 5
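If you have a computer to hand, the calculator results for Example 1 can be checked with Python's standard `statistics` module. This is just one possible cross-check, not part of the calculator procedure. Note that `pstdev` divides by n (like the calculator's population standard deviation), while `stdev` divides by n – 1 (like the sample standard deviation).

```python
import statistics

# Temperatures from Example 1
temperatures = [10, 15, 20, 25, 30]

n = len(temperatures)                              # number of elements
total = sum(temperatures)                          # sum of x
mean = statistics.mean(temperatures)               # mean of x
population_sd = statistics.pstdev(temperatures)    # divides by n
sample_sd = statistics.stdev(temperatures)         # divides by n - 1

print(n, total, mean, round(population_sd, 3), round(sample_sd, 3))
```

The first four values should agree with the calculator: n = 5, sum = 100, mean = 20 and population standard deviation = 7,071. The sample standard deviation is a little larger, about 7,906.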
BIVARIATE DATA
Use the [MODE] key to get into the stats mode on the calculator.
Press [2] for double variable data and a table like this will appear on the screen.
x y
1 …….……. …….…….
2
3
The calculator is now ready for you to input values of x and y.
As before, use [=] to enter all the data items.
All keys [2] to [8] give double variable data.
EXAMPLE 2
Suppose you wanted to investigate whether there is a linear relationship between temperature and atmospheric pressure. You collect data as shown in the table below:
Temperature (°C), x / Atmospheric pressure (kPa), y
10 / 100,3
15 / 100,5
20 / 101,0
25 / 101,1
30 / 101,4
SOLUTION:
1) Get into stats mode for double variable data / [MODE] [2: STAT] [2: A + BX]
2) To enter data into double-variable table / Input x values first and then y values. Use the arrows on the REPLAY button to move the cursor to the y column.
For x / For y
[10] [=] / [100,3] [=]
[15] [=] / [100,5] [=]
[20] [=] / [101,0] [=]
[25] [=] / [101,1] [=]
[30] [=] / [101,4] [=]
[AC]
3) To calculate the correlation coefficient / [SHIFT] [1] [7: Reg] [3: r] [=] / r = 0,9826073689
· Once an acceptable relationship has been found between the variables, the equation of the regression line can be found.
If you know the equation of the regression line you can forecast the atmospheric pressure for other temperatures, or the temperature for other pressures.
We can use the calculator to work out the values of a and b in the equation y = a + bx.
STEP / KEYS TO PRESS / DISPLAY
To calculate the value of A / [SHIFT] [1] [7: Reg] [1: A] [=] / A = 99,74
To calculate the value of B / [SHIFT] [1] [7: Reg] [2: B] [=] / B = 0,056
The equation of the line:
· The equation of the regression line is: y = 99,74 + 0,056x
Once the regression line is defined, the calculator allows you to make projections.
For example: Find
1) the temperature if the atmospheric pressure is 100 kPa
2) the atmospheric pressure when the temperature is 18 °C.
STEP / KEYS TO PRESS / DISPLAY / WHAT THIS MEANS
To find the temperature when the pressure is 100 kPa / [100] [SHIFT] [1] [7: Reg] [4: x̂] [=] / x̂ = 4.642857143 / This means the temperature = 4,6 °C (to 1 decimal place) when the pressure is 100 kPa
To find the atmospheric pressure when the temperature is 18 °C / [18] [SHIFT] [1] [7: Reg] [5: ŷ] [=] / ŷ = 100.748 / This means that the pressure = 100,748 kPa when the temperature is 18 °C
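The two projections can also be reproduced directly from the regression equation y = 99,74 + 0,056x. The short Python sketch below does this; the function names `predict_y` and `predict_x` are invented for the illustration. Estimating y for a given x means substituting into the equation, and estimating x for a given y means solving the same equation for x.

```python
# Regression line from Example 2: y = a + b*x
a, b = 99.74, 0.056

def predict_y(x):
    """Estimated pressure (kPa) for a given temperature x (degrees C)."""
    return a + b * x

def predict_x(y):
    """Estimated temperature (degrees C) for a given pressure y (kPa)."""
    return (y - a) / b

temperature_at_100 = predict_x(100)
pressure_at_18 = predict_y(18)
print(round(temperature_at_100, 1), round(pressure_at_18, 3))
```

The results should agree with the calculator display: about 4,6 °C at 100 kPa and 100,748 kPa at 18 °C.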
Quadratic and exponential regression
You can also use the calculator to find the equation of a quadratic or an exponential regression line.
Exponential regression: The equation of an exponential regression line is $y = Ae^{Bx}$, and the keys to press are [MODE] [2: STAT] [5: e^X]
Quadratic regression: The formula for a quadratic regression line is $y = A + Bx + Cx^2$, and the keys to press are [MODE] [2: STAT] [3: _+CX2]
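To give an idea of what the calculator has to do for quadratic regression, here is a minimal Python sketch that finds A, B and C by least squares. It assumes the usual approach of solving three simultaneous equations (the "normal equations") built from sums of powers of x; the handout does not describe the calculator's internal method. The function names and the data are hypothetical: the points lie exactly on y = 1 + 2x + 3x².

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    rows = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest entry in this column
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    # Back substitution
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(rows[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (rows[i][n] - known) / rows[i][i]
    return solution

def fit_quadratic(xs, ys):
    """Least squares fit of y = A + B*x + C*x^2 via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]                  # sums of x^0 .. x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # sums of y*x^0 .. y*x^2
    matrix = [[s[0], s[1], s[2]],
              [s[1], s[2], s[3]],
              [s[2], s[3], s[4]]]
    return solve_linear_system(matrix, t)

# Hypothetical points lying exactly on y = 1 + 2x + 3x^2
xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
A, B, C = fit_quadratic(xs, ys)
print(round(A, 6), round(B, 6), round(C, 6))
```

Since the points were generated from a quadratic, the fit should return values very close to A = 1, B = 2 and C = 3.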
Look in your calculator manual for other information on using the calculator.
EXERCISE
1) Is there a relationship between the maths and science marks of learners in a class? Could we expect that learners who get high marks in science would also get high marks in maths?
Learner / Maths mark x (%) / Science mark y (%)
1 / 65 / 60
2 / 45 / 60
3 / 40 / 55
4 / 55 / 70
5 / 60 / 80
6 / 50 / 40
7 / 80 / 85
8 / 30 / 50
9 / 70 / 70
10 / 65 / 80
a) A calculator was used, and the following values were found: A = 25,2; B = 0,7; r = 0,74. What do these values mean?
b) A scatter graph was drawn to illustrate this information. Draw in the line of best fit.
2) Ten people were asked a set of test questions designed to measure their attitudes to television as a news medium, and a further set to measure their attitude to newspapers. A higher overall score shows greater satisfaction. The scores are shown in the following table.
a) Calculate the correlation coefficient between the two scores.
b) Draw a scatter diagram on the squared paper on the next page to illustrate them.
c) Calculate the regression line and draw it in.