**Regression and correlation**

Correlation can be used to examine whether two variables in a data set are connected, that is, whether the values of one variable change systematically with the values of the other. Such a relationship is called correlation. If the values of one variable increase as the values of the other increase, the correlation is positive. If the values of one variable decrease as the values of the other increase, the correlation is negative.

**Example 1**

Below are Liisa-Petter's grades for the short maths courses and the number of exercises she did before each exam. Determine whether the number of exercises and the course grade depend on each other by determining the correlation coefficient.

We enter the data into LibreOffice Calc and write a formula for the correlation coefficient.

The correlation coefficient is 0.43.
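The same kind of calculation can be sketched outside a spreadsheet. The Python snippet below is a minimal illustration of the Pearson correlation coefficient. The function name `pearson_r` and the sample numbers are made up for illustration; they are not the data from this example, so the result will not be the value stated above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, NOT the table from the example
exercises = [12, 30, 18, 25, 8, 22]
grades = [6, 9, 7, 7, 5, 8]
print(round(pearson_r(exercises, grades), 2))
```

In LibreOffice Calc, the built-in `CORREL` function performs the corresponding calculation on two cell ranges.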

If the coefficient is between 0.0 and 0.3 or between -0.3 and 0.0, the correlation is insignificant, and the number of exercises has no effect on the grades. Between 0.3 and 0.6 or between -0.6 and -0.3, the correlation is moderate, i.e. the exercises may have some effect on the grade. Between 0.6 and 0.8 or between -0.8 and -0.6, the correlation is considerable, and there is an effect on the grades. A clear connection, i.e. a strong correlation, lies between 0.8 and 1.0 or between -1.0 and -0.8.

That is, the number of exercises completed seems to have some connection to the grade, since 0.43 falls in the moderate range.
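The ranges above can be written as a small lookup. This is only a sketch: the function name `describe_correlation` is hypothetical, and because the ranges in the text share their end points, the code assumes that a value exactly on a boundary belongs to the weaker category.

```python
def describe_correlation(r):
    """Map a correlation coefficient to the verbal categories used in the text."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient lies between -1 and 1")
    strength = abs(r)  # the sign only tells the direction of the relationship
    # Assumption: a value exactly on a boundary counts in the weaker category
    if strength <= 0.3:
        return "insignificant"
    if strength <= 0.6:
        return "moderate"
    if strength <= 0.8:
        return "considerable"
    return "clear"

print(describe_correlation(0.43))  # moderate
print(describe_correlation(-0.7))  # considerable
```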

**Example 2**

Information about preparation for a maths test was collected from ten participants. They were asked how many hours they spent studying before the test.

The table lists their preparation times and their test scores.

There is a strong correlation between the preparation time and the score, *r = 0.94*, i.e. a longer preparation time went together with a better result in the test. The explanation rate (coefficient of determination) is 88%. That is, the preparation time explains 88% of the variation in the scores.
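For a regression line with one explanatory variable, the explanation rate is the square of the correlation coefficient. A short check of the figure above in Python (the variable names are illustrative only):

```python
r = 0.94                   # correlation coefficient from the example
r_squared = r ** 2         # explanation rate (coefficient of determination)
print(f"{r_squared:.0%}")  # 88%
```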

Use LibreOffice to determine the equation of the regression line and predict how many points you would get with 13 hours of preparation.

The regression line is y = 2.94⋅x + 3.9, and the predicted score for 13 hours of preparation is 42.
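The prediction comes from substituting x = 13 into the equation of the line. As a minimal sketch in Python, with the hypothetical function name `predict_score` and the slope and intercept taken from the line above:

```python
def predict_score(hours, slope=2.94, intercept=3.9):
    """Score predicted by the regression line y = slope * x + intercept."""
    return slope * hours + intercept

print(round(predict_score(13), 2))  # 42.12, i.e. about 42 points
```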

The regression line can be used to make predictions. Forming a regression line is meaningful only when there is a dependency relationship between the variables, i.e. the correlation is significant.
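For readers who want to see what the spreadsheet does behind the scenes, the following is a rough sketch of a least-squares line fit in Python. The function name `fit_line` and the numbers are assumptions made for illustration; they are not the data of Example 2, so the resulting equation differs from the one given there.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariation of x and y divided by the variation of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical preparation times (hours) and scores, NOT the example's table
hours = [1, 2, 4, 5, 7, 8]
scores = [7, 10, 15, 19, 24, 28]
slope, intercept = fit_line(hours, scores)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```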

Creating a regression line with LibreOffice Calc is covered in the video below.

**Turn on the subtitles if needed**

## Outliers

If the data contain an observation that differs completely from the others, it may be, for example, a measurement error. Such a value, called an outlier, is removed before the regression line is created.

**Example 3**

Liisa-Petter measured her happiness on a 12-point scale. She recorded the value every hour after waking up. The measurements were made on a day when nothing special happened, neither bad nor good. The results are in the table below.

**All data**


A correlation of 0.6 shows only a moderate relationship, but the happiness value measured 9 hours after waking up differs completely from the other observations. This is probably a measurement error.

**Outlier removed**


When the outlier is removed, the correlation rises to a strong 0.91.
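The effect of a single outlier can be tried out with a short Python sketch. The measurements below are invented for illustration (they are not the values from the table), and the helper `pearson_r` is a hypothetical name; the point is only that dropping one deviating observation can change the correlation considerably.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical measurements, NOT the example's table
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
happiness = [3, 4, 4, 5, 6, 6, 7, 8, 1, 9]  # the value at hour 9 looks like an outlier

r_all = pearson_r(hours, happiness)

# Drop the suspicious observation (hour 9) and recompute
kept = [(h, v) for h, v in zip(hours, happiness) if h != 9]
clean_hours = [h for h, _ in kept]
clean_happiness = [v for _, v in kept]
r_clean = pearson_r(clean_hours, clean_happiness)

print(round(r_all, 2), round(r_clean, 2))
```

Here the outlier is removed by hand because it was already identified as a probable measurement error; the sketch does not try to detect outliers automatically.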

**Exercises**

## Exercise 1

Above are the grades for Liisa-Petter's short maths courses and the number of exercises she completed before the exam. Determine whether the number of exercises and the course grade depend on each other by determining the correlation coefficient.

Use either LibreOffice or GeoGebra for this.

## Exercise 2

Above is the data for two statistical variables. Determine if there is a dependency between the variables by determining the correlation coefficient.

Use either LibreOffice or GeoGebra for this.

## Exercise 3

Ten individuals were asked their age and asked to rate their happiness on a scale of 0-100. Determine the correlation coefficient and find out whether there is a relationship between age and the feeling of happiness.

Use either LibreOffice or GeoGebra for this.

## Exercise 4

The table illustrates the relationship between variables X and Y. What kind of dependency is there between them?

Also explain the correlation coefficient between the variables. Justify your answer.