Week 07 - Regression

# Week 07 - Regression
## Introduction to Linear Models
<html>
<div style="float:left">

</div>
<hr color='#EB811B' size=1px width=800px>
</html>
### Danilo Freire
### 6th March 2019

---

<style>

.remark-slide-number {
  position: inherit;
}

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 6px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: #EB811B;
}

.orange {
  color: #EB811B;
}
</style>

# Today's Agenda

* From correlation to linear regression

* Regression equations
]
---

# Correlations and Scatter Plots

```r
bivariate <- read.csv("https://raw.githubusercontent.com/pols1600/pols1600.github.io/master/datasets/prediction/bivariate_data.csv")
bivariate <- subset(bivariate, Year == 2010)
plot(log(bivariate$GDP), bivariate$Child.Mortality, pch = 19, 
     main = "Log GDP Per Capita and Child Mortality")
```

```r
cor(bivariate$Child.Mortality, log(bivariate$GDP), use = "complete")
```

```
## [1] -0.7661286
```
---

# Correlations

* `$cor_{x,y} = \frac{1}{n-1} \sum^{n}_{i=1} \big(\frac{x_i - \bar{x}}{S_x} \times \frac{y_i - \bar{y}}{S_y}\big)$`

* Positive correlation: upward slope

* Negative correlation: downward slope

* High correlation: tighter, close to a line

* Correlation .orange[cannot] capture nonlinear relationship
]
---

# Correlations and Scatter Plots

# From Correlation to Linear Regression

* Linear regression specifies direction of relationship

* Linear regression allows us to examine more than two variables at the same time (statistical control)
]
---

# Linear Regression

.font150[
* Regression has .orange[one dependent (y)] and _for now_ .orange[one independent (x) variable]

* Goal of regression is to approximate the (linear) relationship between X and Y as best as possible

* Regression is the mathematical model to draw best fitting line through cloud of points

* In R: `lm(Y ~ X, data = yourdataset)`
]
---

# Linear Regression

```r
plot(log(bivariate$GDP), bivariate$Child.Mortality, pch = 19, 
     main = "Log GDP Per Capita and Child Mortality",
     xlab = "Log GDP Per Capita", ylab = "Child Mortality")
*abline(lm(Child.Mortality ~ log(GDP), data = bivariate), col = "red", lwd = 2)
```

<img src="week07b_files/figure-html/gap02-1.png" style="display: block; margin: auto;" />
]

.font120[For each x we have a prediction of y: .orange[what would we expect y to be given the value of x?]
]
---

# What is the equation of a line?

.font150[
* `$y = mx + b$`
  - `$m$` is the slope
  - `$b$` is the intercept
  
* .orange[Regression equation]: `$Y = \alpha + \beta X + \epsilon$`
  - `$\alpha$` is the intercept
  - `$\beta$` is the slope
  - `$\epsilon$` is the error term
]
---

# Interpretation

* `$\alpha$`: value of `$Y$` when `$X = 0$`

* `$\beta$`: increase/decrease in `$Y$` when `$X$` varies by one unit

* `$\alpha + \beta X$`: average value of `$Y$` at a given value of `$X$`

* `$\epsilon$`: difference between each point and the fitted line. Assumed to have a mean of zero

* **BUT..**
]
---

# Least Squares

* So we have to _estimate_ the equation using collected data

* Estimates are denoted with little hats: `$\hat{\alpha}$`, `$\hat{\beta}$` (read as "alpha hat", "beta hat")

* We can use `$\hat{\alpha}$`, `$\hat{\beta}$` and `$X$` to create _predicted values of_ `$Y$`

* `$\hat{Y} = \hat{\alpha} + \hat{\beta}X$`, predicted/fitted value

* `$\hat{\epsilon} = Y - \hat{Y} =$` true `$Y$` - predicted `$Y$`

* `$\hat{\epsilon}$` is a good measure of model fit
]
---

# How Do We Estimate the Parameters?

`$$\textsf{SSR} = \sum_{i=1}^n \hat\epsilon_i^2 = \sum_{i=1}^n (Y_i - \hat{Y_{i}})^2 = \sum_{i=1}^n (Y_i - \hat\alpha - \hat\beta X_i)^2$$`
* This also minimises the root mean squared error: `$RMSE = \sqrt{\frac{1}{n}\textsf{SSR}}$`
]
---

# Regression by Hand

* `$\hat\beta = \frac{\sum_{i=1}^n (Y_i - \overline{Y})(X_i - \overline{X})}{\sum_{i=1}^n (X_i - \overline{X})^2}$`

OR...
]
---

# Regression by Hand

* `$\hat\beta = \textsf{correlation of X and Y} \times \frac{\textsf{standard deviation of Y}}{\textsf{standard deviation of X}}$`

* Regression line always goes through the point of averages `$\hat{X},\hat{Y}$`

* `$\hat{Y}  = (\overline{Y} - \hat\beta \overline{X}) + \hat\beta \overline{X} = \overline{Y}$`
]
---

# Example

```r
lm(Child.Mortality ~ log(GDP), data = bivariate)
```

```
## 
## Call:
## lm(formula = Child.Mortality ~ log(GDP), data = bivariate)
## 
## Coefficients:
## (Intercept)     log(GDP)  
##      276.58       -26.13
```
* `$Y = 276.58 - 26.13X + \epsilon$`

* `$Y$` is child mortality in 100,000 births

* .orange[How can we interpret the equation?]
]
---

# GDP Per Capita and Life Expectancy

```r
library(gapminder)
gapminder$gdppc1000 <- gapminder$gdpPercap/1000 # use $1,000s as unit
lm(lifeExp ~ gdppc1000, data = gapminder)   # life expectancy ~ GDPpc
```

```
## 
## Call:
## lm(formula = lifeExp ~ gdppc1000, data = gapminder)
## 
## Coefficients:
## (Intercept)    gdppc1000  
##     53.9556       0.7649
```
How would you interpret that result?
]
---

# Questions?

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html>  
---

# Homework

# See you on Friday!