Week 04 - Measurement

# Week 04 - Measurement
## Missing Observations and Data Visualisation
<html>
<div style="float:left">

</div>
<hr color='#EB811B' size=1px width=800px>
</html>
### Danilo Freire
### 13th February 2019

---

<style>

.remark-slide-number {
  position: inherit;
}

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 6px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: #EB811B;
}

.orange {
  color: #EB811B;
}
</style>

# Today's Agenda

* Missing data in R: `is.na()` and `na.omit()`

* Visualising data: `barplot()` and `hist()`

---

# Tables in R

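* `table()` counts the observations in each category; with two variables it returns a cross-tabulation. Example: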

```r
afghan <- read.csv("https://raw.githubusercontent.com/pols1600/pols1600.github.io/master/datasets/measurement/afghan.csv")
names(afghan)
```

```
##  [1] "province"            "district"            "village.id"         
##  [4] "age"                 "educ.years"          "employed"           
##  [7] "income"              "violent.exp.ISAF"    "violent.exp.taliban"
## [10] "list.group"          "list.response"
```

```r
table(ISAF = afghan$violent.exp.ISAF,
      Taliban = afghan$violent.exp.taliban)
```

```
##     Taliban
## ISAF    0    1
##    0 1330  354
##    1  475  526
```
---

# Tables in R: prop.table()

* `prop.table()` turns counts into proportions. It takes a table as its input, so we nest `table()` inside its parentheses:

```r
prop.table(table(ISAF = afghan$violent.exp.ISAF,
                 Taliban = afghan$violent.exp.taliban))
```

```
##     Taliban
## ISAF         0         1
##    0 0.4953445 0.1318436
##    1 0.1769088 0.1959032
```
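
To see what `prop.table()` does with no extra arguments, here is a minimal sketch on made-up data. The vectors `isaf` and `taliban` are hypothetical and are not taken from the `afghan` survey.

```r
# Toy data (hypothetical, not the afghan survey): two binary variables
isaf    <- c(0, 0, 0, 1, 1, 0, 1, 0)
taliban <- c(0, 1, 0, 1, 0, 0, 1, 0)

counts <- table(ISAF = isaf, Taliban = taliban)

# With no margin argument, prop.table() divides each cell by the grand total
by.hand <- counts / sum(counts)
all.equal(prop.table(counts), by.hand)

# So the proportions in the whole table add up to one
sum(prop.table(counts))
```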
---
# Tables in R

.font130[
* Since we're already using nested functions, we can also use `round()` to round the values in each cell

* Notice the `, 2` in the code below. It tells `round()` to keep two decimal places
]

```r
round(prop.table(table(ISAF = afghan$violent.exp.ISAF,
                       Taliban = afghan$violent.exp.taliban)), 2)
```

```
##     Taliban
## ISAF    0    1
##    0 0.50 0.13
##    1 0.18 0.20
```
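
The second argument of `round()` counts decimal places, whereas `signif()` counts significant digits. A short sketch of the difference; the third value in `x` is made up for illustration.

```r
# The first two values come from the proportion table; the third is hypothetical
x <- c(0.4953445, 0.1318436, 12.34567)

round(x, 2)   # keeps two digits after the decimal point
signif(x, 2)  # keeps two significant digits in total
```

The two functions agree for the proportions but differ for the larger number.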
---

# Tables in R: prop.table()

```r
round(prop.table(table(ISAF = afghan$violent.exp.ISAF,
                       Taliban = afghan$violent.exp.taliban)), 2)
```

```
##     Taliban
## ISAF    0    1
##    0 0.50 0.13
##    1 0.18 0.20
```

* What proportion of respondents experienced violence from both the Taliban and ISAF?
---

# Tables in R: prop.table()

```r
round(prop.table(table(Employed = afghan$employed,
                       Income = afghan$income)), 3) * 100
```

```
##         Income
## Employed 10,001-20,000 2,001-10,000 20,001-30,000 less than 2,000
##        0           7.6         20.4           1.4            10.2
##        1          16.1         34.2           2.2             7.4
##         Income
## Employed over 30,000
##        0         0.2
##        1         0.3
```
---

# Missing Data

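* Two types of non-response:
  - Individual (unit) non-response: a sampled person does not take part in the survey at all
  - Item non-response: a respondent takes part but skips particular questions
  
* Both can bias the results when those who do not answer differ systematically from those who do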

* So it is very important that we know where (and think about why) we see gaps in our data
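
A small sketch of how item non-response can bias an estimate. The incomes below are hypothetical, and the pattern of refusals (only the richest respondents decline to answer) is an assumption made for illustration.

```r
# Hypothetical incomes for six respondents (not from the afghan survey)
true.income <- c(1000, 1500, 2000, 2500, 20000, 30000)

# Assumption: respondents earning more than 10,000 refuse to report income,
# so we only observe the remaining answers
observed.income <- true.income[true.income <= 10000]

mean(true.income)      # the quantity we would like to know
mean(observed.income)  # what the survey lets us compute
```

Because the missing answers are not a random subset, the observed mean is far below the true one.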
---

# Missing Data

* R codes missing observations as `NA`. `is.na()` returns `TRUE` for each missing value, so `sum(is.na(x))` counts them and `mean(is.na(x))` gives their share

---

# Missing Data

```r
head(afghan$income, n = 10)
```

```
##  [1] 2,001-10,000  2,001-10,000  2,001-10,000  2,001-10,000  2,001-10,000 
##  [6] <NA>          10,001-20,000 2,001-10,000  2,001-10,000  <NA>         
## 5 Levels: 10,001-20,000 2,001-10,000 20,001-30,000 ... over 30,000
```

```r
sum(is.na(afghan$income)) # number of missings
```

```
## [1] 154
```

```r
round(mean(is.na(afghan$income)), 2) # proportion of missings
```

```
## [1] 0.06
```
---

# Missing Data

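* Functions such as `mean()` return `NA` when the data contain missing values. Adding `na.rm = TRUE` removes the `NA`s before the calculation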

```r
# Victims of Taliban violence
sum(is.na(afghan$violent.exp.taliban))
```

```
## [1] 54
```

```r
mean(afghan$violent.exp.taliban)
```

```
## [1] NA
```

```r
round(mean(afghan$violent.exp.taliban, na.rm = TRUE), 2)
```

```
## [1] 0.33
```
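
The same argument works in other summary functions. A minimal sketch on a made-up vector `x`:

```r
# Hypothetical vector with one missing value
x <- c(2, 4, NA, 6)

sum(x)                    # NA, because one value is unknown
sum(x, na.rm = TRUE)      # sum of the observed values only
median(x, na.rm = TRUE)   # median of the observed values
max(x, na.rm = TRUE)      # largest observed value
```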

---

# Missing Data

* To drop the missing observations themselves, use the `na.omit()` command

```r
length(afghan$violent.exp.taliban)
```

```
## [1] 2754
```

```r
taliban.no.missing <- na.omit(afghan$violent.exp.taliban)
length(taliban.no.missing)
```

```
## [1] 2700
```

```r
length(afghan$violent.exp.taliban) - length(taliban.no.missing)
```

```
## [1] 54
```
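
`na.omit()` behaves differently on a whole data frame: it drops every row that has a missing value in any column. A sketch with a hypothetical data frame `survey`:

```r
# Toy data frame (hypothetical values) with NAs in different columns
survey <- data.frame(age    = c(25, 40, NA, 33),
                     income = c(1000, NA, 2000, 1500))

# On a vector, only the missing entries are dropped
length(na.omit(survey$age))

# On a data frame, every row with at least one NA is dropped
survey.complete <- na.omit(survey)
nrow(survey.complete)
```

This is why applying `na.omit()` to a full data set can remove many more observations than applying it to a single variable.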

---

# Missing Data

```r
summary(afghan$violent.exp.taliban)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.3289  1.0000  1.0000      54
```

```r
sum(is.na(afghan$violent.exp.taliban))
```

```
## [1] 54
```
---

# Data Visualisation

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> 
---

# Bar Plots

* Proportion of observations in each category as the height of each bar

* Options:
  - `main = "Title"`
  - `xlab = "X label"`
  - `ylab = "Y label"`
  - `xlim = c(number, number)` limits for the x-axis
  - `ylim = c(number, number)` limits for the y-axis
  - `names.arg = c("Bar labels")` in the same order as the categories of the variable
  - `horiz = TRUE` for horizontal plots
  - `col = "colour name"` bar colour
  
* You can use `barplot()` with `prop.table()` instead of pie charts
---

# Bar Plots

```r
employed.ptable <- prop.table(table(afghan$employed))
employed.ptable
```

```
## 
##         0         1 
## 0.4172113 0.5827887
```

```r
employed.ptable <- prop.table(table(afghan$employed))
barplot(employed.ptable,
        names.arg = c("Unemployed", "Employed"), 
        main = "Proportion of Employed Afghans",
        xlab = "Employment",
        ylab = "Proportion",
        ylim = c(0, 0.6))
```
---


# Bar Plots

```r
barplot(employed.ptable,
        names.arg = c("Unemployed", "Employed"), # 0 and 1, respectively
        main = "Proportion of Employed Afghans",
        ylab = "Employment", # change the axes
        xlab = "Proportion", 
        xlim = c(0, 0.7), # now it's xlim 
        horiz = TRUE,     # because the plot is horizontal
        col = "brown")
```
---


# Histograms

* They are similar to bar plots

* Numeric variables are _binned_ into groups

* Histograms show the density of each bin

* Important: the height of a bar is the share of observations in the bin divided by the bin width

* We care less about the density of each bin than about the distribution of the variable as a whole

* The area of each bar is the share of observations that fall into that bin

* The areas of all bars sum to one
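
The height and area rules above can be checked by hand. This is a sketch on hypothetical ages, with bin boundaries chosen only to make the widths unequal; `plot = FALSE` asks `hist()` to return its numbers without drawing anything.

```r
# Toy ages (hypothetical, not the afghan data)
age <- c(18, 22, 25, 27, 31, 34, 38, 45, 52, 60)

# Bins with unequal widths, chosen only for illustration
bins <- c(10, 30, 40, 70)

# plot = FALSE returns the numbers behind the histogram without drawing it
h <- hist(age, breaks = bins, plot = FALSE)

bin.width <- diff(h$breaks)
share     <- h$counts / sum(h$counts)

# Height = share of observations in the bin divided by the bin width
by.hand <- share / bin.width
all.equal(by.hand, h$density)

# Area of each bar = height * width, and the areas sum to one
sum(h$density * bin.width)
```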
---

# Histograms

* In `hist()`, we can add `freq = FALSE` to plot the density of each bin instead of the count

* `breaks =` changes the number of bins or, given a vector, their boundaries

* Densities are useful to compare different distributions

* .orange[Densities are not percentages]: "percentage per horizontal unit"
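
A sketch of why densities help when comparing distributions. The two samples below are hypothetical: `large` simply repeats `small` ten times, so both have the same shape but different sizes.

```r
# Two hypothetical samples with the same shape but different sizes
small <- c(1, 2, 2, 3)
large <- rep(small, times = 10)   # 40 observations

bins <- c(0, 1, 2, 3)

h.small <- hist(small, breaks = bins, plot = FALSE)
h.large <- hist(large, breaks = bins, plot = FALSE)

# Counts depend on the sample size ...
h.small$counts
h.large$counts

# ... but densities do not, so the two histograms line up
all.equal(h.small$density, h.large$density)
```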
---

# Histograms

```r
# For colours, see: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
hist(afghan$age,
     main = "Histogram - Age",
     xlab = "Age",
     ylim = c(0, 0.04),
     freq = FALSE,
     col = "darkorange2") 
```
---

# Histograms

<img src="week04b_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
---

# Histograms

* To annotate a histogram, call `text()` and `abline()` after `hist()`. Both draw on top of the existing plot

```r
hist(afghan$age,
     main = "Histogram - Age",
     xlab = "Age",
     ylim = c(0, 0.04),
     freq = FALSE,
     col = "darkorange2") 
## add a text label at (x, y) = (35, 0.035)
text(x = 35, y = 0.035, "median")
## add a vertical line representing median
abline(v = median(afghan$age))
```
---

# Histograms

<img src="week04b_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />
---

# Questions?

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> 
---

# Homework

---

# See you on Friday!