class: center, middle, inverse, title-slide

# Week 04 - Measurement
## Missing Observations and Data Visualisation
### Danilo Freire
### 13th February 2019

---

<style>
.remark-slide-number { position: inherit; }
.remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 6px; display: block; left: 0; right: 0; }
.remark-slide-number .progress-bar { height: 100%; background-color: #EB811B; }
.orange { color: #EB811B; }
</style>

# Today's Agenda

.font150[
* Tables: `table()` and `prop.table()`
* Missing data in R: `is.na()` and `na.omit()`
* Visualising data: `barplot()` and `hist()`
]

---

# Tables in R

.font130[
* We have learned how to use the `table()` command in R
* Example:
]

.code120[
```r
afghan <- read.csv("https://raw.githubusercontent.com/pols1600/pols1600.github.io/master/datasets/measurement/afghan.csv")
names(afghan)
```

```
##  [1] "province"            "district"            "village.id"
##  [4] "age"                 "educ.years"          "employed"
##  [7] "income"              "violent.exp.ISAF"    "violent.exp.taliban"
## [10] "list.group"          "list.response"
```

```r
table(ISAF = afghan$violent.exp.ISAF,
      Taliban = afghan$violent.exp.taliban)
```

```
##     Taliban
## ISAF    0    1
##    0 1330  354
##    1  475  526
```
]

---

# Tables in R: prop.table()

.font130[
* We can also use `prop.table()` to see the proportion of cases in each cell
* We have to include `table()` within parentheses too:
]

.code120[
```r
prop.table(table(ISAF = afghan$violent.exp.ISAF,
                 Taliban = afghan$violent.exp.taliban))
```

```
##     Taliban
## ISAF         0         1
##    0 0.4953445 0.1318436
##    1 0.1769088 0.1959032
```
]

---

# Tables in R

.font130[
* Since we're already using nested functions, we can also use `round()` to round the values in each cell
* Notice the `, 2` in the code below.
  It indicates that we will round the numbers to two decimal places
]

.code120[
```r
round(prop.table(table(ISAF = afghan$violent.exp.ISAF,
                       Taliban = afghan$violent.exp.taliban)), 2)
```

```
##     Taliban
## ISAF    0    1
##    0 0.50 0.13
##    1 0.18 0.20
```
]

.font150[
* The table is now much easier to read and it conveys the same information
]

---

# Tables in R: prop.table()

.code120[
```r
round(prop.table(table(ISAF = afghan$violent.exp.ISAF,
                       Taliban = afghan$violent.exp.taliban)), 2)
```

```
##     Taliban
## ISAF    0    1
##    0 0.50 0.13
##    1 0.18 0.20
```
]

.font150[
* What's the percentage of people who have been victimised by the Taliban?
* And by both the Taliban and ISAF?
]

---

# Tables in R: prop.table()

.font150[
* Another example, now rounded to 3 decimal places and then multiplied by 100
]

.code120[
```r
round(prop.table(table(Employed = afghan$employed,
                       Income = afghan$income)), 3) * 100
```

```
##         Income
## Employed 10,001-20,000 2,001-10,000 20,001-30,000 less than 2,000
##        0           7.6         20.4           1.4            10.2
##        1          16.1         34.2           2.2             7.4
##         Income
## Employed over 30,000
##        0         0.2
##        1         0.3
```
]

---

# Missing Data

.font150[
* Not all individuals respond to surveys
* Two types of non-response:
  - Individual non-response
  - Item non-response
* Both tend to bias the results
* So it is very important that we know where (and think about why) we see gaps in our data
]

---

# Missing Data

.font150[
* R has a special code for missing data, `NA`
* Since `NA` is only used for missing observations, we can count them with `is.na()`
]

---

# Missing Data

.code120[
```r
head(afghan$income, n = 10)
```

```
##  [1] 2,001-10,000  2,001-10,000  2,001-10,000  2,001-10,000  2,001-10,000
##  [6] <NA>          10,001-20,000 2,001-10,000  2,001-10,000  <NA>
## 5 Levels: 10,001-20,000 2,001-10,000 20,001-30,000 ... over 30,000
```

```r
sum(is.na(afghan$income)) # number of missings
```

```
## [1] 154
```

```r
round(mean(is.na(afghan$income)), 2) # proportion of missings
```

```
## [1] 0.06
```
]

---

# Missing Data

.font150[
* Some R functions return `NA` when the data contain missing values
* To ignore the missing values, we add `na.rm = TRUE` to the code
]

.code120[
```r
# Victims of Taliban violence
sum(is.na(afghan$violent.exp.taliban))
```

```
## [1] 54
```

```r
mean(afghan$violent.exp.taliban)
```

```
## [1] NA
```

```r
round(mean(afghan$violent.exp.taliban, na.rm = TRUE), 2)
```

```
## [1] 0.33
```
]

.font150[
* Why do you think we have missing data here?
]

---

# Missing Data

.font150[
* We can create a new variable or data set without missing observations
* Use the `na.omit()` command
]

.code150[
```r
length(afghan$violent.exp.taliban)
```

```
## [1] 2754
```

```r
taliban.no.missing <- na.omit(afghan$violent.exp.taliban)
length(taliban.no.missing)
```

```
## [1] 2700
```

```r
length(afghan$violent.exp.taliban) - length(taliban.no.missing)
```

```
## [1] 54
```
]

---

# Missing Data

.font150[
* You can also see the number of missing observations with `summary()`
]

.code150[
```r
summary(afghan$violent.exp.taliban)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##  0.0000  0.0000  0.0000  0.3289  1.0000  1.0000      54
```

```r
sum(is.na(afghan$violent.exp.taliban))
```

```
## [1] 54
```
]

---
class: inverse, center, middle

# Data Visualisation

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html>

---

# Bar Plots

.font120[
* Bar plots are used to visualise .orange[factor/character variables]
* Proportion of observations in each category as the height of each bar
* Options:
  - `main = "Title"`
  - `xlab = "X label"`
  - `ylab = "Y label"`
  - `xlim = c(number, number)` limits for the x variable
  - `ylim = c(number, number)` limits for the y variable
  - `names.arg = c("Bars labels")` - in the same order as the categories of the variable
  - `horiz = TRUE` for horizontal plots
  - `col = "colour name"` bar colour
* You can use `barplot()` with `prop.table()` instead of pie charts
]

---

# Bar Plots

```r
employed.ptable <- prop.table(table(afghan$employed))
employed.ptable
```

```
##
##         0         1
## 0.4172113 0.5827887
```

```r
employed.ptable <- prop.table(table(afghan$employed))
barplot(employed.ptable,
        names.arg = c("Unemployed", "Employed"),
        main = "Proportion of Employed Afghanis",
        xlab = "Employment",
        ylab = "Proportion",
        ylim = c(0, 0.6))
```

---

# Bar Plots

<img src="week04b_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

# Bar Plots

```r
barplot(employed.ptable,
        names.arg = c("Unemployed", "Employed"), # 0 and 1, respectively
        main = "Proportion of Employed Afghanis",
        ylab = "Employment",                     # change the axes
        xlab = "Proportion",
        xlim = c(0, 0.7),                        # now it's xlim
        horiz = TRUE,                            # because the plot is horizontal
        col = "brown")
```

---

# Bar Plots

<img src="week04b_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

# Histograms

.font120[
* We use histograms to display the distribution of a .orange[numeric variable]
* They are similar to bar plots
* Numeric variables are _binned_ into groups
* Histograms show the density of each bin
* Important: the height is the
  share of observations in a bin divided by the bin width
* We care less about the density of each bin than about the distribution of the variable as a whole
* The area of each bar is the share of observations that fall into that bin
* The areas of all bins sum to one
]

---

# Histograms

.font120[
* Many options are similar to those of `barplot()`: `main`, `xlab`, `ylim`, `col`
* We can also add `freq = FALSE` to plot densities instead of counts
* `breaks =` changes the size of the bins
* Densities are useful to compare different distributions
* .orange[Densities are not percentages]: they are "percentage per horizontal unit"
]

---

# Histograms

```r
# For colours, see: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
hist(afghan$age,
     main = "Histogram - Age",
     xlab = "Age",
     ylim = c(0, 0.04),
     freq = FALSE,
     col = "darkorange2")
```

---

# Histograms

<img src="week04b_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

# Histograms

.font150[
* We can also add text and lines to R plots
* Use `text()` and `abline()` after `hist()`
]

```r
hist(afghan$age,
     main = "Histogram - Age",
     xlab = "Age",
     ylim = c(0, 0.04),
     freq = FALSE,
     col = "darkorange2")
## add a text label at (x, y) = (35, 0.035)
text(x = 35, y = 0.035, "median")
## add a vertical line representing median
abline(v = median(afghan$age))
```

---

# Histograms

<img src="week04b_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Questions?

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html>

---

# Homework

.font150[
* `MEASUREMENT01`
]

---
class: inverse, center, middle

# See you on Friday!

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html>
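
---

# Appendix: Checking Histogram Densities

.font120[
* The histogram slides state that the height of a bar is the share of observations divided by the bin width, and that the areas sum to one
* The sketch below checks both claims on simulated data, so it runs without the survey file
* The vector `ages`, the seed, and the chosen breaks are assumptions made for illustration only; they are not part of the `afghan` data
]

```r
# Hypothetical ages, simulated only for this illustration
set.seed(1600)
ages <- sample(15:80, size = 500, replace = TRUE)

# plot = FALSE returns the bin information without drawing the plot
h <- hist(ages, breaks = seq(10, 80, by = 10), plot = FALSE)

bin.width <- diff(h$breaks)            # width of each bin
bin.share <- h$counts / sum(h$counts)  # share of observations in each bin

# Density is the share of observations divided by the bin width
density.check <- all.equal(h$density, bin.share / bin.width)
density.check

# Area of each bar is density times width; the areas should sum to one
total.area <- sum(h$density * bin.width)
total.area
```

.font120[
* With bins of equal width the densities are just rescaled shares, but the same check should also hold for unequal `breaks`
]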