Advanced Data Manipulation Homework

(Refer back to the Advanced Data Manipulation lesson).

Key Concepts

dplyr verbs

the pipe %>%

variable creation

multiple conditions

properties of grouped data

aggregation

summary functions

window functions

Getting Started

We’re going to work with a different dataset for the homework here. It’s a cleaned-up excerpt from the Gapminder data. Download the gapminder.csv data by clicking here or using the link above. Download it, and save it in a data/ subfolder of the project directory where you can access it easily from R.

Load the dplyr and readr packages, and read the gapminder data into R using the read_csv() function (n.b. read_csv() is not the same as read.csv()). Assign the data to an object called gm. Run gm to display it.

Note, the code is available by hitting the “Code” button above each expected output, but try not to use it unless you’re stuck.

# Load required libraries
library(dplyr)
library(readr)

# Read the data
gm <- read_csv("data/gapminder.csv")

# Take a look
gm

## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <chr>       <chr>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333       779
##  2 Afghanistan Asia       1957    30.3  9240934       821
##  3 Afghanistan Asia       1962    32.0 10267083       853
##  4 Afghanistan Asia       1967    34.0 11537966       836
##  5 Afghanistan Asia       1972    36.1 13079460       740
##  6 Afghanistan Asia       1977    38.4 14880372       786
##  7 Afghanistan Asia       1982    39.9 12881816       978
##  8 Afghanistan Asia       1987    40.8 13867957       852
##  9 Afghanistan Asia       1992    41.7 16317921       649
## 10 Afghanistan Asia       1997    41.8 22227415       635
## # ... with 1,694 more rows

Problem set

Use dplyr functions to address the following questions:

How many unique countries are represented per continent? (Hint: group_by then summarize with a call to n_distinct(...)).

gm %>% 
  group_by(continent) %>% 
  summarize(n=n_distinct(country))

## # A tibble: 5 x 2
##   continent     n
##   <chr>     <int>
## 1 Africa       52
## 2 Americas     25
## 3 Asia         33
## 4 Europe       30
## 5 Oceania       2

Which European nation had the lowest GDP per capita in 1997? (Hint: filter, arrange, head(n=1))

gm %>%
    filter(continent == "Europe" & year == 1997) %>%
    arrange(gdpPercap) %>%
    head(1)

## # A tibble: 1 x 6
##   country continent  year lifeExp     pop gdpPercap
##   <chr>   <chr>     <int>   <dbl>   <int>     <dbl>
## 1 Albania Europe     1997    73.0 3428038      3193

According to the data available, what was the average life expectancy across each continent in the 1980s? (Hint: filter, group_by, summarize)

gm %>%
    filter(year == 1982 | year == 1987) %>%
    group_by(continent) %>%
    summarize(mean_lifeExp = mean(lifeExp))

## # A tibble: 5 x 2
##   continent mean_lifeExp
##   <chr>            <dbl>
## 1 Africa            52.5
## 2 Americas          67.2
## 3 Asia              63.7
## 4 Europe            73.2
## 5 Oceania           74.8

What 5 countries have the highest total GDP over all years combined? (Hint: GDP per capita is simply GDP divided by the total population size. To get GDP back, you’d mutate to calculate GDP as the product of GDP per capita times the population size. Mutate, group_by, summarize, arrange, head(n=5))

gm %>%
    mutate(gdp = gdpPercap*pop) %>%
    group_by(country) %>%
    summarise(Total.GDP = sum(gdp)) %>%
    arrange(desc(Total.GDP)) %>%
    head(5)

## # A tibble: 5 x 2
##   country             Total.GDP
##   <chr>                   <dbl>
## 1 United States  76761922948366
## 2 Japan          25434820411847
## 3 China          20395490651693
## 4 Germany        19496885689973
## 5 United Kingdom 13289373709279

What countries and years had life expectancies of at least 80 years? N.b. only output the columns of interest: country, life expectancy and year (in that order).

gm %>%
    filter(lifeExp >= 80) %>%
    select(country, lifeExp, year)

## # A tibble: 22 x 3
##    country          lifeExp  year
##    <chr>              <dbl> <int>
##  1 Australia           80.4  2002
##  2 Australia           81.2  2007
##  3 Canada              80.7  2007
##  4 France              80.7  2007
##  5 Hong Kong, China    80.0  1997
##  6 Hong Kong, China    81.5  2002
##  7 Hong Kong, China    82.2  2007
##  8 Iceland             80.5  2002
##  9 Iceland             81.8  2007
## 10 Israel              80.7  2007
## # ... with 12 more rows

What 10 countries have the strongest correlation (in either direction) between life expectancy and per capita GDP?

gm %>%
    group_by(country) %>%
    summarise(cor = cor(lifeExp, gdpPercap)) %>%
    arrange(desc(abs(cor))) %>%
    head(10)

## # A tibble: 10 x 2
##    country          cor
##    <chr>          <dbl>
##  1 France         0.996
##  2 Austria        0.993
##  3 Belgium        0.993
##  4 Norway         0.992
##  5 Oman           0.991
##  6 United Kingdom 0.990
##  7 Italy          0.990
##  8 Israel         0.988
##  9 Denmark        0.987
## 10 Australia      0.986

Which combinations of continent (besides Asia) and year have the highest average population across all countries? N.b. your output should include all results sorted by highest average population. (Hint: filter where continent != Asia, group by two variables, summarize, then arrange.)

gm %>%
    filter(continent != "Asia") %>%
    group_by(continent, year) %>%
    summarise(mean.pop = mean(pop)) %>%
    arrange(desc(mean.pop))

## # A tibble: 48 x 3
## # Groups:   continent [4]
##    continent  year mean.pop
##    <chr>     <int>    <dbl>
##  1 Americas   2007 35954847
##  2 Americas   2002 33990910
##  3 Americas   1997 31876016
##  4 Americas   1992 29570964
##  5 Americas   1987 27310159
##  6 Americas   1982 25211637
##  7 Americas   1977 23122708
##  8 Americas   1972 21175368
##  9 Europe     2007 19536618
## 10 Europe     2002 19274129
## # ... with 38 more rows

Which three countries have had the most consistent population estimates (i.e. lowest standard deviation) across the years of available data?

gm %>%
    group_by(country) %>%
    summarize(sd.pop = sd(pop)) %>%
    arrange(sd.pop) %>%
    head(3)

## # A tibble: 3 x 2
##   country               sd.pop
##   <chr>                  <dbl>
## 1 Sao Tome and Principe  45906
## 2 Iceland                48542
## 3 Montenegro             99738

Bonus! Which observations indicate that the population of a country has decreased from the previous year and the life expectancy has increased from the previous year? See the vignette on window functions.

gm %>% 
  arrange(country, year) %>% 
  group_by(country) %>% 
  filter(pop < lag(pop) & lifeExp > lag(lifeExp))

## # A tibble: 36 x 6
## # Groups:   country [22]
##    country                continent  year lifeExp      pop gdpPercap
##    <chr>                  <chr>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan            Asia       1982    39.9 12881816       978
##  2 Bosnia and Herzegovina Europe     1992    72.2  4256013      2547
##  3 Bosnia and Herzegovina Europe     1997    73.2  3607000      4766
##  4 Bulgaria               Europe     2002    72.1  7661799      7697
##  5 Bulgaria               Europe     2007    73.0  7322858     10681
##  6 Croatia                Europe     1997    73.7  4444595      9876
##  7 Czech Republic         Europe     1997    74.0 10300707     16049
##  8 Czech Republic         Europe     2002    75.5 10256295     17596
##  9 Czech Republic         Europe     2007    76.5 10228744     22833
## 10 Equatorial Guinea      Africa     1977    42.0   192675       959
## # ... with 26 more rows