(Refer back to the Advanced Data Manipulation lesson).
- dplyr verbs
- the pipe
%>%- variable creation
- multiple conditions
- properties of grouped data
- aggregation
- summary functions
- window functions
We’re going to work with a different dataset for the homework here. It’s a cleaned-up excerpt from the Gapminder data. Download the gapminder.csv data by clicking here or using the link above. Download it, and save it in a data/ subfolder of the project directory where you can access it easily from R.
Load the dplyr and readr packages, and read the gapminder data into R using the read_csv() function (n.b. read_csv() is not the same as read.csv()). Assign the data to an object called gm. Run gm to display it.
Note, the code is available by hitting the “Code” button above each expected output, but try not to use it unless you’re stuck.
# Load required libraries
library(dplyr)
library(readr)
# Read the data
gm <- read_csv("data/gapminder.csv")
# Take a look
gm
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779
## 2 Afghanistan Asia 1957 30.3 9240934 821
## 3 Afghanistan Asia 1962 32.0 10267083 853
## 4 Afghanistan Asia 1967 34.0 11537966 836
## 5 Afghanistan Asia 1972 36.1 13079460 740
## 6 Afghanistan Asia 1977 38.4 14880372 786
## 7 Afghanistan Asia 1982 39.9 12881816 978
## 8 Afghanistan Asia 1987 40.8 13867957 852
## 9 Afghanistan Asia 1992 41.7 16317921 649
## 10 Afghanistan Asia 1997 41.8 22227415 635
## # ... with 1,694 more rows
Use dplyr functions to address the following questions:
group_by then summarize with a call to n_distinct(...)).gm %>%
group_by(continent) %>%
summarize(n=n_distinct(country))
## # A tibble: 5 x 2
## continent n
## <chr> <int>
## 1 Africa 52
## 2 Americas 25
## 3 Asia 33
## 4 Europe 30
## 5 Oceania 2
head(n=1))gm %>%
filter(continent == "Europe" & year == 1997) %>%
arrange(gdpPercap) %>%
head(1)
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Albania Europe 1997 73.0 3428038 3193
gm %>%
filter(year == 1982 | year == 1987) %>%
group_by(continent) %>%
summarize(mean_lifeExp = mean(lifeExp))
## # A tibble: 5 x 2
## continent mean_lifeExp
## <chr> <dbl>
## 1 Africa 52.5
## 2 Americas 67.2
## 3 Asia 63.7
## 4 Europe 73.2
## 5 Oceania 74.8
head(n=5))gm %>%
mutate(gdp = gdpPercap*pop) %>%
group_by(country) %>%
summarise(Total.GDP = sum(gdp)) %>%
arrange(desc(Total.GDP)) %>%
head(5)
## # A tibble: 5 x 2
## country Total.GDP
## <chr> <dbl>
## 1 United States 76761922948366
## 2 Japan 25434820411847
## 3 China 20395490651693
## 4 Germany 19496885689973
## 5 United Kingdom 13289373709279
gm %>%
filter(lifeExp >= 80) %>%
select(country, lifeExp, year)
## # A tibble: 22 x 3
## country lifeExp year
## <chr> <dbl> <int>
## 1 Australia 80.4 2002
## 2 Australia 81.2 2007
## 3 Canada 80.7 2007
## 4 France 80.7 2007
## 5 Hong Kong, China 80.0 1997
## 6 Hong Kong, China 81.5 2002
## 7 Hong Kong, China 82.2 2007
## 8 Iceland 80.5 2002
## 9 Iceland 81.8 2007
## 10 Israel 80.7 2007
## # ... with 12 more rows
gm %>%
group_by(country) %>%
summarise(cor = cor(lifeExp, gdpPercap)) %>%
arrange(desc(abs(cor))) %>%
head(10)
## # A tibble: 10 x 2
## country cor
## <chr> <dbl>
## 1 France 0.996
## 2 Austria 0.993
## 3 Belgium 0.993
## 4 Norway 0.992
## 5 Oman 0.991
## 6 United Kingdom 0.990
## 7 Italy 0.990
## 8 Israel 0.988
## 9 Denmark 0.987
## 10 Australia 0.986
!= Asia, group by two variables, summarize, then arrange.)gm %>%
filter(continent != "Asia") %>%
group_by(continent, year) %>%
summarise(mean.pop = mean(pop)) %>%
arrange(desc(mean.pop))
## # A tibble: 48 x 3
## # Groups: continent [4]
## continent year mean.pop
## <chr> <int> <dbl>
## 1 Americas 2007 35954847
## 2 Americas 2002 33990910
## 3 Americas 1997 31876016
## 4 Americas 1992 29570964
## 5 Americas 1987 27310159
## 6 Americas 1982 25211637
## 7 Americas 1977 23122708
## 8 Americas 1972 21175368
## 9 Europe 2007 19536618
## 10 Europe 2002 19274129
## # ... with 38 more rows
gm %>%
group_by(country) %>%
summarize(sd.pop = sd(pop)) %>%
arrange(sd.pop) %>%
head(3)
## # A tibble: 3 x 2
## country sd.pop
## <chr> <dbl>
## 1 Sao Tome and Principe 45906
## 2 Iceland 48542
## 3 Montenegro 99738
gm %>%
arrange(country, year) %>%
group_by(country) %>%
filter(pop < lag(pop) & lifeExp > lag(lifeExp))
## # A tibble: 36 x 6
## # Groups: country [22]
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1982 39.9 12881816 978
## 2 Bosnia and Herzegovina Europe 1992 72.2 4256013 2547
## 3 Bosnia and Herzegovina Europe 1997 73.2 3607000 4766
## 4 Bulgaria Europe 2002 72.1 7661799 7697
## 5 Bulgaria Europe 2007 73.0 7322858 10681
## 6 Croatia Europe 1997 73.7 4444595 9876
## 7 Czech Republic Europe 1997 74.0 10300707 16049
## 8 Czech Republic Europe 2002 75.5 10256295 17596
## 9 Czech Republic Europe 2007 76.5 10228744 22833
## 10 Equatorial Guinea Africa 1977 42.0 192675 959
## # ... with 26 more rows