Tidy Data and Advanced Data Manipulation

Recommended reading prior to class: Sections 1-3 of Wickham, H. “Tidy Data.” Journal of Statistical Software 59:10 (2014).

Review

Prior classes

Data needed

Hit the “Data” link above or use the direct links below to download the following datasets, saving them in a data folder relative to your current working RStudio project:

Heart rate data: heartrate2dose.csv
Tidy yeast data: brauer2007_tidy.csv
Original (untidy) yeast data: brauer2007_messy.csv
Yeast systematic names to GO terms: brauer2007_sysname2go.csv

Tidy data

So far we’ve dealt exclusively with tidy data – data that’s easy to work with, manipulate, and visualize. That’s because our dataset has two key properties:

Each column is a variable.
Each row is an observation.

You can read a lot more about tidy data in this paper. Let’s load some untidy data and see if we can see the difference. This is some made-up data for five different patients (Jon, Ann, Bill, Kate, and Joe) given three different drugs (A, B, and C), at two doses (10 and 20), and measuring their heart rate. Download the heartrate2dose.csv file directly from the data downloads page. Load readr and dplyr, and import and display the data.

library(readr)
library(dplyr)
hr <- read_csv("data/heartrate2dose.csv")
hr

## # A tibble: 5 x 7
##   name   a_10  a_20  b_10  b_20  c_10  c_20
##   <chr> <int> <int> <int> <int> <int> <int>
## 1 jon      60    55    65    60    70    70
## 2 ann      65    60    70    65    75    75
## 3 bill     70    65    75    70    80    80
## 4 kate     75    70    80    75    85    85
## 5 joe      80    75    85    80    90    90

Notice how with the yeast data each variable (symbol, nutrient, rate, expression, etc.) were each in their own column. In this heart rate data, we have four variables: name, drug, dose, and heart rate. Name is in a column, but drug is in the header row. Furthermore the drug and dose are tied together in the same column, and the heart rate is scattered around the entire table. If we wanted to do things like filter the dataset where drug=="a" or dose==20 or heartrate>=80 we couldn’t do it because these variables aren’t in columns.

The tidyr package

The tidyr package helps with this. There are several functions in the tidyr package but the ones we’re going to use are separate() and gather(). The gather() function takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer. The separate() function separates one column into multiple columns. So, what we need to do is gather all the drug/dose data into a column with their corresponding heart rate, and then separate that column into two separate columns for the drug and dose.

Before we get started, load the tidyr package, and look at the help pages for ?gather and ?separate. Notice how each of these functions takes a data frame as input and returns a data frame as output. Thus, we can pipe from one function to the next.

library(tidyr)

`gather()`

The help for ?gather tells us that we first pass in a data frame (or omit the first argument, and pipe in the data with %>%). The next two arguments are the names of the key and value columns to create, and all the relevant arguments that come after that are the columns we want to gather together. Here’s one way to do it.

hr %>% gather(key=drugdose, value=hr, a_10, a_20, b_10, b_20, c_10, c_20)

## # A tibble: 30 x 3
##    name  drugdose    hr
##    <chr> <chr>    <int>
##  1 jon   a_10        60
##  2 ann   a_10        65
##  3 bill  a_10        70
##  4 kate  a_10        75
##  5 joe   a_10        80
##  6 jon   a_20        55
##  7 ann   a_20        60
##  8 bill  a_20        65
##  9 kate  a_20        70
## 10 joe   a_20        75
## # ... with 20 more rows

But that gets cumbersome to type all those names. What if we had 100 drugs and 3 doses of each? There are two other ways of specifying which columns to gather. The help for ?gather tells you how to do this:

... Specification of columns to gather. Use bare variable names. Select all variables between x and z with x:z, exclude y with -y. For more options, see the select documentation.

So, we could accomplish the same thing by doing this:

hr %>% gather(key=drugdose, value=hr, a_10:c_20)

## # A tibble: 30 x 3
##    name  drugdose    hr
##    <chr> <chr>    <int>
##  1 jon   a_10        60
##  2 ann   a_10        65
##  3 bill  a_10        70
##  4 kate  a_10        75
##  5 joe   a_10        80
##  6 jon   a_20        55
##  7 ann   a_20        60
##  8 bill  a_20        65
##  9 kate  a_20        70
## 10 joe   a_20        75
## # ... with 20 more rows

But what if we didn’t know the drug names or doses, but we did know that the only other column in there that we don’t want to gather is name?

hr %>% gather(key=drugdose, value=hr, -name)

## # A tibble: 30 x 3
##    name  drugdose    hr
##    <chr> <chr>    <int>
##  1 jon   a_10        60
##  2 ann   a_10        65
##  3 bill  a_10        70
##  4 kate  a_10        75
##  5 joe   a_10        80
##  6 jon   a_20        55
##  7 ann   a_20        60
##  8 bill  a_20        65
##  9 kate  a_20        70
## 10 joe   a_20        75
## # ... with 20 more rows

`separate()`

Finally, look at the help for ?separate. We can pipe in data and omit the first argument. The second argument is the column to separate; the into argument is a character vector of the new column names, and the sep argument is a character used to separate columns, or a number indicating the position to split at.

Side note, and 60-second lesson on vectors: We can create arbitrary-length vectors, which are simply variables that contain an arbitrary number of values. To create a numeric vector, try this: c(5, 42, 22908). That creates a three element vector. Try c("cat", "dog").

hr %>% 
  gather(key=drugdose, value=hr, -name) %>% 
  separate(drugdose, into=c("drug", "dose"), sep="_")

## # A tibble: 30 x 4
##    name  drug  dose     hr
##    <chr> <chr> <chr> <int>
##  1 jon   a     10       60
##  2 ann   a     10       65
##  3 bill  a     10       70
##  4 kate  a     10       75
##  5 joe   a     10       80
##  6 jon   a     20       55
##  7 ann   a     20       60
##  8 bill  a     20       65
##  9 kate  a     20       70
## 10 joe   a     20       75
## # ... with 20 more rows

`%>%` it all together

Let’s put it all together with gather %>% separate %>% filter %>% group_by %>% summarize.

If we create a new data frame that’s a tidy version of hr, we can do those kinds of manipulations we talked about before:

# Create a new data.frame
hrtidy <- hr %>% 
  gather(key=drugdose, value=hr, -name) %>% 
  separate(drugdose, into=c("drug", "dose"), sep="_")

# Optionally, view it
# View(hrtidy)

# filter
hrtidy %>% filter(drug=="a")

## # A tibble: 10 x 4
##    name  drug  dose     hr
##    <chr> <chr> <chr> <int>
##  1 jon   a     10       60
##  2 ann   a     10       65
##  3 bill  a     10       70
##  4 kate  a     10       75
##  5 joe   a     10       80
##  6 jon   a     20       55
##  7 ann   a     20       60
##  8 bill  a     20       65
##  9 kate  a     20       70
## 10 joe   a     20       75

hrtidy %>% filter(dose==20)

## # A tibble: 15 x 4
##    name  drug  dose     hr
##    <chr> <chr> <chr> <int>
##  1 jon   a     20       55
##  2 ann   a     20       60
##  3 bill  a     20       65
##  4 kate  a     20       70
##  5 joe   a     20       75
##  6 jon   b     20       60
##  7 ann   b     20       65
##  8 bill  b     20       70
##  9 kate  b     20       75
## 10 joe   b     20       80
## 11 jon   c     20       70
## 12 ann   c     20       75
## 13 bill  c     20       80
## 14 kate  c     20       85
## 15 joe   c     20       90

hrtidy %>% filter(hr>=80)

## # A tibble: 10 x 4
##    name  drug  dose     hr
##    <chr> <chr> <chr> <int>
##  1 joe   a     10       80
##  2 kate  b     10       80
##  3 joe   b     10       85
##  4 joe   b     20       80
##  5 bill  c     10       80
##  6 kate  c     10       85
##  7 joe   c     10       90
##  8 bill  c     20       80
##  9 kate  c     20       85
## 10 joe   c     20       90

# analyze
hrtidy %>%
  filter(name!="joe") %>% 
  group_by(drug, dose) %>%
  summarize(meanhr=mean(hr))

## # A tibble: 6 x 3
## # Groups:   drug [?]
##   drug  dose  meanhr
##   <chr> <chr>  <dbl>
## 1 a     10      67.5
## 2 a     20      62.5
## 3 b     10      72.5
## 4 b     20      67.5
## 5 c     10      77.5
## 6 c     20      77.5

Tidy the yeast data

Now, let’s take a look at the yeast data again. The data we’ve been working with up to this point was already cleaned up to a good degree. All of our variables (symbol, nutrient, rate, expression, GO terms, etc.) were each in their own column. Make sure you have the necessary libraries loaded, and read in the tidy data once more into an object called ydat.

# Load libraries
library(readr)
library(dplyr)
library(tidyr)

# Import data
ydat <- read_csv("data/brauer2007_tidy.csv")

# Optionally, View
# View(ydat)

# Or just display to the screen
ydat

## # A tibble: 198,430 x 7
##    symbol systematic_name nutrient   rate expression bp        mf         
##    <chr>  <chr>           <chr>     <dbl>      <dbl> <chr>     <chr>      
##  1 SFB2   YNL049C         Glucose  0.0500    -0.240  ER to Go… molecular …
##  2 <NA>   YNL095C         Glucose  0.0500     0.280  biologic… molecular …
##  3 QRI7   YDL104C         Glucose  0.0500    -0.0200 proteoly… metalloend…
##  4 CFT2   YLR115W         Glucose  0.0500    -0.330  mRNA pol… RNA binding
##  5 SSO2   YMR183C         Glucose  0.0500     0.0500 vesicle … t-SNARE ac…
##  6 PSP2   YML017W         Glucose  0.0500    -0.690  biologic… molecular …
##  7 RIB2   YOL066C         Glucose  0.0500    -0.550  riboflav… pseudourid…
##  8 VMA13  YPR036W         Glucose  0.0500    -0.750  vacuolar… hydrogen-t…
##  9 EDC3   YEL015W         Glucose  0.0500    -0.240  deadenyl… molecular …
## 10 VPS5   YOR069W         Glucose  0.0500    -0.160  protein … protein tr…
## # ... with 198,420 more rows

But let’s take a look to see what this data originally looked like.

yorig <- read_csv("data/brauer2007_messy.csv")
# View(yorig)
yorig

## # A tibble: 5,536 x 40
##    GID    YORF   NAME      GWEIGHT   G0.05    G0.1   G0.15    G0.2   G0.25
##    <chr>  <chr>  <chr>       <int>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 GENE1… A_06_… SFB2::YN…       1 -0.240  -0.130  -0.210  -0.150  -0.0500
##  2 GENE4… A_06_… NA::YNL0…       1  0.280   0.130  -0.400  -0.480  -0.110 
##  3 GENE4… A_06_… QRI7::YD…       1 -0.0200 -0.270  -0.270  -0.0200  0.240 
##  4 GENE1… A_06_… CFT2::YL…       1 -0.330  -0.410  -0.240  -0.0300 -0.0300
##  5 GENE5… A_06_… SSO2::YM…       1  0.0500  0.0200  0.400   0.340  -0.130 
##  6 GENE2… A_06_… PSP2::YM…       1 -0.690  -0.0300  0.230   0.200   0     
##  7 GENE1… A_06_… RIB2::YO…       1 -0.550  -0.300  -0.120  -0.0300 -0.160 
##  8 GENE5… A_06_… VMA13::Y…       1 -0.750  -0.120  -0.0700  0.0200 -0.320 
##  9 GENE2… A_06_… EDC3::YE…       1 -0.240  -0.220   0.140   0.0600  0     
## 10 GENE2… A_06_… VPS5::YO…       1 -0.160  -0.380   0.0500  0.140  -0.0400
## # ... with 5,526 more rows, and 31 more variables: G0.3 <dbl>,
## #   N0.05 <dbl>, N0.1 <dbl>, N0.15 <dbl>, N0.2 <dbl>, N0.25 <dbl>,
## #   N0.3 <dbl>, P0.05 <dbl>, P0.1 <dbl>, P0.15 <dbl>, P0.2 <dbl>,
## #   P0.25 <dbl>, P0.3 <dbl>, S0.05 <dbl>, S0.1 <dbl>, S0.15 <dbl>,
## #   S0.2 <dbl>, S0.25 <dbl>, S0.3 <dbl>, L0.05 <dbl>, L0.1 <dbl>,
## #   L0.15 <dbl>, L0.2 <dbl>, L0.25 <dbl>, L0.3 <dbl>, U0.05 <dbl>,
## #   U0.1 <dbl>, U0.15 <dbl>, U0.2 <dbl>, U0.25 <dbl>, U0.3 <dbl>

There are several issues here.

Multiple variables are stored in one column. The NAME column contains lots of information, split up by ::’s.
Nutrient and rate variables are stuck in column headers. That is, the column names contain the values of two variables: nutrient (G, N, P, S, L, U) and growth rate (0.05-0.3). Remember, with tidy data, each column is a variable and each row is an observation. Here, we have not one observation per row, but 36 (6 nutrients \(\times\) 6 rates)! There’s no way we could filter this data by a certain nutrient, or try to calculate statistics between rate and expression.
Expression values are scattered throughout the table. Related to the problem above, and just like our heart rate example, expression isn’t a single-column variable as in the cleaned tidy data, but it’s scattered around these 36 columns.
Other important information is in a separate table. We’re missing all the gene ontology information we had in the tidy data (no information about biological process (bp) or molecular function (mf)).

Let’s tackle these issues one at a time, all on a %>% pipeline.

`separate()` the `NAME`

Let’s separate() the NAME column into multiple different variables. The first row looks like this:

SFB2::YNL049C::1082129

That is, it looks like we’ve got the gene symbol, the systematic name, and some other number (that isn’t discussed in the paper). Let’s separate()!

yorig %>% 
  separate(NAME, into=c("symbol", "systematic_name", "somenumber"), sep="::")

## # A tibble: 5,536 x 42
##    GID    YORF   symbol systematic_name somenumber GWEIGHT   G0.05    G0.1
##    <chr>  <chr>  <chr>  <chr>           <chr>        <int>   <dbl>   <dbl>
##  1 GENE1… A_06_… SFB2   YNL049C         1082129          1 -0.240  -0.130 
##  2 GENE4… A_06_… NA     YNL095C         1086222          1  0.280   0.130 
##  3 GENE4… A_06_… QRI7   YDL104C         1085955          1 -0.0200 -0.270 
##  4 GENE1… A_06_… CFT2   YLR115W         1081958          1 -0.330  -0.410 
##  5 GENE5… A_06_… SSO2   YMR183C         1081214          1  0.0500  0.0200
##  6 GENE2… A_06_… PSP2   YML017W         1083036          1 -0.690  -0.0300
##  7 GENE1… A_06_… RIB2   YOL066C         1081766          1 -0.550  -0.300 
##  8 GENE5… A_06_… VMA13  YPR036W         1086860          1 -0.750  -0.120 
##  9 GENE2… A_06_… EDC3   YEL015W         1082963          1 -0.240  -0.220 
## 10 GENE2… A_06_… VPS5   YOR069W         1083389          1 -0.160  -0.380 
## # ... with 5,526 more rows, and 34 more variables: G0.15 <dbl>,
## #   G0.2 <dbl>, G0.25 <dbl>, G0.3 <dbl>, N0.05 <dbl>, N0.1 <dbl>,
## #   N0.15 <dbl>, N0.2 <dbl>, N0.25 <dbl>, N0.3 <dbl>, P0.05 <dbl>,
## #   P0.1 <dbl>, P0.15 <dbl>, P0.2 <dbl>, P0.25 <dbl>, P0.3 <dbl>,
## #   S0.05 <dbl>, S0.1 <dbl>, S0.15 <dbl>, S0.2 <dbl>, S0.25 <dbl>,
## #   S0.3 <dbl>, L0.05 <dbl>, L0.1 <dbl>, L0.15 <dbl>, L0.2 <dbl>,
## #   L0.25 <dbl>, L0.3 <dbl>, U0.05 <dbl>, U0.1 <dbl>, U0.15 <dbl>,
## #   U0.2 <dbl>, U0.25 <dbl>, U0.3 <dbl>

Now, let’s select() out the stuff we don’t want.

yorig %>% 
  separate(NAME, into=c("symbol", "systematic_name", "somenumber"), sep="::") %>% 
  select(-GID, -YORF, -somenumber, -GWEIGHT)

## # A tibble: 5,536 x 38
##    symbol systematic_name   G0.05    G0.1   G0.15    G0.2   G0.25    G0.3
##    <chr>  <chr>             <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 SFB2   YNL049C         -0.240  -0.130  -0.210  -0.150  -0.0500 -0.0500
##  2 NA     YNL095C          0.280   0.130  -0.400  -0.480  -0.110   0.170 
##  3 QRI7   YDL104C         -0.0200 -0.270  -0.270  -0.0200  0.240   0.250 
##  4 CFT2   YLR115W         -0.330  -0.410  -0.240  -0.0300 -0.0300  0     
##  5 SSO2   YMR183C          0.0500  0.0200  0.400   0.340  -0.130  -0.140 
##  6 PSP2   YML017W         -0.690  -0.0300  0.230   0.200   0      -0.270 
##  7 RIB2   YOL066C         -0.550  -0.300  -0.120  -0.0300 -0.160  -0.110 
##  8 VMA13  YPR036W         -0.750  -0.120  -0.0700  0.0200 -0.320  -0.410 
##  9 EDC3   YEL015W         -0.240  -0.220   0.140   0.0600  0      -0.130 
## 10 VPS5   YOR069W         -0.160  -0.380   0.0500  0.140  -0.0400 -0.0100
## # ... with 5,526 more rows, and 30 more variables: N0.05 <dbl>,
## #   N0.1 <dbl>, N0.15 <dbl>, N0.2 <dbl>, N0.25 <dbl>, N0.3 <dbl>,
## #   P0.05 <dbl>, P0.1 <dbl>, P0.15 <dbl>, P0.2 <dbl>, P0.25 <dbl>,
## #   P0.3 <dbl>, S0.05 <dbl>, S0.1 <dbl>, S0.15 <dbl>, S0.2 <dbl>,
## #   S0.25 <dbl>, S0.3 <dbl>, L0.05 <dbl>, L0.1 <dbl>, L0.15 <dbl>,
## #   L0.2 <dbl>, L0.25 <dbl>, L0.3 <dbl>, U0.05 <dbl>, U0.1 <dbl>,
## #   U0.15 <dbl>, U0.2 <dbl>, U0.25 <dbl>, U0.3 <dbl>

`gather()` the data

Let’s gather the data from wide to long format so we get nutrient/rate (key) and expression (value) in their own columns.

yorig %>% 
  separate(NAME, into=c("symbol", "systematic_name", "somenumber"), sep="::") %>% 
  select(-GID, -YORF, -somenumber, -GWEIGHT) %>% 
  gather(key=nutrientrate, value=expression, G0.05:U0.3)

## # A tibble: 199,296 x 4
##    symbol systematic_name nutrientrate expression
##    <chr>  <chr>           <chr>             <dbl>
##  1 SFB2   YNL049C         G0.05           -0.240 
##  2 NA     YNL095C         G0.05            0.280 
##  3 QRI7   YDL104C         G0.05           -0.0200
##  4 CFT2   YLR115W         G0.05           -0.330 
##  5 SSO2   YMR183C         G0.05            0.0500
##  6 PSP2   YML017W         G0.05           -0.690 
##  7 RIB2   YOL066C         G0.05           -0.550 
##  8 VMA13  YPR036W         G0.05           -0.750 
##  9 EDC3   YEL015W         G0.05           -0.240 
## 10 VPS5   YOR069W         G0.05           -0.160 
## # ... with 199,286 more rows

And while we’re at it, let’s separate() that newly created key column. Take a look at the help for ?separate again. The sep argument could be a delimiter or a number position to split at. Let’s split after the first character. While we’re at it, let’s hold onto this intermediate data frame before we add gene ontology information. Call it ynogo.

ynogo <- yorig %>% 
  separate(NAME, into=c("symbol", "systematic_name", "somenumber"), sep="::") %>% 
  select(-GID, -YORF, -somenumber, -GWEIGHT) %>% 
  gather(key=nutrientrate, value=expression, G0.05:U0.3) %>% 
  separate(nutrientrate, into=c("nutrient", "rate"), sep=1)

`inner_join()` to GO

It’s rare that a data analysis involves only a single table of data. You normally have many tables that contribute to an analysis, and you need flexible tools to combine them. The dplyr package has several tools that let you work with multiple tables at once. Do a Google image search for “SQL Joins”, and look at RStudio’s Data Wrangling Cheat Sheet to learn more.

First, let’s import the dataset that links the systematic name to gene ontology information. It’s the brauer2007_sysname2go.csv file available at the data downloads page. Let’s call the imported data frame sn2go.

# Import the data
sn2go <- read_csv("data/brauer2007_sysname2go.csv")

# Take a look
# View(sn2go)
head(sn2go)

## # A tibble: 6 x 3
##   systematic_name bp                           mf                         
##   <chr>           <chr>                        <chr>                      
## 1 YNL049C         ER to Golgi transport        molecular function unknown 
## 2 YNL095C         biological process unknown   molecular function unknown 
## 3 YDL104C         proteolysis and peptidolysis metalloendopeptidase activ…
## 4 YLR115W         mRNA polyadenylylation*      RNA binding                
## 5 YMR183C         vesicle fusion*              t-SNARE activity           
## 6 YML017W         biological process unknown   molecular function unknown

Now, look up some help for ?inner_join. Inner join will return a table with all rows from the first table where there are matching rows in the second table, and returns all columns from both tables. Let’s give this a try.

yjoined <- inner_join(ynogo, sn2go, by="systematic_name")
# View(yjoined)
yjoined

## # A tibble: 199,296 x 7
##    symbol systematic_name nutrient rate  expression bp         mf         
##    <chr>  <chr>           <chr>    <chr>      <dbl> <chr>      <chr>      
##  1 SFB2   YNL049C         G        0.05     -0.240  ER to Gol… molecular …
##  2 NA     YNL095C         G        0.05      0.280  biologica… molecular …
##  3 QRI7   YDL104C         G        0.05     -0.0200 proteolys… metalloend…
##  4 CFT2   YLR115W         G        0.05     -0.330  mRNA poly… RNA binding
##  5 SSO2   YMR183C         G        0.05      0.0500 vesicle f… t-SNARE ac…
##  6 PSP2   YML017W         G        0.05     -0.690  biologica… molecular …
##  7 RIB2   YOL066C         G        0.05     -0.550  riboflavi… pseudourid…
##  8 VMA13  YPR036W         G        0.05     -0.750  vacuolar … hydrogen-t…
##  9 EDC3   YEL015W         G        0.05     -0.240  deadenyly… molecular …
## 10 VPS5   YOR069W         G        0.05     -0.160  protein r… protein tr…
## # ... with 199,286 more rows

# The glimpse function makes it possible to see a little bit of everything in your data.
glimpse(yjoined)

## Observations: 199,296
## Variables: 7
## $ symbol          <chr> "SFB2", "NA", "QRI7", "CFT2", "SSO2", "PSP2", ...
## $ systematic_name <chr> "YNL049C", "YNL095C", "YDL104C", "YLR115W", "Y...
## $ nutrient        <chr> "G", "G", "G", "G", "G", "G", "G", "G", "G", "...
## $ rate            <chr> "0.05", "0.05", "0.05", "0.05", "0.05", "0.05"...
## $ expression      <dbl> -0.24, 0.28, -0.02, -0.33, 0.05, -0.69, -0.55,...
## $ bp              <chr> "ER to Golgi transport", "biological process u...
## $ mf              <chr> "molecular function unknown", "molecular funct...

There are many different kinds of two-table verbs/joins in dplyr. In this example, every systematic name in ynogo had a corresponding entry in sn2go, but if this weren’t the case, those un-annotated genes would have been removed entirely by the inner_join. A left_join would have returned all the rows in ynogo, but would have filled in bp and mf with missing values (NA) when there was no corresponding entry. See also: right_join, semi_join, and anti_join.

Finishing touches

We’re almost there but we have an obvious discrepancy in the number of rows between yjoined and ydat.

nrow(yjoined)

## [1] 199296

nrow(ydat)

## [1] 198430

Before we can figure out what rows are different, we need to make sure all of the columns are the same class and do something more miscellaneous cleanup.

In particular:

Convert rate to a numeric column
Make sure NA values are coded properly
Create (and merge) values to convert “G” to “Glucose”, “L” to “Leucine”, etc.
Rename and reorder columns

The code below implements those operations on yjoined.

nutrientlookup <-
  data_frame(nutrient = c("G", "L", "N", "P", "S", "U"), nutrientname = c("Glucose", "Leucine", "Ammonia","Phosphate", "Sulfate","Uracil"))

yjoined <-
  yjoined %>%
  mutate(rate = as.numeric(rate)) %>%
  mutate(symbol = ifelse(symbol == "NA", NA, symbol)) %>%
  left_join(nutrientlookup) %>%
  select(-nutrient) %>%
  select(symbol:systematic_name, nutrient = nutrientname, rate:mf)

Now we can determine what rows are different between yjoined and ydat using anti_join, which will return all of the rows that do not match.

anti_join(yjoined, ydat)

## # A tibble: 866 x 7
##    symbol systematic_name nutrient   rate expression bp         mf        
##    <chr>  <chr>           <chr>     <dbl>      <dbl> <chr>      <chr>     
##  1 <NA>   YLL030C         Glucose  0.0500         NA <NA>       <NA>      
##  2 <NA>   YOR050C         Glucose  0.0500         NA <NA>       <NA>      
##  3 <NA>   YPR039W         Glucose  0.0500         NA <NA>       <NA>      
##  4 <NA>   YOL013W-B       Glucose  0.0500         NA <NA>       <NA>      
##  5 HXT12  YIL170W         Glucose  0.0500         NA biologica… molecular…
##  6 <NA>   YPR013C         Glucose  0.0500         NA biologica… molecular…
##  7 <NA>   YOR314W         Glucose  0.0500         NA <NA>       <NA>      
##  8 <NA>   YJL064W         Glucose  0.0500         NA <NA>       <NA>      
##  9 <NA>   YPR136C         Glucose  0.0500         NA <NA>       <NA>      
## 10 <NA>   YDR015C         Glucose  0.0500         NA <NA>       <NA>      
## # ... with 856 more rows

Hmmmm … so yjoined has some rows that have missing (NA) expression values. Let’s try removing those and then comparing the data frame contents one more time.

yjoined <-
  yjoined %>%
  filter(!is.na(expression))

nrow(yjoined)

## [1] 198430

nrow(ydat)

## [1] 198430

all.equal(ydat, yjoined)

## [1] TRUE

Looks like that did it!