4 import and tidying

4.1 data import

To analyze data that is stored on your own computer, you first need to import it into RStudio.

The easiest way to do this is to use the interactive command readCSV(), a function that comes with the phylochemistry source command. Run readCSV() in your console, then navigate to the data file on your hard drive in the window that opens.

Another option is to read the data in from a path. A path is essentially the street address of your data file on your computer’s hard drive, and you will need to know it to use this approach. Paths look different on Mac and PC.

  • On Mac: /Users/lucasbusta/Documents/sample_data_set.csv (note the forward slashes!)
  • On PC: C:\\My Computer\\Documents\\sample_data_set.csv (note the double backslashes! R treats a single backslash inside quotes as an escape character, so each one has to be doubled)
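
To make the slash rules a bit more concrete, here is a minimal sketch of how the two example paths above behave as R strings. The variable names (mac_path, pc_path, pc_path_forward) are made up for illustration, and the files themselves almost certainly do not exist on your computer, so substitute the location of your own file.

```r
# Example paths from the text; the variable names are hypothetical.
mac_path <- "/Users/lucasbusta/Documents/sample_data_set.csv"
pc_path  <- "C:\\My Computer\\Documents\\sample_data_set.csv"

# Inside an R string, "\\" stands for one backslash character,
# so writeLines() shows the path with single backslashes.
writeLines(pc_path)

# R also accepts forward slashes in Windows paths, which avoids the doubling.
pc_path_forward <- "C:/My Computer/Documents/sample_data_set.csv"

# file.exists() returns TRUE only if a file really is at that location,
# which is a quick way to catch typos before importing anything.
file.exists(mac_path)
```

If file.exists() returns FALSE for your own path, the usual suspects are a typo, a missing file extension, or slashes pointing the wrong way.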

You can quickly find paths to files via the following:

  • On Mac: Locate the file in Finder. Right-click on the file, hold the Option key, then click “Copy as Pathname”
  • On PC: Locate the file in Windows Explorer. Hold down the Shift key then right-click on the file. Click “Copy As Path”

With these paths, we can read in data using the read_csv() function. We’ll run read_csv("<path_to_your_data>"). Note the quotes around the path! They are necessary because the path has to be passed to the function as text. Also make sure your path uses the appropriate direction of slashes for your operating system.
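
As a rough sketch of the whole import step, the example below first writes a tiny, made-up table to a temporary file and then reads it back in with read_csv(). The file contents and the names example_path and my_data are assumptions for illustration only. With your own data you would skip the writing step and put your own path in the quotes.

```r
library(readr)

# Write a tiny, made-up table to a temporary file so the example is self-contained.
# In practice, the path would point to your own data file instead.
example_path <- file.path(tempdir(), "sample_data_set.csv")
write_csv(
  data.frame(fatty_acid = c("FA_1", "FA_2"), abundance = c(3.4, 6.2)),
  example_path
)

# The path is passed as a quoted character string.
# Assigning the result with <- keeps the imported table around for later steps.
my_data <- read_csv(example_path)
my_data
```

Storing the result in an object (here my_data) rather than just printing it means the table can be handed to other functions afterwards.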

4.2 tidy data

When we make data tables by hand, it’s often easy to make a wide-style table like the following. In it, the abundances of 7 different fatty acids in 10 different species are tabulated. Each fatty acid gets its own row, and each species gets its own column.

fadb_sample
## # A tibble: 7 × 11
##   fatty_acid Agonandra_brasi… Agonandra_silva… Agonandra_excel… Heisteria_silvi…
##   <chr>                 <dbl>            <dbl>            <dbl>            <dbl>
## 1 Hexadecan…              3.4              1                1.2              2.9
## 2 Octadecan…              6.2              0.1              0.4              0.1
## 3 Eicosanoi…              4.7              3.5              1.7              0.1
## 4 Docosanoi…             77.4              0.4              1                7.4
## 5 Tetracosa…              1.4              1                1.4              1.7
## 6 Hexacosan…              1.9             12.6             23.1             46.6
## 7 Octacosan…              5               81.4             71.3             41.2
## # … with 6 more variables: Malania_oleifera <dbl>, Ximenia_americana <dbl>,
## #   Ongokea_gore <dbl>, Comandra_pallida <dbl>, Buckleya_distichophylla <dbl>,
## #   Nuytsia_floribunda <dbl>

While this format is very nice for filling in by hand (such as in a lab notebook or similar), it does not groove with ggplot and other tidyverse functions very well. We need to convert it into a long-style table. This is done using pivot_longer(). You can think of this function as turning your data’s column names (or some of them) into one new variable and the values in your data matrix (in this case, the measurements) into another, so that each ends up in its own column. We can do this for our fatty acid dataset using the command below. In it, we specify the data we want to transform (data = fadb_sample), the columns we want to transform (cols = 2:11), the name of the new variable that will contain the old column names (names_to = "plant_species"), and the name of the new variable that will contain the matrix values (values_to = "relative_abundance"). All together now:

pivot_longer(data = fadb_sample, cols = 2:11, names_to = "plant_species", values_to = "relative_abundance")
## # A tibble: 70 × 3
##    fatty_acid        plant_species           relative_abundance
##    <chr>             <chr>                                <dbl>
##  1 Hexadecanoic acid Agonandra_brasiliensis                 3.4
##  2 Hexadecanoic acid Agonandra_silvatica                    1  
##  3 Hexadecanoic acid Agonandra_excelsa                      1.2
##  4 Hexadecanoic acid Heisteria_silvianii                    2.9
##  5 Hexadecanoic acid Malania_oleifera                       0.7
##  6 Hexadecanoic acid Ximenia_americana                      3.3
##  7 Hexadecanoic acid Ongokea_gore                           1  
##  8 Hexadecanoic acid Comandra_pallida                       2.3
##  9 Hexadecanoic acid Buckleya_distichophylla                1.6
## 10 Hexadecanoic acid Nuytsia_floribunda                     3.8
## # … with 60 more rows

Brilliant! Now we have a tidy, long-style table that can be used with ggplot. Each row holds one measurement: 7 fatty acids × 10 species gives the 70 rows shown above.
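
If you would like to try the same transformation without the full dataset, here is a minimal sketch on a small, made-up table. The object names (toy_wide, toy_long), the labels FA_1, FA_2, species_A and so on, and the numbers are all invented for illustration. It also shows one possible alternative to cols = 2:11: selecting the columns by name with cols = -fatty_acid, which reads as "every column except fatty_acid".

```r
library(tibble)
library(tidyr)

# A small, made-up wide table: one row per fatty acid, one column per species.
toy_wide <- tibble(
  fatty_acid = c("FA_1", "FA_2"),
  species_A  = c(3.4, 6.2),
  species_B  = c(1.0, 0.1),
  species_C  = c(1.2, 0.4)
)

# cols = -fatty_acid selects every column except fatty_acid,
# so the call keeps working if species columns are added or reordered.
toy_long <- pivot_longer(
  data = toy_wide,
  cols = -fatty_acid,
  names_to = "plant_species",
  values_to = "relative_abundance"
)

# 2 fatty acids x 3 species gives 6 rows and 3 columns.
dim(toy_long)
```

Selecting by position (2:11) is quick to type, while selecting by name is a little more robust if the layout of the wide table changes later. Either should give the same long table here.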