midterm

Load the wood smoke data by running source(), inspect the data (wood_smoke), then respond to the following prompts. Use themes, scales, etc. to make all of your plots professional and publication quality. Include a figure caption with each plot. If the prompt asks you to answer a question in the figure caption, your response should be a few sentences long. It does not need to be multiple paragraphs long.

  1. Create a plot with three subpanels. In the first, show which wood type, hardwood or softwood, has the highest abundance of compounds in its smoke (3 pts). Within that wood type’s smoke, which compound class is the most abundant (3 pts) and, still within that wood type’s smoke, what is the single most abundant compound (3 pts)? Show these two things in the second and third subpanels, respectively. For each of the comparisons you performed in this question, run a statistical analysis (3x 2 pts) to determine whether there are significant differences among the quantities being compared. Show the output of the statistical analyses in your plot (3x 1 pt). Write a detailed figure caption for your figure (2 pts). Points will be taken off if there are major visual issues with your plots and/or if text is illegible or overlapping.

Note: in the question above, you may need to filter for analytes with non-zero values when conducting certain statistical tests, for example:

```r
wood_smoke %>%
  filter(wood_type == "hardwood") %>%
  filter(abundance_mg_g > 0) %>%
  group_by(compound_class) %>%
  mutate(size = n()) %>%
  filter(size > 2) %>%
  shapiroTest(abundance_mg_g)
```

```
## # A tibble: 22 × 4
##    compound_class           variable      statistic        p
##    <chr>                    <chr>             <dbl>    <dbl>
##  1 alkanedioic_acids        abundance_mg…     0.754 4.28e- 5
##  2 alkanoic_acids           abundance_mg…     0.596 3.81e-10
##  3 alkenoic_acids           abundance_mg…     0.624 3.56e- 6
##  4 benzenediols             abundance_mg…     0.907 4.09e- 1
##  5 coumarins_and_flavonoids abundance_mg…     0.817 1.56e- 1
##  6 furans                   abundance_mg…     0.523 2.62e- 5
##  7 guaiacols                abundance_mg…     0.616 1.66e- 9
##  8 lignans                  abundance_mg…     0.610 2.00e- 5
##  9 methyl_alkanoates        abundance_mg…     0.780 1.08e- 3
## 10 methyl_alkenoates        abundance_mg…     0.787 8.48e- 2
## # ℹ 12 more rows
```

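
The two filters in the example above are likely there because a Shapiro-Wilk test cannot be computed for very small or constant groups. The sketch below illustrates this with base R's `shapiro.test()`, which requires between 3 and 5000 non-missing values that are not all identical. It assumes that `shapiroTest()` behaves similarly, which the example above does not confirm. The vectors and the helper `safe_shapiro_p()` are hypothetical and are not part of the wood smoke data.

```r
# Hypothetical abundance vectors, not taken from wood_smoke.
too_small <- c(0.12, 0.40)                    # only two observations
all_zero  <- c(0, 0, 0, 0)                    # an analyte that was never detected
usable    <- c(0.12, 0.40, 0.05, 1.30, 0.22)  # enough distinct values

# Hypothetical helper: return NA instead of stopping when the test cannot run.
safe_shapiro_p <- function(x) {
  tryCatch(shapiro.test(x)$p.value, error = function(e) NA_real_)
}

# One p value per vector; groups that cannot be tested come back as NA.
p_values <- sapply(
  list(too_small = too_small, all_zero = all_zero, usable = usable),
  safe_shapiro_p
)
print(p_values)
```

Groups that come back as `NA` here are the kind of groups that the `abundance_mg_g > 0` and `size > 2` filters are meant to remove before testing.
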
  2. Considering all the species together, regardless of wood type, what are the ten most abundant compounds in wood smoke? Communicate this via a plot that shows averages and standard deviations (9 pts). Run a statistical analysis to determine whether there are significant differences among the abundances of these compounds (7 pts). Show the output of the statistical analyses in your plot (2 pts). Write a detailed figure caption for your figure (2 pts). Points will be taken off if there are major visual issues with your plots and/or if text is illegible or overlapping.

  3. Create a summarized wood smoke data set that contains the sum abundance of each compound class for each species. Conduct a PCA of your summarized data set. Make a figure with three subpanels: a plot of where each species is located in a space defined by the first two principal components (a classical “pca” plot) (6 pts), a plot with ordination information on the compound class level (as opposed to on the single compound level) (6 pts), and a scree plot (6 pts). In your detailed figure caption, answer: How much of the overall variance in the data set is contained within the first two dimensions? (2 pts) Points will be taken off if there are major visual issues with your plots and/or if text is illegible or overlapping.

  4. Using your PCA from question 3, select four compound classes, two that are negatively correlated with each other along dimension 1 and two that are positively correlated with each other along dimension 2. Create a plot that shows the abundances of these four compound classes in each wood species (16 pts). In your figure caption, explain: are these abundances consistent with your understanding of PCA and ordination? Why or why not? (4 pts) Points will be taken off if there are major visual issues with your plots and/or if text is illegible or overlapping.

  5. Create a plot with three subpanels: (i) a dbscan-based clustering of your PCA output from question 3 (5 pts), (ii) a kmeans-based clustering of your PCA output from question 3 (5 pts), and (iii) a hierarchical clustering analysis of the raw data (5 pts). For the dbscan and kmeans analyses, you can choose the number of clusters to create. In your figure caption, compare and contrast the three methods of clustering. In the case of the wood smoke data, does one seem to be more useful than the others? Why or why not? (5 pts). Points will be taken off if there are major visual issues with your plots and/or if text is illegible or overlapping.

  6. Suppose that you work in a forensics lab. A suspect has been apprehended as part of a murder case. The suspect’s coat smells strongly of wood smoke, and it is known that a bonfire was burning at the scene of the crime. The fire chief reported that red oak was the type of wood the victim was burning at the scene of their murder, but the suspect claims that the smell on their coat is from a different bonfire, one that was burning at a party the suspect claims they were attending at the time of the murder. The fire chief investigated the place where the party took place and found a large supply of paper birch firewood. You extracted a segment of the suspect’s coat and analyzed it with LC-MS and GC-MS to obtain ‘unknown_smoke.csv’. Use that data and your analysis skills to provide a recommendation to the prosecutor in this case.
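
Note: the PCA question asks how much of the overall variance falls within the first two dimensions. As a reminder of what that quantity means, the sketch below computes it with base R's `prcomp()` on the built-in `USArrests` data set, which stands in for a summarized table here and is unrelated to the wood smoke data. This is one possible way to get the number, under the assumption that variables are centered and scaled; the functions used in class may report it differently.

```r
# USArrests is a built-in example data set: one row per sample,
# one numeric column per variable.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Each component's variance is its squared standard deviation.
# Dividing by the total gives the proportion of variance per component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Cumulative proportion of variance captured by the first two components.
first_two <- sum(var_explained[1:2])
print(round(var_explained, 3))
print(round(first_two, 3))
```

The per-component proportions in `var_explained` are the same values that a scree plot displays.
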