CHEM 4725/5725 | busta lab

Course Overview

In the natural sciences, accurate data analysis, visualization, statistical treatment, and communication are essential. This course focuses on analyzing growing volumes of data efficiently and communicating findings clearly. Using R, participants gain proficiency in tools for hypothesis generation, including data visualization and summarization of large datasets. The course also covers statistical tests and models for evaluating hypotheses, equipping students to answer specific analytical questions, and strategies for communicating results so that complex scientific data are conveyed clearly.

Course Details

Credits: 3
Meets: MWF 8:00-8:50 AM
Instructor: Lucas Busta
Email: bust0037@umn.edu

Learning Outcomes

Import, organize, and transform natural sciences datasets
Generate hypotheses by identifying trends using summary statistics, PCA, and clustering
Statistically evaluate hypotheses using ANOVA, regression, and machine learning models
Present findings in visual, written, and oral formats
Provide professional peer critique of technical work

module 1

Getting Started

Set up your R environment and get oriented with the course tools and workflow. We survey the landscape of bioanalytical data analysis before diving in.

Overview of bioanalytical data analysis
R and RStudio installation

module 2

Data Visualization

Learn to encode variables visually and build intuitive, interpretable representations of complex datasets. We explore the principles of good visualization and common pitfalls.

Data visualization I: fundamentals
Data visualization II: advanced encodings
Data visualization III: communication

module 3

Statistical Methods

Transform and summarize large datasets, find patterns through clustering and dimensionality reduction, and test hypotheses using parametric and non-parametric approaches.

Data wrangling and summaries
Hierarchical clustering
Dimensionality reduction (PCA)
Flat clustering (k-means)
Comparing means (ANOVA, Tukey HSD, Kruskal-Wallis)

module 4

Models

Quantify relationships and classify unknowns using regression and machine learning. Learn to evaluate model performance and interpret results in a scientific context.

Model use
Simple linear regression
Multiple linear regression
Assessing regression models
Random forests

module 5

Sequence Analysis

Extend data analysis skills to biological sequences. Analyze homology, build alignments, and construct and interpret phylogenies from DNA and protein data.

Homology and sequence databases
Sequence alignments
Phylogeny construction
Phylogenetic analysis

module 6

Language Models

Learn how language models transform text into numbers and generate new text. We use these tools to analyze scientific literature and produce automated text outputs.

Text embeddings (GloVe, transformers)
Embedding-based literature analysis
Generative text models

module 7

Protein Language Models

Apply the concepts of language modeling to protein sequences. We explore how protein language models learn biochemical grammar from sequence data and predict structure and function.

Protein sequence embeddings
ESM and related models
Structure and function prediction

capstone

Final Project

Each student selects a large dataset and applies course techniques to produce a scientific mini-manuscript, complete with abstract, introduction, results, discussion, and conclusion. A mid-semester Data Update presentation and a peer review of a classmate's work round out the experience.
