Course number: L&S 88-3
Instructor: Shishi Luo

Genomics is triggering a revolution in medical discovery. Students will explore genomic data, including HIV genomics, personal genomics, and DNA forensics, as well as related legal and ethical issues. Biology background not required.

In this connector course, we will interact with a variety of genomic datasets, with specific emphasis on HIV genomics, personal genomics, and DNA forensics. We will learn to use tools, both from the foundations course and introduced in this course, to perform exploratory analyses of genomic metadata and DNA sequence data. Students will also explore the legal and ethical issues concerning the collection and use of genetic data through readings of news articles. By the end of this course, students will be able to perform a quantitative analysis of variation in a gene or genomic region and communicate the results of this analysis effectively to a non-expert audience.

This connector course builds on the skills taught in Foundations of Data Analysis (Data 8) and introduces material specific to applying data science to genomic data. For example, students will apply their knowledge of tables and data visualization to explore quantitative characteristics of genomes, such as genome length, across different species. They will apply Bayes' rule to quantify the probability that a defendant is guilty given DNA evidence against them. In the module on personal genomics, they will be introduced to genome-wide association studies, which can be viewed as a high-dimensional version of multiple regression.
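As a rough illustration of the Bayes' rule calculation mentioned above, the following Python sketch computes a posterior probability from a prior and two likelihoods. The function name and all of the numbers (the prior, the match probabilities) are hypothetical values chosen for illustration; they are not taken from the course materials or from any real case.

```python
def posterior_probability(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return P(H | evidence) using Bayes' rule.

    prior                  -- P(H), probability of the hypothesis before the evidence
    p_evidence_given_h     -- P(evidence | H)
    p_evidence_given_not_h -- P(evidence | not H)
    """
    numerator = p_evidence_given_h * prior
    denominator = numerator + p_evidence_given_not_h * (1 - prior)
    return numerator / denominator


# Hypothetical inputs (assumptions for illustration only):
prior = 1 / 1000               # defendant is one of 1,000 equally likely sources
p_match_if_source = 1.0        # a true source always matches the DNA profile
p_match_if_not_source = 1e-6   # chance that an unrelated person matches

posterior = posterior_probability(prior, p_match_if_source, p_match_if_not_source)
print(f"P(defendant is the source | DNA match) = {posterior:.4f}")
```

Note that the quantity computed here is the probability that the defendant is the source of the DNA, which depends strongly on the assumed prior; moving from "source of the sample" to "guilty" requires further reasoning beyond this calculation.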

Genomic data are messier and less structured than the datasets students see in the foundations course. Thus, the connector will introduce some basic techniques for quantitatively characterizing such data. For example, the HIV genome is a string of about 9,000 characters, and each strain of the virus is slightly different. How does one compare HIV genomes from different infections? We will discuss the challenges of pre-processing sequence data and will learn how to use Hamming distance, the number of positions at which two sequences differ, as a basic measure of distance between two sequences of the same length. Given the challenge of working with nucleotide sequences directly, the connector will also introduce data structures other than character strings for representing sequence data.
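To make the idea of Hamming distance concrete, here is a minimal Python sketch that counts the positions at which two equal-length sequences differ. The second function shows k-mer counts as one possible alternative to a raw character string; this is an assumed example of such a representation, since the description does not say which data structures the course uses. The short sequences are made up for illustration and are not real HIV data.

```python
from collections import Counter


def hamming_distance(seq_a, seq_b):
    """Count positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq.

    An assumed example of a non-string representation of a sequence.
    """
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Made-up toy sequences (not real genomic data)
strain_1 = "ACGTACGT"
strain_2 = "ACGTTCGA"

print(hamming_distance(strain_1, strain_2))  # differs at positions 4 and 7 -> 2
print(kmer_counts("ACGTAC", 2))
```

Because Hamming distance is only defined for sequences of the same length, real sequences usually need to be aligned or trimmed first, which is part of the pre-processing challenge described above.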