Introduction to Matrices and Graphs in Data Science

This connector will cover introductory topics in the mathematics of data science, focusing on discrete probability and linear algebra and the connections between them that are useful in modern theory and practice. We will focus on matrices and graphs as popular mathematical structures with which to model data. For example, they serve as models for term-document corpora, high-dimensional regression problems, ranking/classification of web data, adjacency properties of social network data, etc.

This course connects to the Foundations of Data Science course by providing a unified view of the mathematical methods that underlie the theoretical foundations of data science. Typically, these methods are taught from a statistical perspective, a computer science perspective, or a purely mathematical perspective. This connector course will examine in more detail several key ideas that are used in all these areas. These methods have rich mathematical underpinnings, and they are also widely useful in practice.

The course will cover some basic mathematics of discrete probability and linear algebra as well as ways they interact in data problems. We choose discrete rather than continuous probability because it is what is used in practice and because many of the basic results can be illustrated at the freshman level without advanced calculus, etc. Basic insights are developed without getting bogged down in details that matter in traditional numerical presentations of linear algebra but that matter less for data science. Later topics that use both discrete probability and linear algebra include simple geometric properties of high-dimensional spaces, simple random walks, and spectral methods for clustering, classification, and ranking, all of which have interesting mathematics and are widely used.
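
As a small illustration of how discrete probability and linear algebra meet on a graph, the sketch below runs a simple random walk on a tiny undirected graph. It is not taken from the course materials: the four-node graph, the helper name walk_distribution, and the step count are all assumptions chosen for illustration. It shows that the long-run distribution of the walk is proportional to node degree, which is the same thing as a left eigenvector of the transition matrix with eigenvalue 1.

```python
import numpy as np

# Adjacency matrix of a small, hypothetical undirected graph:
# a triangle on nodes 0, 1, 2 plus one extra edge between nodes 2 and 3.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Row-normalize to get the transition matrix of the simple random walk:
# from each node, move to a uniformly random neighbor.
degrees = A.sum(axis=1)
P = A / degrees[:, None]


def walk_distribution(P, start, steps):
    """Distribution over nodes after `steps` steps, starting at node `start`."""
    p = np.zeros(P.shape[0])
    p[start] = 1.0
    for _ in range(steps):
        p = p @ P  # one step of the walk: multiply the row vector by P
    return p


# Long-run distribution of the walk started at node 0 (illustrative step count).
p_long = walk_distribution(P, start=0, steps=200)

# For a connected, non-bipartite undirected graph, the stationary
# distribution is proportional to the node degrees.
pi_degree = degrees / degrees.sum()

print("after 200 steps:     ", np.round(p_long, 4))
print("degree / sum(degree):", np.round(pi_degree, 4))
```

The graph here is connected and contains a triangle, so the walk converges from any starting node. On a bipartite graph, such as a single path, the distribution would oscillate rather than settle.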

Homework will combine pencil-and-paper problems with simulations/computations, and it will also form part of a midterm/final project in lieu of exams. The course builds on a rich set of examples and publicly available real and synthetic data sets to illustrate points, and the simulation/computation homework and projects will use these.