Practical Data Science, Fall 2012

Data is the new oil. Data is a new class of economic asset. Those were the conclusions of the reports issued by the World Economic Forum at Davos in January 2011 and January 2012. Research published in 2011 by MIT economists shows that companies adopting “data-driven decision-making” achieved significant productivity gains over other firms. In industry, the hottest job these days is the Data Scientist. Data scientists combine technical and statistical skills, analytical thinking, and business acumen. One of the complaints about the data scientists trained in computer science departments is that they’re “just technical”, understanding algorithms well, but lacking important skills in problem formulation, evaluation, and analysis generally. On the other hand, those trained in business schools tend to have underdeveloped technical skills. This course will cover all of these aspects of being a data scientist.

This class is an introduction to the practice of data science. The student will leave the class with a broad set of practical data analytic skills based on building real analytic applications on real data. These skills include accessing and transferring data, applying various analytical frameworks, applying methods from machine learning and data mining, conducting large-scale rigorous evaluations with business goals in mind, and the understanding, visualization, and presentation of results. The student will have experience processing “big data,” the latest buzz concept in a field awash with buzz. Specifically, the student will be able to analyze data that are too big to fit in the computer’s memory, and therefore thwart many standard analytical tools. The student will have experience with unstructured data, for example processing text for applications such as “sentiment analysis” of user-generated content on the web.

Syllabus

Post Sandy Class Schedule

PDS: In-class code and homework solutions hosted on GitHub

Project Instructions

Relevant Content:

Basic Unix Shell Commands for the Data Scientist

Python: A Tool for the Practical Data Scientist

Installing Python and Data Science Libraries for Mac

Installing Python and Data Science Libraries for Windows

NYC Data Science Meetups


Class 1:

Objectives:

To go over the topics covered in this course, its philosophy, and the course policies. We motivate the importance of this course with a detailed example, web page classification, going over many of the steps and choices required to build and deploy a data-driven predictive system in the wild. This lecture concludes with a lab designed to get every student set up with Python and the libraries that we will use throughout this course.

Course Learning Objectives

Supplementary Programming Exercises

Homework 1

An Example Predictive System

Supplementary Material:

Data Scientist: The Sexiest Job of the 21st Century

A Python Cheat Sheet

Data Is Useless Without the Skills to Analyze It


Class 2:

Objectives:

To understand data, what it represents and how it is organized. We discuss the primitive components of data and some data structures used for collecting these individual elements. We then talk about how data is represented, discussing common data-representation schemes. After talking about data structures, we talk about unstructured and semi-structured data: text, web logs, and HTML. We discuss regular expressions: their syntax, their use for filtering and matching text, and their usefulness for extracting and replacing data in text.
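
As an illustration of using regular expressions to match, extract, and replace data in semi-structured text, here is a minimal sketch using Python's built-in re module. The log line, its layout, and the names LOG_PATTERN, parse_log_line, and mask_ip are assumptions made up for this example; they are not taken from the course materials.

```python
import re

# Hypothetical web-log line in an Apache-style layout (invented for illustration).
LOG_LINE = '203.0.113.7 - - [12/Oct/2012:14:03:55 -0400] "GET /products/42 HTTP/1.1" 200 5120'

# Named groups make the extracted fields self-describing.
LOG_PATTERN = re.compile(
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3})'            # client IP address
    r' \S+ \S+ '                                   # two fields we ignore
    r'\[(?P<timestamp>[^\]]+)\] '                  # timestamp inside square brackets
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '   # request line
    r'(?P<status>\d{3}) (?P<size>\d+)'             # status code and response size
)


def parse_log_line(line):
    """Extract fields from one log line; return None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def mask_ip(text):
    """Replace the last octet of anything that looks like an IPv4 address."""
    return re.sub(r'(\d{1,3}\.\d{1,3}\.\d{1,3})\.\d{1,3}', r'\1.xxx', text)


if __name__ == "__main__":
    print(parse_log_line(LOG_LINE))
    print(mask_ip(LOG_LINE))
```

Lines that do not match return None, so the same function doubles as a filter when applied to a whole log file.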

Details of Data: Components and Collections

E-Commerce ER Diagram

Representing Data: CSV, XML, JSON & YAML

Regular Expressions

Class 2 Lab Exercise

Homework 2


Class 3:

Objectives:

To understand database uses and technology: when databases are used and why, and how they represent the Entity-Relationship Diagram. We then turn to querying databases with SQL, covering the basic SELECT queries and all that is needed to perform rich analytical queries.
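
To make the idea of an analytical SELECT query concrete, here is a small sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a MySQL server; the customers and orders tables and their contents are hypothetical, loosely in the spirit of an e-commerce schema.

```python
import sqlite3


def customer_revenue():
    """Build a tiny in-memory database and run a join plus aggregation on it."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical two-table schema: each order belongs to one customer.
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            total REAL
        );
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
        INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.5), (3, 2, 12.0);
    """)
    # LEFT JOIN keeps customers with no orders; COALESCE turns their NULL sum into 0.
    query = """
        SELECT c.name,
               COUNT(o.id) AS n_orders,
               COALESCE(SUM(o.total), 0) AS revenue
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY revenue DESC, c.name
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for name, n_orders, revenue in customer_revenue():
        print(name, n_orders, revenue)
```

The SQL itself is standard enough that it should carry over to MySQL with little or no change, though that is not verified here.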

Relational Databases and SQL

Using MySQL Workbench

Homework 3

Supplementary Material:

HBR: Getting Control of Big Data

A Regular Expression Reference

Big Opportunities for Big Data Experts

Online Regular Expression Testing

A Visual Explanation of SQL Joins


Class 4:

Objectives:

Building a basic understanding of predictive modeling and the distinction between model training, evaluation, and use. Examples of target variables and independent variables. Web APIs and services. HTTP and RESTful technology. Using web services in programs in order to gather diverse data and perform interesting computations.
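
For the web-services part of these objectives, the sketch below shows the usual shape of a program that calls a RESTful API: build a URL with query parameters, issue an HTTP GET, and decode the JSON response. The endpoint https://api.example.com/v1/search, its parameters, and the "results"/"title" fields are all hypothetical; a real service will define its own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only to illustrate the request pattern.
BASE_URL = "https://api.example.com/v1/search"


def build_url(base_url, **params):
    """Append URL-encoded query parameters (sorted for a stable result)."""
    return base_url + "?" + urlencode(sorted(params.items()))


def fetch_json(url, timeout=10):
    """Issue an HTTP GET and decode the body as JSON (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_titles(payload):
    """Pull the 'title' of every item under the assumed 'results' key."""
    return [item["title"] for item in payload.get("results", [])]


if __name__ == "__main__":
    print(build_url(BASE_URL, q="data science", limit=5))
    # A canned response stands in for fetch_json(), since the endpoint is made up.
    sample = json.loads('{"results": [{"title": "First"}, {"title": "Second"}]}')
    print(extract_titles(sample))
```

Keeping URL construction and response parsing separate from the network call makes each piece easy to test without contacting the service.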

Web APIs

Assignment 1

Data Science API List

Supplementary Material:

Building RESTful Web Services in Python

Heroku: Cloud Application Deployment Made Easy


Class 5:

Objectives:

Big data! Learning about just what big data is, the scales of data involved, and the problems these scales present. What are some techniques for dealing with these challenges? Distributed file systems, Hadoop and MapReduce. Discussing implementation of distributed MapReduce tasks using Hadoop Streaming.
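
The MapReduce pattern can be sketched with the classic word count, written in the style Hadoop Streaming expects: the mapper emits tab-separated key/value lines and the reducer consumes them sorted by key. The function names and the run_local helper are assumptions for this example; run_local only imitates the shuffle-and-sort step on one machine and is not how Hadoop itself is invoked.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same key."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Imitate the framework locally: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_local(["big data", "Big Hadoop data"]):
        print(record)
```

Because the reducer only ever looks at adjacent records, it never needs the whole dataset in memory, which is the property that lets the same logic scale across a cluster.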

Big Data, Hadoop and MapReduce

Homework 4

Data Center Architecture

MapReduce

Supplementary Material:

Textbook: Mining of Massive Datasets

Textbook: Data-Intensive Text Processing with MapReduce

Cloudera Big Data Glossary

IBM: What is Big Data?

CERN Generating a Petabyte of Data Each Second

Facebook Ingests 500 Terabytes Every Day

Wordle: BigData

Class 6:

Objectives:

Guest speaker Dr. Jason Davis discusses experiences building and deploying data-driven systems. Topics include collecting data, web metrics, statistics, A/B testing, and the precision/recall tradeoff.
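
The precision/recall tradeoff can be seen by scoring the same predictions at two decision thresholds. This is a generic sketch, not material from the talk: the labels and scores are invented, and returning 1.0 when a denominator is zero is just one possible convention.

```python
def precision_recall(labels, scores, threshold):
    """Compute precision and recall when scores >= threshold are called positive."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y and p)      # true positives
    fp = sum(1 for y, p in zip(labels, predicted) if not y and p)  # false positives
    fn = sum(1 for y, p in zip(labels, predicted) if y and not p)  # false negatives
    # Assumed convention: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


if __name__ == "__main__":
    # Invented example data: 1 marks a relevant item, scores come from some model.
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
    for threshold in (0.75, 0.35):
        print(threshold, precision_recall(labels, scores, threshold))
```

Lowering the threshold in this toy data recovers the missed positive (higher recall) at the cost of admitting a false positive (lower precision).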

Big Data Science at Etsy

Planning, Running, and Analyzing Controlled Experiments on the Web (part 1) (part 2) (part 3)

Online Controlled Experiments: Introduction, Learnings, and Humbling Statistics

Class 7:

Objectives:

Data mining, and predictive modeling. Information gain and decision tree models. Linear models, support vector machines and logistic regression. Probability estimation using trees and linear models. Model evaluation, accuracy, ranking metrics, holdout testing. Model complexity and overfitting.
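
Information gain, the quantity a decision tree can use to choose a split, is the entropy of the labels minus the weighted entropy of the groups produced by a feature. Below is a minimal sketch of that calculation; the function names and the toy data are assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, feature_values):
    """Entropy of the labels minus the weighted entropy after splitting on a feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


if __name__ == "__main__":
    # Toy data: one feature separates the classes perfectly, the other not at all.
    labels = ["yes", "yes", "no", "no"]
    print(information_gain(labels, ["a", "a", "b", "b"]))
    print(information_gain(labels, ["a", "b", "a", "b"]))
```

A tree learner would evaluate this for every candidate feature and split on the one with the largest gain.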

Homework 5

Supplementary Material:

Parallelized Stochastic Gradient Descent

Class 8:

Objectives:

Guest speaker Chris Volinsky discusses data mining and machine learning research at AT&T. Applications include improving the understanding of urban environments and discovering social communication patterns. Chris also discusses his experiences building recommender systems to win the $1M Netflix Prize.

Recommender Systems and the Netflix Prize

Shaping Cities of the Future using Mobile Data

Assessment 2

Class 9:

Objectives:

Guest speaker Kristen Sosulski discusses data visualization, presenting techniques for telling stories with data and conveying a particular message to an intended audience. She talks about general considerations when constructing visualizations and gives concrete examples of creating visualizations using matplotlib in Python.

Hands on with Data Visualization in Python

Supplementary Material:

Interactive Data Visualization for the Web

mbostock: Awesome Visualizations using d3.js

d3.js tutorials

Stanford Data Visualization Course Notes

Class 10:

Objectives:

Guest speaker Troy Raeder discusses online advertising and using data science for targeting online display ads.

Online Targeted Display Advertising for Prospecting

Related Articles:

A Very Short History of Data Science

A Taxonomy of Data Science

The Future of Informatics

Three Sexy Skills of Data Geeks

The Unreasonable Effectiveness of Data

More Data Beats Better Algorithms -- Or Does It?

Mining of Massive Datasets

Code a Facebook App in 20 Minutes with Python

Probability and Statistics Cookbook

Meet the New Boss: Big Data

Visual Python Tutor

The Command Line Crash Course

Learn Linux the Hard Way