Project Guidelines


The project reports will be due on Friday, August 10.

One of the key components of this class will be an independent project. This will allow you to practice using the tools we have learned in class, and gain some real-world experience working with data.

The focus of the project will be on retrieving, storing, querying, and analyzing data. In particular, you will:

  1. Identify a topic of interest.
  2. Acquire relevant datasets.
  3. Integrate these datasets into a database.
  4. Summarize/analyze/query the datasets.
  5. Produce charts/tables/visuals to illustrate your findings.
  6. Present your work in written and oral form.

You should aim to conduct exploratory data analysis, rather than building a statistical or predictive model. You can build a model if you have time at the end, but this is likely to be time consuming and is not the focus of our class.

Data

For this step, you will need to find an interesting source of data and collect the data that you need. You will want to create a database that can be used to store your data. You are encouraged to create an ER diagram first and then translate the ER diagram into a set of tables in MySQL.

Since the summer term is short, you should focus on acquiring data that is readily available and easy to work with; otherwise, you could run the risk of spending all of your time on data acquisition. You can gather data by:

  1. Scraping the web
  2. Querying an API
  3. Importing a CSV/Excel spreadsheet

A list of possible data sources includes:

Category Sample Data Sources
Misc Zillow
Yelp
MSR Open Data
Movement Google Maps
Citibike data
Music LastFM
Spotify Metadata API
Social media Twitter APIs
Facebook API
News NYTimes APIs
Politics and Government World Bank APIs
Sunlight Foundation
New York City Open Data

You can search a large range of APIs at:

You can find more resources in our GitHub repository.

Analysis

Load the data in the database and run at least 3 “interesting” SQL queries (including queries with joins and/or group by). Analyze the results of these queries (e.g., by retrieving the data from SQL using Python and visualizing the results using Pandas, or by doing a statistical analysis). Present at least 3 plots and/or tables. Describe the insights that you learned from this data analysis.

You can look for inspiration on the following sites. The analyses are generally much more sophisticated than what I expect for the final projects (i.e. many are produced by professional journalists), but you might get some ideas:

Deliverable

Create a slide presentation (~10 slides), with the description of your projects and your results. Also submit an iPython notebook with your code and technical details. The presentation should contain:

  1. The stated goal of the project. What are you trying to investigate? What are your hypotheses/expectations?
  2. A description of the data sources. What data do you use? How did you get the data - by downloading a CSV file, by using a web API, or by crawling?
  3. A description of the non-trivial manipulations that you performed on the data, if any.
  4. The database schema. How did you store the data?
  5. An explanation of the queries you ran against the data (no need to show the results), and the objective of each query.
  6. A presentation of the results of your data analysis, either as visualizations, or through some statistical tables.

Final checklist

The notebook should be used to generate a self-contained, replicable analysis. This means that:

  1. If possible, the notebook should contain code for downloading the datasets you used.
  2. If you build a SQL database, the notebook should contain code for creating the schema and uploading data in the database.
  3. The analysis should run beginning to end with a press of the play button.
  4. You should ensure that you eliminate all manual steps required in your analysis.
  5. Ensure that you use Python and SQL in your analysis. You are free to use R in addition to (but not as a replacement of) Python/SQL.
  6. Ensure that you use at least three SQL queries on your data, including GROUP BY and/or JOIN queries. Do not use Pandas as a replacement of the SQL queries. Pandas is extremely efficient in running SQL-like queries (in fact, it runs an in-memory SQL database as a backend) but our goal is to learn SQL.
  7. Ensure that you have at least three plots or tables in your final writeup.
  8. Include text in the notebook, so that your overall analysis is a readable report, not just code.