Syllabus

Professor Katherine Hoffmann Pham khof [at] nyu [dot] edu
Lecture MW 1:00 - 4:10pm Tisch UC 24
Office Hours (Signup sheet) MW 4:15 - 5:15pm KMC 8-186

Course Description

This course is the recommended starting point for undergraduate students who are interested in working in the rapidly growing fields of data science and data analytics, or who are interested in acquiring the technical and data analysis skills that are becoming increasingly relevant in other disciplines such as finance and marketing. It will provide a basic introduction to programming and cover topics related to the collection, storage, organization, management, and analysis of data, both structured (record-based) and unstructured (such as text). These topics include:

  • Introduction to programming using Python
  • Data modeling and the ER model
  • Relational databases and SQL
  • Accessing data with web APIs
  • Basics of data visualization

We will work mostly with Python (including Pandas) and SQL, plus a few Unix tools that are useful for everyday data handling and processing. At the completion of this course, students will be able to:

  1. Write simple programs for a variety of data handling tasks (e.g., fetching data from the web, cleaning data, and so on)
  2. Retrieve and manage data coming in a variety of formats and from different sources
  3. Store and query data in relational databases
  4. Visualize and effectively present data

Requirements

The course does not have any prerequisites. However, since this is a hands-on class, you are expected to bring your (charged) laptop to every lecture. If you do not have access to a laptop that you can bring to class, please contact me ASAP. Attendance is strongly encouraged, since much of the lecture will involve interactive programming exercises.

Course notes and textbooks

Our primary resource will be a set of notes distributed as interactive IPython notebooks.

There is no required textbook, but the following are useful references:

Grading

Homeworks 3 x 5%
Exams 2 x 25%
Final Project 25%
Participation 10%

Course policies

You may submit homework late, but there is a grade penalty of 3% for each day after the deadline, and submissions are accepted at most 7 days late. Given the generous late submission policy, penalties are strictly enforced. Debugging your code can often take much longer than anticipated, so it’s best not to leave assignments until the last minute.
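
As a rough illustration of the late policy, here is a minimal Python sketch. It rests on assumptions the policy does not state: that the 3% is deducted as percentage points from a score out of 100, that a partial day counts as a full day, and that a submission more than 7 days late receives no credit. The function name and constants are hypothetical; please ask me if you are unsure how the penalty applies to your case.

```python
import math

PENALTY_PER_DAY = 3   # percentage points per late day (assumed interpretation)
MAX_DAYS_LATE = 7     # later submissions are assumed to receive no credit


def late_adjusted_score(raw_score, hours_late):
    """Return a homework score (out of 100) after the late penalty."""
    if hours_late <= 0:
        return raw_score
    # Assumption: any part of a day counts as a full late day.
    days_late = math.ceil(hours_late / 24)
    if days_late > MAX_DAYS_LATE:
        return 0.0
    return max(0.0, raw_score - PENALTY_PER_DAY * days_late)


# A raw score of 90 submitted 30 hours late counts as 2 late days.
print(late_adjusted_score(90, 30))
```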

Unless otherwise noted, we follow the default Stern Policies. I feel strongly about academic integrity, and have zero tolerance for copying or cheating. If you are unsure about what constitutes acceptable collaboration, please ask me directly.

I will work with the Henry and Lucy Moses Center to accommodate students with disabilities; please contact me as soon as possible if this applies to you.

Office Hours

I will hold open office hours after class; if you plan to attend, please add yourself to the signup sheet by the end of that day’s class.

If you would like to meet with me individually, please contact me for an appointment. I am happy to discuss your career goals, thoughts on the course, concerns, etc. I do not provide one-on-one tutoring or exam preparation; those types of questions should be directed to me in open office hours so that all students can benefit equally.

E-mail policy

I will respond to e-mails within 24 hours. If you haven’t received a response from me by then, you can assume I’ve forgotten; please send a reminder.

General help and troubleshooting questions should be directed via the Slack channel, since other students might have the same question.

On the night before exams, I will answer all questions received by 7pm.

Projects follow-up course

INFO-UB 24 Projects in Programming and Data Sciences builds on the topics we cover in this course, and also covers web crawling, text analysis, regular expressions, visualization, network analysis, etc. Students who are interested in learning more about practical aspects of programming are highly encouraged to take the follow-up course.

Things that we will not use or cover

We do not plan on using R/STATA/Matlab, or visualization technologies like Tableau or D3.js. If you are interested in visualization specifically, please consider taking INFO-GB 3106 Data Visualization.

Also, this is not a class about statistics, machine learning, or data mining; it does not teach you what to do with your data. Instead, this course will equip you with general-purpose skills for organizing, processing, and exploring data, which you can then apply in more advanced projects or classes such as INFO-UB 57 Data Mining for Business Analytics.

FAQs

Q: Why don’t we use R/STATA/Matlab?

A: To learn core skills within a unified framework - and to minimize the use of competing syntaxes - we standardize on Python. R/STATA/Matlab excel in targeted applications (e.g., statistical modeling), but can be unwieldy when used for a wider range of computing tasks. Python, on the other hand, can achieve many similar results, but it is a more versatile and general-purpose language. I think you’ll find that many of the concepts we cover will help you quickly pick up other languages and programs when the need arises.

Q: Should I know programming to take this class?

A: No, we will learn programming in Python during the class.

Q: I know programming and/or SQL. Is this the right class for me?

A: It depends on your level. Approximately 40% of the course will focus on programming and Python, then 40% on databases and SQL, and 20% on a variety of other topics. If you know programming but not Python and are not familiar with SQL, I think that you will get a lot out of this class. If you already know Python but not SQL, it may be worthwhile, but there will be repetition of things that you know. If you are familiar with both programming and SQL, then this is definitely not the class for you.

Q: I already know Python and SQL, have used some NLP tools, and am really interested in learning the following couple of topics in more depth…

A: This is not the right class for you. The class is designed to be broad and introductory, not deep and advanced. You should consider taking the data mining class, or a specialized class on the topic of your interest. If you take this class, you are most probably going to be bored, and it will not be a good use of your time.

Q: Will we learn about big data?

A: While we will learn a lot about handling big data sets, we will most likely not cover any “big data” tools, such as Hadoop, Hive, Pig, etc. Instead, we focus on the basics of how to manage and structure data; you will be surprised how far you can go with just a simple relational database and knowledge of SQL. Once you add Python to the mix with SQL, your abilities become superpowers. “You’re going to like the way you look” at the end of the class, even without knowing Hadoop.
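
To give a flavor of what combining Python and SQL looks like, here is a small sketch using Python's built-in sqlite3 module. The table name, columns, and rows are made up for illustration and are not course data; the database engine used in class may differ.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, for illustration only.
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
rows = [("NYC", 12.5), ("NYC", 7.5), ("Boston", 10.0)]
conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)

# SQL does the grouping and aggregation...
query = """
    SELECT city, COUNT(*) AS n_trips, AVG(fare) AS avg_fare
    FROM trips
    GROUP BY city
    ORDER BY city
"""
summary = conn.execute(query).fetchall()

# ...and Python handles the presentation of the results.
for city, n_trips, avg_fare in summary:
    print(f"{city}: {n_trips} trips, average fare {avg_fare:.2f}")

conn.close()
```

The division of labor is the point: the query describes what to compute, and Python takes care of loading the data and presenting the answer.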