Tools of the Data Scientist

The intent of this document is to give passing familiarity with some commonly used tools of the data scientist. While the technologies listed below are by no means exhaustive, they are frequently seen in practice, and proficiency with them will go a long way towards the fast development of useful data-driven systems.

Unix: Unix is a family of operating systems that share a basic philosophy, structure, and command syntax. Running programs and operating on data via the Unix command line is very common in industry, and knowledge of a few tools can help answer questions and solve common problems very quickly. Mac OS X is a Unix-based operating system; one can access the command line via the preinstalled “Terminal” application. Cygwin (http://www.cygwin.com/) is a Unix-like command-line environment for Windows. Linux is one of the most popular (and free) Unix systems, used by many businesses from tiny to huge.

There are many tutorials on the Unix command line. For instance, see:

  1. beginner guide - (http://www.ee.surrey.ac.uk/Teaching/Unix/)
  2. bash (a particular shell type) guide (http://mywiki.wooledge.org/BashGuide)
  3. The linux documentation project (http://tldp.org/guides.html)

Mac users might like:

  1. Intro to the OS X Unix Command Line - (http://www.matisse.net/OSX/intro_unix/0_outline.html)
  2. OS X Tutorial for Beginners - (http://acad.coloradocollege.edu/dept/pc/SciCompLab/UnixTutorial/)

The most important thing to remember is that the manual (“man page”) of any command can be viewed by typing “man {name of command}”.

Some command line utilities everyone should be familiar with:

  1. grep - a pattern matching utility; prints the lines that match an expression
  2. sort - sorts the lines of a file
  3. uniq - removes duplicates among adjacent lines; to get the unique lines in a file, combine with sort
  4. cut - selects particular columns from the input
  5. awk - a simple data extraction and formatting tool
  6. cat - concatenates (prints) one or more files
  7. less - pages through a file
  8. wc - word count; also useful for getting the number of lines and number of characters

Additionally, one should be comfortable combining commands together with the Unix pipe “|”. For instance, one can print the lines of a file, sort them, and then view the unique lines of the result using:

cat somefile.txt | sort | uniq | less
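
As another hypothetical example, assuming somefile.txt is tab-delimited, one could count the number of distinct values in its second column by combining cut, sort, uniq, and wc:

cut -f2 somefile.txt | sort | uniq | wc -l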

Finally, it is often important to edit text documents, for instance the source code of a program. Unix terminals typically come prepackaged with several reasonable text editors. The most frequently encountered are vi/vim, emacs, and pico/nano. Each has its own features, shortcuts, and drawbacks, so it’s up to the user to choose their favorite. I personally prefer emacs.

Python: Python has become a fairly common scripting and programming language in a variety of contexts, and has gained traction with users such as Google. As a language, it offers a good variety of libraries for many tasks and an object-oriented programming model, enabling the development of complex systems, and it is interpreted rather than compiled, giving faster turn-around in the development cycle.

  1. The official Python tutorial (http://docs.python.org/tutorial/index.html)
  2. Instant Hacking (http://hetland.org/writing/instant-hacking.html)
  3. Learning to Program (http://www.freenetpages.co.uk/hp/alan.gauld/)
  4. Installing Python (http://wiki.python.org/moin/BeginnersGuide/Download)

Some useful Python libraries and utilities:

  1. IPython: (http://ipython.org/) - a REPL for easy interactive Python development
  2. easy_install: (http://peak.telecommunity.com/DevCenter/EasyInstall) - a convenient package management system for installing libraries
  3. matplotlib: (http://matplotlib.sourceforge.net/) - a very nice plotting library
  4. numpy and scipy: (http://www.scipy.org/) - useful libraries for doing scientific computing (a short example combining matplotlib and numpy appears after this list)
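
As a small illustration of the last two items, here is a minimal sketch (assuming numpy and matplotlib are installed; the output filename is hypothetical) that computes a sine curve with numpy and plots it with matplotlib:

import numpy as np
import matplotlib.pyplot as plt

# generate 100 evenly spaced points and their sine values
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# plot and save the figure to a PNG file
plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.savefig("sine.png")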

Consider the following example Python program (called a script) that reads a file line by line, then prints it back out with a prepended line number:

import sys

# open the input file for reading
file = open('somefile.txt', 'r')

ct = 1
for line in file:
    # prepend the line number, separated from the line by a tab
    sys.stdout.write(str(ct) + "\t" + line)
    ct = ct + 1

file.close()

To run this script, put it in a file, for instance newliner.py (using one of the text editors discussed above), and execute it by typing python newliner.py on the command line. The output will be printed to standard out (to the terminal). For easy consumption, you may wish to “pipe” this output to more, e.g. python newliner.py | more

SQL: The most common language for managing relational databases. MySQL is a very commonly used open-source relational database. Because relational databases are frequently encountered as components of information systems, some knowledge of SQL is a must for a data scientist. (MySQL is common enough that a data scientist should have specific knowledge of this particular database system.) One should know how to select specific columns from a relation, filter rows with a “WHERE” clause, join tables together (JOINs and UNIONs), and compute common useful aggregations on the results (SUM and COUNT, for instance). Fortunately, the SQL language is simple and very common, and most programming languages have a mature API to MySQL for programmatic access of the data.
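
For instance, a query of the following shape (using hypothetical orders and customers tables, purely for illustration) exercises most of these pieces:

-- number of orders and total amount spent per customer, restricted to shipped orders
SELECT c.name, COUNT(*) AS num_orders, SUM(o.amount) AS total_spent
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'shipped'
GROUP BY c.name;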

Basic tutorials on MySQL:

  1. The official MySQL tutorial (http://dev.mysql.com/doc/refman/5.0/en/tutorial.html)
  2. Getting started guide (http://dev.mysql.com/tech-resources/articles/mysql_intro.html)
  3. Another tutorial (http://www.tizag.com/mysqlTutorial/)


Hadoop: A framework for distributed computing and data storage. Hadoop includes both a distributed file system (HDFS) and an implementation of the MapReduce programming pattern. While programming Hadoop in Java is perhaps a more advanced topic, Hadoop’s standing as the go-to system for data storage makes some degree of comfort with HDFS a must. Additionally, there are some more recent libraries that greatly simplify large-scale distributed programming in Hadoop.
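
As a minimal illustration of interacting with HDFS from the command line (the paths here are hypothetical), the hadoop fs utility mirrors familiar Unix commands:

hadoop fs -ls /user/someuser                  # list a directory in HDFS
hadoop fs -put somefile.txt /user/someuser/   # copy a local file into HDFS
hadoop fs -cat /user/someuser/somefile.txt    # print a file stored in HDFS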

Some Hadoop documentation

  1. The official Hadoop project page (http://hadoop.apache.org/)
  2. The Hadoop distributed file system (http://hadoop.apache.org/hdfs/)
  3. Dumbo, a library allowing simplified distributed computing in Python (https://github.com/klbostee/dumbo/wiki); see the sketch after this list
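
To give a sense of what Dumbo code looks like, a MapReduce word count is roughly the sketch below (based on Dumbo’s documented style; consult the wiki above for the exact interface and for how to launch the job on a cluster):

def mapper(key, value):
    # the input value is a line of text; emit a count of 1 per word
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # sum the counts emitted for each word
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)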

As a general principle with all of the technologies listed above, it is important to understand that Google can be your greatest friend. Almost every problem has been encountered before, and simply googling any error message is likely to give a good explanation and a solution or work-around. For questions that Google isn’t able to answer, StackOverflow is an extremely valuable source of information.