Overview
Modern data science is impossible without some
   understanding of the Unix command line.  Unix is a family of
   computer operating systems that includes the Mac’s OS X and Linux
   (technically, Linux is a Unix clone); Windows has Unix emulators,
   which allow running Unix commands.  We will use the terms Unix
   and Linux interchangeably to mean operating systems that use the unix
   shell commands, our present topic.
 
As one’s proficiency with the unix shell increases,
   so too does efficiency in completing and automating many tasks. This
   document is a tutorial on some of the basic unix command-line
   utilities used for data gathering, searching, cleaning, and
   summarizing. Unix commands are generally very efficient, and can
   process data far larger than what can be loaded into your
   computer’s main memory, easily handling workloads that far exceed
   the capabilities of tools like Excel.
 
Getting Started
This section is designed to get your environment set
   up, putting you in a position to accomplish the tasks and use the
   tools discussed below. If your personal computer already runs a linux
   distribution, you doubtless already know how to access your
   system’s terminal, so no setup instructions are given for linux
   users in this tutorial.
 
OS-X: There are several options
   for OS-X users. Because OS-X is a unix system under the hood, it
   comes prepackaged with a command-line shell called
   “Terminal”. The easiest way to open Terminal is through
   Spotlight (the magnifying glass in the top right, or simply
   command+space): type “terminal”
   and press return. You can customize the appearance of Terminal and
   open new tabs by pressing command+t.
 
Many users prefer iTerm2,
   a more performance-oriented and feature-rich Terminal replacement.
   Its default green-and-black “matrix” color scheme is also much easier
   on the eyes than the black-and-white default of Terminal, though
   these settings are all customizable.
 
Windows: While Windows
   doesn’t have a built-in unix shell, there is Cygwin, a
   robust open-source unix emulator that includes many of the more
   widely used unix utilities. Additionally, if you are on a Windows
   machine but have access to a remote unix machine, you can connect to
   a remote shell securely using PuTTY,
   a utility for managing remote connections and facilitating terminal
   access.
 
Once you have access to the terminal, try it out! Type
   “pwd”. This will
   tell you your current directory. If you want to know the contents of
   this directory, type “ls -A”.
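 
For instance, a session might look something like this (the path and directory contents shown are purely illustrative; yours will differ):
 
$ pwd
/Users/yourname
$ ls -A
.bash_profile        Documents        Downloads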
 
Data
Many of the shell scripting examples below are performed
   on the following example data:
 
123        1346699925        11122        foo bar
222        1346699955        11145        biz baz
140        1346710000        11122        hee haw
234        1346700000        11135        bip bop
146        1346699999        11123        foo bar
99        1346750000        11135        bip bop
99        1346750000        11135        bip bop
 
The columns in this tab-separated data correspond to
   [order id] [time of order] [user id] [ordered item], something
   similar to what might be encountered in practice.  Note that the
   [ordered item] field itself has spaces in the item names (e.g., “foo
   bar”).  If you want to try the scripts described below on
   this data, simply open up your text editor, copy in the sample data
   given above, then save and exit your editor. There is
   a simple terminal-based editor called nano: type nano to open the
   editor, paste in the sample data given above, and press control+x to exit.
   Upon exiting, you will be asked to save the file.
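 
For example:
 
nano sample.txt        (opens nano on a new file named sample.txt; after pasting the data, press control+x, then y to confirm the save)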
 
Alternately, the sample data file is hosted online. You
   can use terminal commands to copy this remote file. Simply type:
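 
curl -O http://your-host/path/sample.txt        (the URL here is a placeholder, since the original link isn’t reproduced; substitute the file’s actual address. wget works similarly.)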
 
This will pull the file into the current directory of the
   active terminal, creating a new file called
   “sample.txt”.
 
Command-line Utilities
This section gives some crucial unix utilities, ordered
   roughly according to their usefulness to the data scientist. This
   list is by no means exhaustive, and the ordering is not perfect;
   different tasks have different demands. Fortunately, unix has been
   around for a while and has an extremely active user base, developing
   a wide range of utilities for common data processing, networking,
   system management, and automation tasks. 
 
Once you are familiar
    with programming, you will be able to write your own
   scripts to perform tasks that you cannot accomplish
   using existing unix utilities. The tradeoff between hand-coded
   scripts and existing unix utilities is an increase in
   flexibility at the expense of increased development time, and
   therefore a reduction in the speed of iteration.
 
Here are some of the more useful unix utilities:
 
- grep: a utility for pattern matching. grep is by far the most useful unix utility. While grep is conceptually very simple, an effective developer or data scientist will no doubt find themselves using grep dozens of times a day. grep is typically called like this:
grep [options] [pattern] [files]
 
With no options specified, this simply looks for the
   specified pattern in the given files, printing to the console only
   those lines that match the given pattern. 
 
Example: 
 
grep 'foo bar' sample.txt
 
Will give: 
 
123        1346699925        11122        foo bar
146        1346699999        11123        foo bar
 
This in itself can be very useful for sifting through large
   volumes of data to find what you’re looking for.
 
The power of grep really shows when different command
   options are specified. Below is just a sample of the more useful
   grep options:
 
- -v : Inverted matching. In this setting, grep will return all the input lines that do not match the specified pattern.
Example: 
 
grep -v 'foo bar' sample.txt
 
Will give: 
 
222        1346699955        11145        biz baz
140        1346710000        11122        hee haw
234        1346700000        11135        bip bop
99        1346750000        11135        bip bop
99        1346750000        11135        bip bop
 
- -R : Recursive matching. Here grep descends into subdirectories, applying the pattern to all files encountered. Very useful if you’re looking to see whether any logs have lines that interest you, or to find the source code file containing a particular function.
Example:
 
        cd ..                  (this will bring you up one folder)
 
        grep -R 'foo bar' .    (here . refers to the current directory)
 
        cd -                   (brings you back to the original folder)
 
Will perform a recursive search and return something
   like:
 
./data/sample.txt:123        1346699925        11122        foo bar
./data/sample.txt:146        1346699999        11123        foo bar
 
- -P : Perl regular expressions: Here patterns are perl regular expressions. This gives the user the ability to match extremely flexible patterns.
Example:
 
grep -P '23\s+foo' sample.txt
 
Will give:
 
146        1346699999        11123        foo bar
 
- sort: an extremely efficient implementation of external merge sort. In a nutshell, this means the sort utility can order a dataset far larger than can fit in a system’s main memory. While sorting extremely large files does drastically increase the runtime, smaller files are sorted quickly. Typically called like:
sort [options] [file]
 
Example:
 
        sort sample.txt
 
Will give:
 
123        1346699925        11122        foo bar
140        1346710000        11122        hee haw
146        1346699999        11123        foo bar
222        1346699955        11145        biz baz
234        1346700000        11135        bip bop
99        1346750000        11135        bip bop
99        1346750000        11135        bip bop
 
Useful both as a component of larger shell scripts, and
   independently, as a tool to, say, quickly find the most active users,
   or to see the most frequently loaded pages on a domain. Some useful
   options: 
 
- -r : reverse order. Sort the input in descending order.
Example:
 
sort -r sample.txt
 
Will give:
 
99        1346750000        11135        bip bop
99        1346750000        11135        bip bop
234        1346700000        11135        bip bop
222        1346699955        11145        biz baz
146        1346699999        11123        foo bar
140        1346710000        11122        hee haw
123        1346699925        11122        foo bar
 
- -n : numeric order. Sort the input in numerical order as opposed to the default lexicographical order.
Example:
 
sort -n sample.txt
 
Will give:
 
99        1346750000        11135        bip bop
99        1346750000        11135        bip bop
123        1346699925        11122        foo bar
140        1346710000        11122        hee haw
146        1346699999        11123        foo bar
222        1346699955        11145        biz baz
234        1346700000        11135        bip bop
 
- -k n: sort the input according to the values in the n-th column. Useful for columnar data. See also the -t option, which specifies the field separator; a sketch combining the two appears after the example below.
Example:
 
sort -k 2 sample.txt
 
Will give:
 
123        1346699925        11122        foo bar
222        1346699955        11145        biz baz
146        1346699999        11123        foo bar
234        1346700000        11135        bip bop
140        1346710000        11122        hee haw
99        1346750000        11135        bip bop
99        1346750000        11135        bip bop
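 
Here is the promised sketch combining -t, -k, and -n (assuming a bash shell, where a literal tab can be written $'\t'). This sorts numerically on the second (timestamp) column:
 
sort -t$'\t' -k2,2n sample.txt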
 
- uniq: remove sequential duplicates: prints each run of identical adjacent lines only once. Note that uniq only collapses duplicates that are adjacent, which is why it is usually run on sorted input.
Example:
 
uniq sample.txt
 
Will give:
 
123        1346699925        11122        foo bar
222        1346699955        11145        biz baz
140        1346710000        11122        hee haw
234        1346700000        11135        bip bop
146        1346699999        11123        foo bar
99        1346750000        11135        bip bop
 
Used with the -c option,
   uniq will prefix each output line with the number of consecutive
   duplicates of that line.
 
Example:
 
uniq -c sample.txt
 
Will give:
 
      1 123        1346699925        11122        foo bar
      1 222        1346699955        11145        biz baz
      1 140        1346710000        11122        hee haw
      1 234        1346700000        11135        bip bop
      1 146        1346699999        11123        foo bar
      2 99        1346750000        11135        bip bop
 
- cut: Used to select or “cut” certain fields (usually columns) from input. Cut is typically used with the -f option to specify a comma-separated list of columns to be emitted.
Example:
 
cut -f2,4 sample.txt
 
Will give:
 
1346699925        foo bar
1346699955        biz baz
1346710000        hee haw
1346700000        bip bop
1346699999        foo bar
1346750000        bip bop
1346750000        bip bop
 
An important option with the cut utility is -d, which is
   used to specify the string used to separate the fields in the input.
   While the default value of tab is appropriate for our sample file, if
   spaces were used instead of tabs, we could change the above command
   to:
 
cut -d" " -f2,4 sample.txt
 
- cat: concatenate the contents of the specified files to standard output.
Example:
 
cat sample.txt
 
Will give:
 
123        1346699925        11122        foo bar
222        1346699955        11145        biz baz
140        1346710000        11122        hee haw
234        1346700000        11135        bip bop
146        1346699999        11123        foo bar
99        1346750000        11135        bip bop
99        1346750000        11135        bip bop
 
- ls: lists the contents of a directory or provides information about the specified files. Typical usage:
ls [options] [files or directories]
 
By default, ls simply lists the contents of the current
   directory. There are several options that, when used in
   conjunction with ls, give more detailed information about the files or
   directories being queried. Here is a sample:
 
- -A: list all of the contents of the queried directory, even hidden files.
- -l: detailed format, display additional info for all files and directories.
- -R: recursively list the contents of any subdirectories.
- -t: sort files by the time of the last modification.
- -S: sort files by size.
- -r: reverse any sort order.
- -h: when used in conjunction with -l, gives a more human-readable output.
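 
These single-letter flags can be combined. For example, a common idiom for spotting recently modified files produces a long, human-readable listing sorted by modification time, with the newest files at the bottom:
 
ls -lhtr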
- cd: change the current directory.
- head/tail: output the first (last) lines of a file. Typically used like:
head -n 5 sample.txt
 
tail -n 5 sample.txt
 
The -n option
   specifies the number of lines to be output; the default value is 10.
 
tail, when used with the -f
   option, will output the end of a file as it is written to. This is
   useful if a program is writing output or logging progress to a file
   and you want to read it as it happens.
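 
For example (the filename is illustrative; point tail at whatever file is being written):
 
tail -f output.log        (press control+c to stop following)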
 
- less: navigate through the contents of a file or through the output of another script or utility. When invoked like:
less [some big file]
 
less enters an interactive mode. In this mode, several
   keys help you navigate the input file. Some key commands are:
 
- (space): space navigates forward one screen.
- (enter): enter navigates forward one line.
- b: navigates backwards one screen.
- y: navigates backwards one line.
- /[pattern]: search forwards for the next occurrence of [pattern]
- ?[pattern]: search backwards for the previous occurrence of [pattern]
Where [pattern] can be a basic string or a regular
   expression. 
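 
less also works at the end of a pipe, which makes it handy for paging through the output of other utilities. For example:
 
sort sample.txt | less        (press q to quit)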
 
- wc: compute word, line, and byte counts for specified files or output of other scripts. Particularly useful when used in concert with other utilities such as grep, sort, and uniq. Example usage:
wc sample.txt
 
This will give: 
 
  7  35 201 sample.txt
 
indicating the number of lines, words, and bytes in the
   file respectively. There are some useful flags for wc that will help
   you answer specific questions quickly: 
 
- -l: get the number of lines from the input.
Example:
 
wc -l sample.txt
 
Will give:
 
7 sample.txt
 
- -w: get the number of words in the input.
Example:
 
wc -w sample.txt
 
Will give:
 
35 sample.txt
 
- -m: the number of characters in the input.
Example:
 
wc -m sample.txt
 
Will give:
 
201 sample.txt
 
- -c: the number of bytes in the input.
Example:
 
wc -c sample.txt
 
Will give:
 
201 sample.txt
 
Here, the number of bytes and characters are the same;
   every character used is a single-byte (ASCII) character.
 
Pipes
Pipes provide a way of connecting the output of one unix
   program or utility to the input of another, through standard input
   and output. Unix pipes give you the power to compose various
   utilities into a data flow and use your creativity to solve problems.
   Utilities are connected together (“piped” together) via
   the pipe operator, |. For
   instance, if you want to know how many records in the sample data
   file do not contain “foo bar”, you can compose a data
   flow like this:
 
cat sample.txt | grep -v 'foo bar' | wc -l
 
This will give: 
 
5
 
Using wc at the end of a pipe to count the number of
   matching output records is a common pattern. Recalling that uniq
   removes any sequential duplicates, we can count the number of unique
   users making purchases in our file by composing a data flow like
   this:
 
cat sample.txt | cut -f3 | sort | uniq | wc -l
 
This will give:
 
4
 
Or, if you want to count how many transactions each user has
   appeared in:
 
cat sample.txt | cut -f3 | sort | uniq -c
 
This will give:
 
      2 11122
 
      1 11123
 
      3 11135
 
      1 11145
 
To now order the users by number of transactions made,
   you can try something like: 
 
cat sample.txt | cut -f3 | sort | uniq -c | sort -nr
 
Which will return: 
 
      3 11135
 
      2 11122
 
      1 11145
 
      1 11123
 
Notice here that the -r and
   -n flags for the sort
   command are combined into -nr. This shorthand for combining
   single-letter flags is accepted by most unix utilities.
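 
As an aside, the leading cat in these pipelines is not strictly necessary, since cut (like most of these utilities) accepts a filename directly; the same data flow can be written:
 
cut -f3 sample.txt | sort | uniq -c | sort -nr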
 
More Useful Command Line Utilities:
- xargs: used for building and executing terminal commands. Often used to read input from a pipe, and perform the same command on each line read from the pipe. For instance, if we want to look up all of the .txt files in a directory and concatenate them, we can use xargs:
ls . | grep '.txt' | xargs cat
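 
One caveat: filenames containing spaces will confuse this pattern, because xargs splits its input on whitespace by default. A more robust sketch uses find’s -print0 together with xargs -0, which delimit file names with null characters instead:
 
find . -name '*.txt' -print0 | xargs -0 cat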
 
- find: search directories for matching files. Useful when you know the name of a file (or part of the name), but do not know the file’s location in a directory. Example:
find ~/ -name 'sample*'
 
- sed: A feature-rich stream editor. Useful for performing simple transformations on an input stream, whether from a pipe or from a file. For instance, if we want to replace the space in the fourth column of our sample input with an underscore, we can use sed:
cat sample.txt | sed 's/ /_/'
 
This will give:
 
123        1346699925        11122        foo_bar
 
222        1346699955        11145        biz_baz
 
140        1346710000        11122        hee_haw
 
234        1346700000        11135        bip_bop
 
146        1346699999        11123        foo_bar
 
99        1346750000        11135        bip_bop
 
99        1346750000        11135        bip_bop
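 
Note that s/ /_/ replaces only the first space on each line, which works here because the space in the fourth column is the only space in our tab-separated data. To replace every occurrence on a line, append the g flag:
 
cat sample.txt | sed 's/ /_/g'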
 
- screen: Manager for terminal screens. Can be used to “re-attach” terminal sessions so you can continue your work after logging out, etc. Particularly useful when working on a remote server.
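For example (the session name is illustrative):
 
screen -S analysis        (start a new session named "analysis")
screen -r analysis        (re-attach to it later; from inside a session, detach with control+a followed by d)
 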
- top: displays currently running tasks and their resource utilization.
- fmt: a simple text formatter, often used for limiting the width of lines in a file. Typical usage includes a -width flag, where width is a positive integer denoting the maximum number of characters on each output line. Because fmt never splits individual words (sequences of non-whitespace characters), a width of 1 puts one word on each line. For instance, if we want to get all of the individual “words” in our sample input file, one word per line, we can use (with head to limit output):
fmt -1 sample.txt | head
 
This will give:
 
123
 
1346699925
 
11122
 
foo
 
bar
 
222
 
1346699955
 
11145
 
biz
 
baz
 
Pick your Text Editor
While nano is a
   great way to jump right in to coding, there is a rich set of editors
   available in the terminal that are useful for exploring and modifying
   files in addition to writing source code for programming languages.
   nano is the simplest common text editor; vim and emacs are both far
   more complex and far more feature-rich. Choosing vim or emacs entails
   climbing a learning curve: there are many special key combinations
   that do useful things, and special modes optimized for certain common
   tasks. However, this power, once mastered, can greatly increase your
   effectiveness as a programmer, greatly reducing your time between
   iterations.
 
For experienced programmers, choosing an editor is almost
   like choosing a religion: one is right and all others are wrong. Some
   programmers are very vocal about this. However, at the end of the
   day, all editors do the same things, albeit offering different paths
   to get there. When you feel you’re ready to try out a new text
   editor, my advice is to pick one that your friends or colleagues are
   familiar with. They can get you on your feet quickly with a few
   useful tips, and get you unstuck when you run into trouble.
 
