Software

1 Some command line tools worth learning

1.1 anaconda

Specifically miniconda.

Python is one of the most flexible tools for data analysis, but as a general-purpose programming language it does not come with all of the tools you need installed by default. As such you will likely need to install some packages like pandas, numpy, or even jupyter notebook. anaconda is a metapackage containing almost everything you will need to use. Plus it comes with a great package manager, conda, which automatically solves dependencies/requirements for you and allows you to create your own environments.

However, a default anaconda installation can be several gigabytes, and over time I’ve found that I never use some packages. That’s where miniconda comes in. It’s a much lighter-weight version of anaconda with the basic tools to allow you to install only what you need. If you don’t know what you will need, go ahead and install anaconda.

If you are using miniconda, for starters I would recommend at least the following:

conda install jupyter pandas numpy scipy scikit-learn --yes

1.1.1 conda-forge

Some packages may not be available in the default conda channel. Oftentimes these packages are available in the conda-forge channel. conda-forge is a vast, community-led project. As such you can often find what you need, but be warned that sometimes the dependencies will conflict with your base anaconda installation.

It’s always worth checking conda-forge, as there are even non-python binaries available, such as htop. You never know what you will find there.
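
For example, installing something from conda-forge is just a matter of specifying the channel:

conda install -c conda-forge htop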

1.1.2 environments

If you have different dependency requirements due to conflicts between packages or channels, or if you just want to `freeze` the requirements of each project individually, then you will want to explore environments.

Using conda you can easily create your own environments and specify the exact packages and version numbers you want:

conda create -n myenv python=3.6 scipy=0.15.0

To start/stop using an environment:

conda activate myenv
conda deactivate

1.1.3 pip

As anaconda has grown and gotten more complex, sometimes resolving dependencies can become a very difficult and time-consuming task. This is especially true if you are working off of a highly customized anaconda installation–which tends to happen if you have an old install that you keep adding to.

In times like this it is useful to fall back on the default Python package manager pip. pip does some dependency management and by default reads from PyPI, the Python Package Index, but it can also install from GitHub or from local files.
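
For instance (the package name and repository below are just placeholders), pip can install from PyPI, from a GitHub repository, or from a local directory:

pip install requests
pip install git+https://github.com/someuser/somepackage.git
pip install -e /path/to/local/package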

If you find yourself frustrated by conda, don’t forget: there is always pip!

1.1.4 conda for R

Yes I know, this is blasphemy. However, sometimes just getting a basic R installation up and running can take some time, as you might run into issues getting things to compile correctly on your system. In fact, getting R set up on different machines is what drove me to switch to Python and to try anaconda in the first place. anaconda has actually supported R for years now, and just like with Python it is an easy way to get started. Plus, with projects like IRkernel, which lets you run R kernels in jupyter notebook, it is a pretty good way to do interactive data analysis. Most R packages can be installed via conda with r-[package name].

Much like anaconda for Python, I recommend that new R users start with tidyverse and data.table for data analysis:

conda install r-base r-essentials r-irkernel r-tidyverse r-data.table

If, like me, you find yourself constantly playing with different R packages then you may actually want to rely on R’s built-in install.packages(), or on devtools::install_github(). This may seem counterintuitive, as I have (hopefully) just convinced you that conda is the best tool for package management in every context, but just as you may want to fall back on pip for Python, you may find it easier to use R’s default package manager after you do your initial conda install.

1.2 curl

These days much of the data you will want to analyze lives on the web. If you want to get at this data you will run into terms like HTTP, FTP, REST APIs, and JSON. If you want to work with these things you should probably learn curl.

The basic usage of curl is simple, yet endlessly customizable:

curl [options] [URL...]

Why curl over something like wget? curl writes to STDOUT by default, while wget writes directly to disk. This makes curl a much more flexible utility in the middle of a pipeline.
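
As a small illustration (with a made-up URL), you can fetch a JSON payload and pipe it straight into another tool such as jq (covered below):

curl -s https://api.example.com/data.json | jq '.'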

Bonus tip: curl is available in conda-forge.

1.3 jq

jq is like awk but specifically for json files. It is fast, flexible, and plays nicely with your other command-line tools.

It is extremely useful for pretty-printing your json files, and can easily be run over many files at once. The more advanced syntax, including select and map, can help you explore very complex files in just a few lines of code.
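
For example, to pretty-print a file, or to pull one field out of matching records across many files (the file and field names here are hypothetical):

jq '.' data.json
jq '.results[] | select(.year > 2015) | .name' *.json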

Bonus tip: jq is available in conda-forge.

1.4 xmlstarlet

xmlstarlet is like awk/sed/grep for XML files.

It is extremely useful for browsing the structure of your XML files (xmlstarlet el) and subsequently extracting the relevant information (xmlstarlet sel). It is written in C, so it is very fast.
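
For example (with a hypothetical file and XPath), listing the element structure and then pulling out specific values:

xmlstarlet el file.xml
xmlstarlet sel -t -v '//record/title' -n file.xml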

1.5 csvkit

Sometimes you need something a bit fancier than less–that’s where csvkit comes in. csvkit includes a variety of tools for working with csv files, including csvlook, csvjoin, and csvformat, all of which can greatly enhance your ability to work with csvs on the command line.
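
For instance (the file and column names are made up), you can pretty-print a csv or join two files on a shared column:

csvlook data.csv
csvjoin -c id left.csv right.csv | csvlook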

1.6 dtach

dtach is like a much lighter-weight version of screen or tmux. As such, it is very easy to use and is also a lot less finicky.

Note: you will have to build dtach from source yourself.

1.7 Rclone

Rclone is a command-line tool for accessing data located in cloud storage, including common enterprise tools like Box, Dropbox, or Google Drive.

It’s fairly easy to use because it has commands that should be familiar to most unix users. Moreover, it is pretty fast.
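
For example, once a remote is configured (here hypothetically called box), the commands mirror familiar unix ones:

rclone ls box:research/data
rclone copy box:research/data ./data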

It also happens to be installed by default on WRDS (/usr/bin/rclone).

1.8 GNU parallel

GNU parallel is a great command line utility written in Perl which allows for very fine-tuned control over parallelization. If you are familiar with something like xargs, then parallel is like a more robust, scalable version of xargs.

Admittedly the learning curve for parallel can be a bit high, but it makes replacing serial loops with parallel tasks very easy.

Suppose you have a script SOMETHING which you want to run over a list of csv files in your current directory:

for i in $(find *.csv); do
    ./SOMETHING $i
done

One way to easily parallelize this in bash is to add &:

for i in $(find *.csv); do
    ./SOMETHING $i &
done

You could also accomplish the same task with a pipe (assuming SOMETHING reads file names from standard input):

find *.csv | ./SOMETHING

or if the number of csv files is large you can use xargs:

find *.csv | xargs ./SOMETHING

If you want more fine-tuned control, such as over the number of concurrent jobs, then that is where parallel comes in:

find *.csv | parallel -j8 ./SOMETHING

parallel is very powerful, and can handle things like parsing arguments and writing output concurrently in a safe way. Suppose that your input is a pipe-delimited file whose columns you want to pass as arguments to your script, with all output collected in a single file:

cat INPUT.csv | parallel --colsep '\|' "./SOMETHING {1} {2}" > OUTPUT.csv

Just remember your bash quoting rules and you will be fine!

2 Some python libraries worth learning

2.1 requests

requests is a dead-simple HTTP library for Python. Like curl, it is an essential building block for working with data that lives on the web (aka scraping).

For example, many websites are now built around REST APIs and deliver JSON payloads. Rather than scraping HTML with something like BeautifulSoup, lxml, or, worst of all, Selenium, you can save yourself a lot of time and preserve your sanity by just using requests to get at the underlying data. All you need is the Inspect window of your browser and some patience, and soon you will be an API scraping master.
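
As a sketch of that pattern (the endpoint and parameters below are entirely hypothetical, stand-ins for whatever you find in the Inspect window):

import requests

# a hypothetical endpoint and parameters
resp = requests.get(
    'https://api.example.com/v1/filings',
    params={'ticker': 'ACME', 'year': 2020},
    headers={'User-Agent': 'my-research-script'},
)
resp.raise_for_status()
data = resp.json()  # the underlying JSON payload, ready for pandas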

Bonus tip: https://curl.trillworks.com/ is a great website that converts curl commands into requests code. This is especially useful because most browsers let you copy a network request as a curl command, which you can then convert straight into requests code!

2.2 asyncio

asyncio is part of the Python standard library as of Python 3.4. It is a library for running concurrent (single-threaded) code, and brings Python to the forefront of event-driven programming. That is a fancy way of saying that it is a neat library that can help you write highly concurrent code, write your own network apps, or even write some pretty fancy scrapers.

asyncio has spawned its own ecosystem of libraries, such as aiohttp, which is like an async version of requests, and aiofiles for dealing with the filesystem asynchronously.
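
A minimal sketch of the pattern, using aiohttp to fetch a handful of made-up URLs concurrently:

import asyncio
import aiohttp

URLS = ['https://example.com/a.json', 'https://example.com/b.json']

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.json()

async def main():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in URLS))

results = asyncio.run(main())  # Python 3.7+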

2.3 ipyparallel

ipyparallel is a parallel programming framework built around IPython. If you are using jupyter then you are already working within the IPython framework. ipyparallel is great at parallelizing tasks that are typically trivial to parallelize, such as long declarative for loops or large functional map/reduce jobs. I’ve also found ipyparallel useful for running simulations. When it comes to processing data, ipyparallel is sure to find a place in your workflow.
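
A small sketch of the kind of simulation loop I mean, assuming you have already started some engines (e.g. with ipcluster start -n 4):

import ipyparallel as ipp

rc = ipp.Client()                 # connects to the running engines
view = rc.load_balanced_view()

def simulate(seed):
    import random
    random.seed(seed)
    return sum(random.random() for _ in range(1_000_000))

results = view.map_sync(simulate, range(100))  # one draw per seed, spread across engines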

2.4 pandas

You have data. You use Python. If these conditions apply, then you should use pandas. The genius of pandas is that it provides the DataFrame: an indexed, two-dimensional, potentially heterogeneous and hierarchical table of rows and columns. In all likelihood 99% of the data you analyze with statistical techniques will fit into the DataFrame structure, and pandas makes working with DataFrames a breeze, with powerful functions for data serialization and transformation.
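
A tiny, made-up example of the kind of indexed, grouped work a DataFrame makes trivial:

import pandas as pd

df = pd.DataFrame({
    'firm': ['A', 'A', 'B'],
    'year': [2019, 2020, 2020],
    'sales': [1.0, 1.5, 2.0],
})

df = df.set_index(['firm', 'year'])              # hierarchical (MultiIndex) rows
print(df.loc['A'])                               # fast indexed access
print(df.groupby(level='firm')['sales'].mean())  # grouped aggregation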

2.5 PyTables/hdf5

Face it, sometimes you will have to write your data to disk. Maybe you don’t want to write your large and wonderful DataFrame to a csv file where you will lose the easy row/column access and fast indexed filtering. Fortunately PyTables exists, and since it is built on hdf5 it makes reading/writing data on disk super fast. The magic behind hdf5 is that it is 'hierarchical', and it has internal compression which often means your hdf5 files will be even smaller than your csv equivalents. It also has many of the nice features of working with DataFrames, and even some sql-like convenience functions.

pandas also makes working with PyTables and hdf5 files more generally a breeze:

df.to_hdf(..., format='table')
pd.read_hdf(..., where=..., start=..., stop=..., columns=..., chunksize=...)

Warning: While PyTables is a great substitute for running an RDBMS, or even sqlite, note that it is not designed for simultaneous writes. It can, however, be used for simultaneous read-only access. It’s a great data format when you have an ipyparallel workflow where you need to split/apply a large data file.
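
A minimal sketch of the round trip, with an arbitrary file name and query (data_columns=True makes the columns queryable on disk):

import pandas as pd

df = pd.DataFrame({'firm': ['A', 'B'], 'year': [2019, 2020], 'sales': [1.0, 2.0]})

df.to_hdf('data.h5', key='df', format='table', data_columns=True)

subset = pd.read_hdf('data.h5', key='df', where='year > 2019', columns=['firm', 'sales'])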

2.6 dask

dask is a parallel data processing library. A typical use-case is when you have a bunch of files spread across your filesystem, network, etc. and you want to glob all of them, do some minor processing/filtering, and produce a combined dataset. Perhaps you could write some bash script to find all of these files and maybe then you cat/grep/sed/awk them together in some way. You might even parallelize this process with xargs or GNU parallel. dask is purpose-built for this task, and can be much easier to scale up.

Invoking dask is very simple:

import dask.bag as db
import ujson as json

files = ! cat /path/to/your/files  # the ! shell-out only works inside IPython/jupyter
b = db.read_text(files, files_per_partition=10000).map(json.loads)
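
Continuing the snippet above, one possible next step (the year field is hypothetical) is to filter the bag and trigger the computation explicitly, since dask is lazy until you ask for a result:

n_recent = b.filter(lambda rec: rec.get('year', 0) > 2015).count().compute()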

dask also has great diagnostic tools such as a ProgressBar and various profilers.

2.7 ujson

ujson stands for UltraJSON, an ultra-fast JSON serializer written in C with Python bindings. For most applications you can use it as a drop-in replacement for the standard library json module, and it is often noticeably faster.

2.8 benedict

benedict is a Python dictionary subclass that makes navigating dictionaries in Python a lot easier. In many ways it is like BeautifulSoup, which is very good at working with irregular or malformed HTML/XML data, but for Python dictionaries, and JSON-like data. It is not as full-featured as many of the libraries on this list, but it can be very useful if you are working with irregular JSON data.
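
A short sketch, assuming benedict's keypath access works as I remember it (the nested data and field names are made up):

from benedict import benedict

data = {'filing': {'company': {'name': 'ACME', 'officers': [{'name': 'Jane'}]}}}
d = benedict(data)

print(d['filing.company.name'])            # keypath access instead of chained lookups
print(d.get('filing.company.ceo', 'n/a'))  # safe lookup with a default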

2.9 numba

At first glance, numba seems like an odd choice for Python users. The appeal of Python is that it is an interpreted language, and hence does not need to be compiled to run. numba is a compiler for Python code. However, it is an easy-to-use, just-in-time (JIT) compiler built on the LLVM compiler library. That means it can take very simple Python and numpy code and turn it into LLVM-compiled code that is nearly as fast as C or FORTRAN code.

A good use case for numba is taking an expensive matrix multiplication and re-writing it as a loop. This may seem counterintuitive, as the whole point of numpy is to abstract away slow Python loops in favor of optimized matrix operations. Yet these dumb, slow Python loops, once compiled with numba, can be significantly faster than their numpy counterparts if used correctly.
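
A toy version of the pattern (a trivial element-wise loop rather than a full matrix multiplication):

import numpy as np
from numba import njit

@njit
def pairwise_sum(a, b):
    out = np.empty(a.shape[0])
    for i in range(a.shape[0]):   # the 'dumb' loop that numba compiles to machine code
        out[i] = a[i] + b[i]
    return out

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)
result = pairwise_sum(x, y)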

3 Some R libraries worth learning

3.1 Microsoft R Open (MRO)

This is not an R library, but a distribution. Years ago, Revolution Analytics produced a highly optimized R distribution that was significantly faster for analytics than base R. It leveraged tools like the Intel Math Kernel Library to speed up matrix operations and was fantastic to use. The problem was that, unlike R, which is free and doesn’t require a license, Revolution R required a license and could be expensive. Microsoft acquired Revolution Analytics, and the distribution is now available for free as Microsoft R Open, which is the default R distribution through conda!

conda install r-base r-essentials

3.2 tidyverse

tidyverse is a metapackage of data analysis tools for R. In many ways it is like the anaconda default installation in that it includes so many of the essentials. To get started analyzing data in a modern R setup you will likely need ggplot2, dplyr, stringr, and purrr just to name a few. All of these are part of tidyverse.

tidyverse also contains one of the most useful packages in any language: haven, which allows you to read SAS and Stata files. Look, we can all pretend like we don’t have co-authors that use these languages, or we can deal with it and use haven.

3.3 data.table

data.table is very fast, and has an intuitive syntax. It is certainly different from tidyverse::dplyr, but for those familiar with pandas, PyTables, or sql it may be more intuitive.

Bonus tip: DataCamp has a great cheat sheet for data.table.

3.4 felm and alpaca

felm (from the lfe package) and alpaca are R packages for linear/logistic regressions with high-dimensional fixed effects and clustered standard errors. Both are available on CRAN and are fairly well documented.

4 Other Useful Software

4.1 DuckDB

DuckDB is a SQL-style database with very convenient syntax for data analysis. It works very well with standard tabular data formats and plays nicely with Python. It is also very fast for analytic workloads, including reading/writing and processing data. One of the most useful features is the variety of built-in datatypes, such as LIST and STRUCT, which map to Python/JSON datatypes. It also allows for nested or composite datatypes, which are often found in real-world data. By default, DuckDB operates on an in-memory database. In many ways it is like SQLite, but designed for data analysis workflows. It also features a lot of useful extensions which facilitate full-text search, JSON querying, reading/writing remote files over HTTP, and reading from Postgres/SQLite databases.
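
A minimal sketch of using DuckDB from Python, querying an in-memory database (DuckDB can read a pandas DataFrame sitting in your session by name):

import duckdb
import pandas as pd

df = pd.DataFrame({'firm': ['A', 'B'], 'sales': [1.0, 2.0]})

con = duckdb.connect()  # an in-memory database by default
result = con.execute('SELECT firm, sales * 2 AS double_sales FROM df').df()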

4.2 GraphQL

GraphQL is a modern query language for APIs, originally developed at Facebook and now open sourced and maintained by numerous contributors. You may already be familiar with REST APIs, which many websites use to serve their data. GraphQL is a dramatic improvement over traditional REST because it allows for getting all of the data you need in a single request (including nested operations). Hence instead of making thousands of requests and piecing together a dataset, GraphQL allows the user to submit a single query much like a traditional database transaction. Moreover, GraphQL APIs are somewhat self-documenting, meaning that it is fairly easy to inspect the API to figure out all of the things that are in the data, and how they connect to one another. There are many clients for interacting with GraphQL including gql for Python.
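
A small sketch using the gql client (the endpoint, fields, and schema here are entirely made up):

from gql import Client, gql
from gql.transport.requests import RequestsHTTPTransport

# a hypothetical GraphQL endpoint
transport = RequestsHTTPTransport(url='https://api.example.com/graphql')
client = Client(transport=transport, fetch_schema_from_transport=True)

query = gql('''
    query {
        firm(ticker: "ACME") {
            name
            filings { year formType }
        }
    }
''')

result = client.execute(query)  # one request returns the firm and its nested filings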

4.3 ScrapingBee

Sometimes the only way to get data is through traditional web scraping. Scraping is controversial, and at the very least most websites have some sort of rate limiting or bot restrictions. Other websites are weirdly designed and require javascript rendering to access their content. ScrapingBee handles the former by providing headless instances that imitate a real Chrome browser, running through different proxies (including residential IP addresses), and the latter through custom javascript rendering. Thus, ScrapingBee makes scraping much, much easier in the modern age.

Unlike the rest of the tools on this page, ScrapingBee is a service you have to pay for. But the rates are fairly reasonable considering that spinning up a custom solution (e.g., multiple AWS instances) is costly and time-consuming. Best of all ScrapingBee is fairly easy to use. If you can write a simple requests script (see above), then converting it to use ScrapingBee is trivial. Last but not least, ScrapingBee only charges you for successful requests, so if their proxies get blocked it doesn’t cost you any additional money.

Warning: If you pay for a higher tier with concurrency, do not follow the ScrapingBee tutorials and try to use multiprocessing or concurrent.futures for parallelism. Although it is syntactically simple, they run into the Python GIL and will lock after a few iterations. Instead, use aiohttp and just replace the url field with the ScrapingBee url, include your API Key as a parameter, and the url you want to scrape as another parameter.
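
To make that concrete, here is a rough sketch of the aiohttp approach; the endpoint and parameter names follow ScrapingBee's documented pattern as I recall it, so double-check them against the current docs:

import asyncio
import aiohttp

API_KEY = 'YOUR_API_KEY'  # placeholder
TARGETS = ['https://example.com/page1', 'https://example.com/page2']

async def scrape(session, target):
    # pass your key and the page you want as query parameters to the ScrapingBee endpoint
    params = {'api_key': API_KEY, 'url': target}
    async with session.get('https://app.scrapingbee.com/api/v1/', params=params) as resp:
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(scrape(session, t) for t in TARGETS))

pages = asyncio.run(main())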