Software
1 Some command line tools worth learning
1.1 anaconda
Specifically miniconda, micromamba, or mamba (miniforge).
Python is one of the most flexible tools for data analysis, but as a general
purpose programming language it does not come with all of the tools installed by
default. As such you will likely need to install some packages like pandas, numpy, or even jupyter notebook. anaconda is one of many metapackages containing almost everything you will need. Plus it comes with a great package manager, conda, which automatically resolves dependencies/requirements for you and allows you to create your own environments. However, I do not recommend using the default conda installer as it has become very slow, so instead I recommend a third-party installer like mamba, which is much faster.
The default anaconda installation can be several gigabytes, and over time I’ve found that I never use some packages. That’s where miniconda or miniforge comes in. It’s a much lighter-weight version of anaconda with the basic tools to allow you to install only what you need. Or if you prefer even more minimalism, you can install micromamba, which is a self-contained package manager that does not require any python environment of its own. If you don’t know what you will need, go ahead and install anaconda.
If you are using miniconda, for starters I would recommend at least the following:
mamba install jupyter pandas numpy scipy scikit-learn --yes
1.1.1 conda-forge
Some packages may not be available in the default conda channel. Oftentimes these packages are available in the conda-forge channel. conda-forge is a vast, community-led project. As such you can often find what you need, but be warned that sometimes the dependencies will conflict with your base anaconda installation.
It’s always worth checking conda-forge, as there are even non-python binaries available, such as htop. You never know what you will find there.
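For example, installing a package from conda-forge is just a matter of naming the channel:

conda install -c conda-forge htop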
1.1.2 environments
If you have different dependency requirements due to conflicts between packages or channels, or if you just want to freeze each individual project’s requirements, then you will want to explore environments.
Using conda you can easily create your own environments and specify the exact packages and version numbers you want:
conda create -n myenv python=3.6 scipy=0.15.0
To start/stop using an environment:
conda activate myenv
conda deactivate
If you are going to use environments, I highly recommend creating a basic .yml configuration file to keep track of what packages you are using in case you need to move to a different computer. If you need to change a package, just update the .yml file. Creating one for each new project takes only a little bit of time, but speaking from experience, this can save you hours later on when trying to figure out why your old code is no longer running on a fresh anaconda install. For example, the above command for creating an environment could be represented as:
name: myenv
channels:
  - conda-forge
  - nodefaults
dependencies:
  - python=3.6.0
  - scipy=0.15.0
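With that file saved as (say) environment.yml, re-creating the environment on a new machine is a single command, and after editing the file you can bring an existing environment back in sync:

conda env create -f environment.yml
conda env update -n myenv -f environment.yml --prune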
1.1.3 pip
As anaconda has grown and gotten more complex, resolving dependencies can sometimes become a very difficult and time-consuming task. This is especially true if you are working off of a highly customized anaconda installation–which tends to happen if you have an old install that you keep adding to.
In times like this it is useful to fall back on the default python package manager, pip. pip does some dependency management, and by default reads from PyPI, the main Python Package Index, but can be tweaked to install from GitHub or from local files.
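For illustration, those three install sources look like this (the GitHub repository is real; the local path is a placeholder):

pip install requests                                 # from PyPI
pip install git+https://github.com/psf/requests.git  # from GitHub
pip install ./my-local-package                       # from a local directory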
If you find yourself frustrated by conda, don’t forget: there is always pip!
Fortunately, it is also very easy to add pip packages to your environment .yml files:
dependencies:
  - pip:
    - scipy==0.15.0
1.1.4 conda for R
Yes I know, this is blasphemy. However, sometimes just getting a basic R installation up and running can take some time, as you might run into issues getting things to compile correctly on your system. In fact, getting R set up on different machines is what drove me to switch to python and to try anaconda in the first place. anaconda has actually supported R for years now, and just like with python it is an easy way to get started. Plus, with projects like IRkernel, which allows you to run R kernels in jupyter notebook, it is a pretty good way to do interactive data analysis. Most R packages can be installed via conda with r-[package name].
Much like anaconda for python, I recommend that new R users start with tidyverse and data.table for data analysis:
conda install r-base r-essentials r-irkernel r-tidyverse r-data.table
If, like me, you find yourself constantly playing with different R packages, then you may actually want to rely on R’s built-in install.packages(), or the devtools::install_github() included in tidyverse. This may seem counterintuitive, as I have (hopefully) just convinced you that conda is the best tool for package management in every context–but just as you may want to fall back on pip for python, you may find it easier to just use the default package manager in R after you do your initial conda install.
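For example, from within an R session (the GitHub repository here is just an illustration):

install.packages("data.table")
devtools::install_github("tidyverse/dplyr")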
1.2 curl
These days much of the data you will want to analyze lives on the web. If you
want to get this data you will run into terms like HTTP, FTP, REST APIs, JSON.
If you want to work with these things you should probably learn curl.
The basic usage of curl is simple, yet endlessly customizable:
curl [options] [URL...]
Why curl over something like wget? curl by default writes to STDOUT, while wget writes directly to disk. This gives curl a significant advantage as a flexible utility in between pipes.
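A minimal sketch (the URLs are placeholders): fetch a JSON payload and pass it straight down a pipe, or follow redirects and save to a file instead. The -s flag silences the progress meter and -L follows redirects:

curl -s https://api.example.com/v1/items | head
curl -sL https://example.com/data.csv -o data.csv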
Bonus tip: curl is available in conda-forge.
1.3 jq
jq is like awk but specifically for json files. It is fast, flexible, and plays nicely with your other command-line tools.
It is extremely useful for pretty-printing your json files, and can be easily
used to glob multiple files. The more advanced syntax including select/map
can help you explore very complex files in just a few lines of code.
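A minimal sketch, assuming a file data.json with an items array whose objects have price and name fields (all hypothetical):

jq . data.json                                       # pretty-print
jq '.items[] | select(.price > 10) | .name' data.json  # filter and extract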
Bonus tip: jq is available in conda-forge.
1.4 xmlstarlet
xmlstarlet is like awk/sed/grep for XML files.
It is extremely useful for browsing the structure of your XML files (xmlstarlet el), and subsequently extracting the relevant information (xmlstarlet sel). It is written in C so it is very fast.
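A short sketch, assuming a file data.xml containing book elements with title children (hypothetical names):

xmlstarlet el data.xml                           # list the element structure
xmlstarlet sel -t -v '//book/title' -n data.xml  # extract values via XPath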
1.5 xsv
Oftentimes you need to look through a csv file on the command line, and maybe you even want to do some basic analysis. Enter xsv, which is written in Rust and is extremely fast and capable. xsv can quickly filter, join, pretty print, etc. a csv, which makes it an invaluable tool.
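A few illustrative one-liners (the file and the state column are hypothetical):

xsv headers data.csv                         # list column names
xsv search -s state CA data.csv | xsv table  # filter a column, pretty print
xsv stats data.csv | xsv table               # per-column summary statistics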
I used to recommend csvkit, which is written in python, but the performance of xsv has convinced me to switch for good.
1.6 dtach
dtach is like a much lighter-weight version of screen or tmux. As such, it is very easy to use and is also a lot less finicky.
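Basic usage revolves around a socket file (the path below is arbitrary): -A attaches to a session, creating it first if necessary, and -a re-attaches later. By default you detach with Ctrl-\:

dtach -A /tmp/work.sock bash
dtach -a /tmp/work.sock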
Note: you will have to build dtach from source yourself.
1.7 Rclone
Rclone is a command-line tool for accessing data located in cloud storage, including common enterprise tools like Box, Dropbox, or Google Drive.
It’s fairly easy to use because it has commands that should be familiar to most unix users. Moreover, it is pretty fast.
It also happens to be installed by default on WRDS (/usr/bin/rclone).
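Getting started is a short sequence of familiar-feeling commands ("mydrive" below is whatever you named your remote during setup):

rclone config                            # interactive, one-time remote setup
rclone ls mydrive:                       # list files on the remote
rclone copy mydrive:project/data ./data  # download a directory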
1.8 GNU parallel
GNU parallel is a great command line utility written in Perl which allows for very fine-tuned control over parallelization. If you are familiar with something like xargs, then parallel is like a more robust, scalable version of xargs.
Admittedly the learning curve for parallel can be a bit high, but it makes replacing serial loops with parallel tasks very easy.
Suppose you have a script SOMETHING which you want to run over a list of csv files in your current directory:
for i in $(find *.csv); do ./SOMETHING $i; done
One way to easily parallelize this in bash is to add &:
for i in $(find *.csv); do ./SOMETHING $i & done
You could also accomplish the same task with a pipe, assuming SOMETHING reads file names from standard input:
find *.csv | ./SOMETHING
or if the number of csv files is large you can use xargs:
find *.csv | xargs ./SOMETHING
If you want more fine-tuned control, such as over the number of concurrent jobs, then that is where parallel comes in:
find *.csv | parallel -j8 ./SOMETHING
parallel is very powerful, and can handle things like parsing arguments and concurrent writing in a safe way. Suppose that your input is a pipe-delimited file that you want to pass as arguments to your script, with the output collected in a single file:
cat INPUT.csv | parallel --colsep '\|' "./SOMETHING {1} {2}" > OUTPUT.csv
Just remember your bash quoting rules and you will be fine!
2 Some python libraries worth learning
2.1 requests
requests is a dead-simple HTTP library for python. Like curl, it is an essential building block for working with data that lives on the web (aka scraping). For example, many websites are now built around REST APIs and deliver JSON payloads. Rather than scraping HTML with something like BeautifulSoup, lxml, or worst of all Selenium, you can save yourself a lot of time and preserve your sanity by just using requests to get at the underlying data. All you need is the Inspect window of your browser and some patience, and soon you will be an API scraping master.
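A minimal sketch of the pattern, with a hypothetical endpoint and parameters of the kind you would find via the Inspect window:

import requests

resp = requests.get(
    "https://api.example.com/v1/search",    # hypothetical REST endpoint
    params={"q": "widgets", "page": 1},     # query string parameters
    headers={"User-Agent": "Mozilla/5.0"},  # some APIs reject the default UA
)
resp.raise_for_status()  # fail loudly on HTTP errors
data = resp.json()       # the JSON payload as python dicts/lists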
Bonus tip: https://curl.trillworks.com/ is a great website that converts curl statements into requests code. This is especially useful because some browsers allow you to copy HTTP requests as curl commands, which you can then easily convert into requests code!
2.2 asyncio
asyncio is part of the python standard library as of python 3.4. It is a library for running concurrent (single-threaded) code, and brings python to the forefront of event-driven programming. That is a fancy way of saying that it is a neat library that can help you write highly parallel code, your own network apps, or even some pretty fancy scrapers.
asyncio has spawned its own ecosystem of libraries, such as aiohttp, which is like an async version of requests, and aiofiles for dealing with the filesystem asynchronously.
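As a minimal sketch (python 3.7+, placeholder URLs), here is the basic aiohttp pattern for fetching many pages concurrently on a single thread:

import asyncio
import aiohttp

URLS = [f"https://api.example.com/v1/items/{i}" for i in range(10)]  # placeholders

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.json()

async def main():
    async with aiohttp.ClientSession() as session:
        # schedule all requests at once; gather preserves input order
        return await asyncio.gather(*(fetch(session, u) for u in URLS))

results = asyncio.run(main())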
2.3 pandas
You have data. You use python. If these conditions apply, then you should use pandas. The genius of pandas is that it provides the DataFrame: an indexed, two-dimensional, potentially heterogeneous and hierarchical table of rows and columns. In all likelihood 99% of the data you analyze with statistical techniques will fit into the DataFrame structure, and pandas makes working with DataFrames a breeze with powerful functions for data serialization and transformation.
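For a taste, a toy example (the columns are made up):

import pandas as pd

df = pd.DataFrame({
    "firm": ["A", "A", "B", "B"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10.0, 12.5, 7.0, 9.5],
})

print(df.groupby("firm")["sales"].mean())  # split-apply-combine in one line

df.to_csv("sales.csv", index=False)  # serialization is one call each way
df2 = pd.read_csv("sales.csv")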
2.4 ujson
ujson stands for UltraJSON, an ultra-fast JSON serializer written in C with python bindings. For most applications you can use it as a drop-in replacement for the default python json module, which is written in pure python and as such is slower.
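Usage really is drop-in for the common calls (though not every option of the standard json module is supported):

import ujson as json  # loads/dumps keep the same names

payload = json.loads('{"a": [1, 2, 3]}')
text = json.dumps(payload)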
2.5 benedict
benedict is a python dictionary subclass that makes navigating dictionaries in python a lot easier. In many ways it is like BeautifulSoup, which is very good at working with irregular or malformed HTML/XML data, but for python dictionaries and JSON-like data. It is not as full-featured as many of the libraries on this list, but it can be very useful if you are working with irregular JSON data.
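A small sketch of the keypath access that makes it handy (the data is hypothetical):

from benedict import benedict

d = benedict({"a": {"b": {"c": 42}}})

print(d["a.b.c"])            # keypath access instead of d["a"]["b"]["c"]
print(d.get("a.b.missing"))  # None instead of a KeyError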
2.6 numba
At first glance, numba seems like an odd choice for python users. The appeal of python is that it is an interpreted language, and hence does not need to be compiled to run; numba is a compiler for python code. However, it is an easy-to-use, just-in-time (JIT) compiler built on the LLVM compiler library. That means it can take very simple python and numpy code and turn it into LLVM-compiled code that is nearly as fast as C or FORTRAN code.
A good use case for numba is taking an expensive matrix multiplication and re-writing it as a loop. This may seem counterintuitive, as the whole point of numpy is to abstract away from slow python loops with optimized matrix operations. Yet these dumb, slow python loops combined with numba can be significantly faster than their numpy counterparts if used correctly.
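A minimal sketch of the pattern: a plain python loop over a numpy array, compiled on first call:

import numpy as np
from numba import njit

@njit  # JIT-compiled to machine code the first time it is called
def loop_sum(x):
    total = 0.0
    for i in range(x.shape[0]):  # a dumb, slow python loop, made fast
        total += x[i]
    return total

x = np.random.rand(1_000_000)
print(loop_sum(x))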
3 Some R libraries worth learning
3.1 tidyverse
tidyverse is a metapackage of data analysis tools for R. In many ways it is like the anaconda default installation in that it includes so many of the essentials. To get started analyzing data in a modern R setup you will likely need ggplot2, dplyr, stringr, and purrr, just to name a few. All of these are part of tidyverse.
tidyverse also contains one of the most useful packages in any language: haven, which allows you to read SAS and Stata files. Look, we can all pretend like we don’t have co-authors that use these languages, or we can deal with it and use haven.
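For example (the file names are placeholders):

library(haven)

stata_df <- read_dta("coauthor_results.dta")       # Stata
sas_df   <- read_sas("coauthor_results.sas7bdat")  # SAS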
3.2 data.table
data.table is very fast and has an intuitive syntax. It is certainly different from tidyverse::dplyr, but for those familiar with pandas, PyTables, or sql it may be more intuitive.
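A small sketch of the DT[i, j, by] idiom (the file and columns are hypothetical):

library(data.table)

dt <- fread("sales.csv")  # fast csv reader

# filter rows, compute an aggregate, and group, all in one expression
dt[year == 2021, .(mean_sales = mean(sales)), by = firm]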
Bonus tip: DataCamp has a great cheat sheet for data.table.
4 Other Useful Software
4.1 DuckDB
DuckDB is a SQL-style database with very convenient syntax for data analysis. It works very well with standard tabular data formats, and plays nicely with python. It is also very fast for analytic workflows, including reading/writing and processing data. One of the most useful features is the variety of built-in datatypes, such as LIST and STRUCT, which map to python/JSON datatypes. It also allows for nested or composite datatypes, which are often found in real-world data. By default, DuckDB operates on in-memory databases. In many ways, it is like SQLite but designed for data analysis workflows. It also features a lot of useful extensions which facilitate full-text search, JSON querying, reading/writing remote files over HTTP, and reading from Postgres/SQLite databases.
DuckDB now plays nicely with both python and R, and with their respective dataframes. It is also very, very good at reading all sorts of common data files like csv, json, and parquet.
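A minimal sketch from python (the file and column names are hypothetical):

import duckdb
import pandas as pd

# query a csv file directly with SQL
duckdb.sql("SELECT firm, avg(sales) FROM 'sales.csv' GROUP BY firm").show()

# DataFrames in local scope can be queried by name and returned as DataFrames
df = pd.DataFrame({"x": [1, 2, 3]})
out = duckdb.sql("SELECT sum(x) AS total FROM df").df()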
4.2 ScrapingBee
Sometimes the only way to get data is through traditional web scraping. Scraping is controversial, and at the very least most websites have some sort of rate limiting or bot restrictions. Other websites are weirdly designed and require javascript rendering to be able to access content. ScrapingBee handles the former by providing headless instances that imitate a real Chrome browser, running through different proxies (including residential IP addresses), and the latter through custom javascript rendering. Thus, ScrapingBee makes scraping much, much easier in the modern age.
Unlike the rest of the tools on this page, ScrapingBee is a service you have to pay for. But the rates are fairly reasonable considering that spinning up a custom solution (e.g., multiple AWS instances) is costly and time-consuming. Best of all, ScrapingBee is fairly easy to use. If you can write a simple requests script (see above), then converting it to use ScrapingBee is trivial. Last but not least, ScrapingBee only charges you for successful requests, so if their proxies get blocked it doesn’t cost you any additional money.
Warning: If you pay for a higher tier with concurrency, do not follow the ScrapingBee tutorials and try to use multiprocessing or concurrent.futures for parallelism. Although it is syntactically simple, they run into the Python GIL and will lock up after a few iterations. Instead, use aiohttp and just replace the url field with the ScrapingBee url, include your API key as one parameter, and the url you want to scrape as another parameter.
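A minimal sketch of that pattern, assuming ScrapingBee’s GET endpoint takes api_key and url as query parameters (check their current docs before relying on this):

import asyncio
import aiohttp

API_KEY = "YOUR-API-KEY"  # placeholder
SB_ENDPOINT = "https://app.scrapingbee.com/api/v1/"  # assumed endpoint; verify against the docs

async def scrape(session, target_url):
    params = {"api_key": API_KEY, "url": target_url}
    async with session.get(SB_ENDPOINT, params=params) as resp:
        return await resp.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(scrape(session, u) for u in urls))

pages = asyncio.run(main(["https://example.com/a", "https://example.com/b"]))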