Python is one of the best choices for analyzing data, and arguably the most popular general-purpose programming language today. One of its main advantages for data scientists is the abundance of libraries and frameworks for almost any goal you have in mind. In this article I present the top five data analysis libraries that I regularly use in my workflow.
Every analysis starts with collecting, representing, and storing data. When it comes to representing and storing data, the priorities have to be a) access speed, b) space efficiency, and c) ease of use. In my opinion, pandas achieves all of the above. Pandas allows massive dataset manipulations without iterating over each row individually. This comes in especially handy for things like filters:
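A minimal sketch of such a filter, using a small hypothetical review dataset (the `rating` and `date` column names are assumptions here, not the original dataset's schema):

```python
import pandas as pd

# Hypothetical app-review data: one row per review.
reviews = pd.DataFrame({
    "rating": [1, 5, 2, 4, 1, 3, 2],
    "date": pd.to_datetime([
        "2020-05-01", "2020-06-15", "2021-03-10", "2020-11-20",
        "2021-02-01", "2020-12-31", "2020-08-05",
    ]),
})

# Vectorized filter: 1-2 star reviews published before 2021-01-01,
# with no explicit loop over rows.
sample = reviews[(reviews["rating"] <= 2) & (reviews["date"] < "2021-01-01")]
```

The boolean masks combine with `&`, and pandas applies them to every row at once.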
The code above, applied to a dataset of app reviews, selects a sample of 1–2 star reviews published before January 1st, 2021.
Another thing that I like about pandas is the ability to quickly generate and represent DataFrames (the default dataset format) in a print-friendly form, like this:
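As a sketch, here is a hypothetical DataFrame of p-values rendered in a print-friendly form (the group names and values below are made up for illustration):

```python
import pandas as pd

# Hypothetical test results, one row per group.
results = pd.DataFrame({
    "group": ["A", "B", "C"],
    "p_value": [0.012, 0.430, 0.048],
})

# to_string() produces a plain-text table that can be copy-pasted anywhere.
print(results.to_string(index=False))
```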
The table above shows p-values for various groups tested, which were calculated and entered into a DataFrame. This DataFrame can then be printed on the screen and copy-pasted into a document.
So why is pandas so efficient? Well, simply because it's built on top of numpy, which brings me to my next library of choice…
Numpy is the go-to choice for large-scale matrix computations in Python. Tensorflow, for instance, requires your data to be converted to numpy's ndarray format. One of the reasons numpy is preferred is that at its core it's just C: primitives such as int or bool are closely tied to C types, so you get the performance of C with the convenience of Python. Much of this comes from the fact that ndarrays, numpy's default container format, are contiguous in memory and contain a single data type, unlike Python's lists, which can contain different data types and are not necessarily contiguous. This makes millions of operations on ndarrays extremely fast. And I haven't even scratched the surface of numpy's capabilities: ndarrays can also be multidimensional, which is very useful in matrix computations and deep learning.
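A small illustration of the point about single-typed, contiguous storage and C-speed operations:

```python
import numpy as np

# One contiguous block of a single C type (here 64-bit integers).
arr = np.arange(1_000_000, dtype=np.int64)
assert arr.flags["C_CONTIGUOUS"]

# The multiplication runs as a C loop over the whole buffer,
# with no Python-level iteration.
doubled = arr * 2
```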
The fact that pandas uses numpy underneath makes the conversion between them very easy, which I personally do all the time.
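The round trip is just a couple of method calls, e.g.:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# DataFrame -> ndarray (a single common dtype is chosen for the array).
arr = df.to_numpy()

# ndarray -> DataFrame, reattaching the column labels.
df2 = pd.DataFrame(arr, columns=df.columns)
```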
One of the jokes data scientists make is that every workflow starts with the same two lines of code, and there is definitely an important reason for that.
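The two lines in question are, almost certainly, the canonical imports:

```python
import numpy as np
import pandas as pd
```

The `np` and `pd` aliases are so entrenched that most documentation and tutorials simply assume them.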
sklearn is the library that enables machine learning in Python. This library contains a variety of supervised and unsupervised ML algorithms and data preprocessing tools, organized in a very reusable and extensible architecture. You can create your own sklearn classes (called estimators) with your own algorithms very easily: all you have to do is inherit from sklearn.base.BaseEstimator and implement fit and, if applicable, predict.
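A toy sketch of a custom estimator following that convention (the `MeanPredictor` name and its trivial algorithm are my invention, not part of sklearn):

```python
import numpy as np
from sklearn.base import BaseEstimator

class MeanPredictor(BaseEstimator):
    """Toy estimator that always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self  # sklearn convention: fit returns self

    def predict(self, X):
        # One constant prediction per input row.
        return np.full(len(X), self.mean_)

model = MeanPredictor().fit([[1], [2], [3]], [10, 20, 30])
preds = model.predict([[4], [5]])
```

Inheriting from `BaseEstimator` also gives you `get_params`/`set_params` for free, so the class plugs into tools like grid search.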
In my work, I have found that sklearn covers 99% of my needs when it comes to data analysis. Some things even surprised me. For one of my papers, I needed to implement a binary token matrix for a Wikipedia dump for faster lookup. This matrix would mark the words that are present in each article. I was delighted to learn that the standard CountVectorizer, which counts the words in a dataset of texts, also provides a binary argument:
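A minimal sketch with toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the cat"]

# binary=True records presence/absence instead of raw counts.
vectorizer = CountVectorizer(binary=True)
matrix = vectorizer.fit_transform(docs)

# Every stored entry is 1, even for words repeated within a document.
```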
Another notable feature that sklearn provides is pipelines, which simplify and organize your workflow. Overall, sklearn is definitely worth learning and mastering.
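For instance, a scaler and a classifier can be chained into a single estimator (the toy data below is made up):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The pipeline behaves like one estimator: fit() runs every step in order.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
pipe.fit(X, y)
pred = pipe.predict([[3.0]])
```

Because the whole chain is one object, it can be cross-validated or grid-searched without leaking preprocessing into the test folds.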
SciPy is another library that provides tools for numerical computations and linear algebra. However, the main reason why this library is particularly useful for data analysis, in my opinion, is its statistical capabilities. In particular, it is extremely easy to execute various statistical tests in SciPy:
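For example, a two-sample t-test is a single call (the samples below are hypothetical):

```python
from scipy import stats

group_a = [2.1, 2.5, 2.8, 3.0, 2.4]
group_b = [3.1, 3.4, 2.9, 3.6, 3.3]

# Independent two-sample t-test; returns the statistic and the p-value.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
```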
Another useful feature of SciPy is sparse matrices. A sparse matrix is an efficient representation of a regular matrix: if a matrix contains mostly zero elements, then instead of storing every element in memory, we can store only the positions and values of the non-zero elements and save a significant amount of space.
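A quick sketch with scipy.sparse, using the CSR (compressed sparse row) format:

```python
import numpy as np
from scipy import sparse

# A 1000x1000 matrix with only two non-zero entries.
dense = np.zeros((1000, 1000))
dense[0, 0] = 1.0
dense[500, 250] = 2.0

# CSR stores just the non-zero values plus their index arrays,
# instead of all one million cells.
csr = sparse.csr_matrix(dense)
```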
The advantages of sparse matrices are particularly useful for binary matrices for a set of documents, since documents rarely contain all the possible words in the dataset. In my work, sparse matrices are immensely useful, saving gigabytes of RAM.
Matplotlib is the main library for plotting and visualizing your results. The plots can be saved to disk in various formats and quality settings. One thing I like about this library is how adjustable everything is. For one of my papers I had to combine two box plots in the same figure. The catch was that the Y-axis had to be completely different for those plots, while both of them had to be visible. While I wouldn't call these things trivial in matplotlib, they are definitely achievable.
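One way to get two box plots with independent Y-axes in one figure is twinx(); this is a sketch with made-up data, not the exact figure from the paper:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

fig, ax_left = plt.subplots()
ax_right = ax_left.twinx()  # second Y-axis sharing the same X-axis

# Hypothetical data on very different scales.
ax_left.boxplot([[1, 2, 3, 4, 5]], positions=[1])
ax_right.boxplot([[100, 250, 400, 300, 500]], positions=[2])

ax_left.set_ylabel("metric A")
ax_right.set_ylabel("metric B")
fig.savefig("boxplots.png", dpi=300)  # format and quality are adjustable
```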
The libraries I list below are ones I use a lot, but they are too specialized to be useful in every circumstance.
gensim — a great library for general-purpose NLP tasks, such as topic modeling. The library provides a multi-core LDA implementation, which is insanely fast. One potential drawback, however, is that it is not easily compatible with other libraries and requires converting your data to gensim’s own formats.
seaborn — another great library for visualization. seaborn is built on top of matplotlib, so all the low-level functionality is available through seaborn’s function calls as well. The library is particularly well suited to pandas DataFrames, producing truly beautiful graphs, with the ability to customize them using different predefined color schemes.
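A short sketch of the DataFrame-friendly API (the column names and data are invented):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import seaborn as sns

df = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue", "Wed", "Wed"],
    "sales": [10, 12, 15, 14, 9, 11],
})

sns.set_palette("muted")  # one of the predefined color schemes
ax = sns.barplot(data=df, x="day", y="sales")  # columns referenced by name
```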
spaCy — another useful library for NLP tasks. While I don’t have extensive experience with it, I found its dependency parsing capabilities more than adequate.
What are the best libraries for data analysis in Python in your opinion? What libraries do you personally prefer?