Oct 16, 2021

# Top-5 Essential Python Libraries for Data Analysis

Python is one of the best choices when it comes to analyzing data; arguably, it has become the go-to language for data work in general. One of the main advantages of Python for data scientists is the abundance of libraries and frameworks for almost any goal in mind. In this article I present my top-5 data analysis libraries that I regularly use in my workflow.

# Pandas

Every analysis starts with collecting, representing, and storing data. When it comes to representing and storing the data, the priorities have to be *a) access speed*, *b) space efficiency*, and *c) ease of use*. In my opinion, **pandas** achieves all of the above. **Pandas** allows massive dataset manipulations without iterating each row individually. This comes in especially handy for things like filters:
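The original snippet is not preserved here, but such a filter can be sketched roughly like this, assuming a hypothetical `reviews` DataFrame with `rating` and `date` columns:

```python
import pandas as pd

# Hypothetical dataset of app reviews (the article's actual data is not shown)
reviews = pd.DataFrame({
    "rating": [1, 5, 2, 4, 1],
    "date": pd.to_datetime(
        ["2020-05-01", "2021-03-10", "2020-11-20", "2019-07-04", "2021-02-01"]
    ),
    "text": ["bad", "great", "meh", "good", "broken"],
})

# Vectorized filter: 1-2 star reviews published before January 1st, 2021,
# computed in one shot instead of looping over rows
sample = reviews[(reviews["rating"] <= 2) & (reviews["date"] < "2021-01-01")]
print(sample)
```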

The code above, from a dataset of app reviews, takes a sample of 1–2 star rated reviews, published before January 1st, 2021.

Another thing that I like about **pandas** is the ability to quickly generate and represent DataFrames (the default dataset format) in a print-friendly form, like this:
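The original table is not preserved, but the idea can be sketched with hypothetical group comparisons and p-values:

```python
import pandas as pd

# Hypothetical p-values for illustration (the article's actual values are not shown)
results = pd.DataFrame(
    {"group": ["A vs B", "A vs C", "B vs C"], "p_value": [0.031, 0.450, 0.007]}
)

# to_string() renders a clean, copy-paste-friendly table
print(results.to_string(index=False))
```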

The table above shows p-values for various groups tested, which were calculated and entered into a DataFrame. This DataFrame can then be printed on the screen and copy-pasted into a document.

So why is pandas efficient? Well, simply because it's built on top of **numpy**, which brings me to my next library of choice…

# numpy

**Numpy** is the choice when it comes to large-scale matrix computations in Python. **Tensorflow**, for instance, requires your data to be converted to **numpy's** *ndarray* format. One of the reasons **numpy** is preferred is that at its core it's just C. Primitive types such as *int* or *bool* map closely to C types, so you get the performance of C and the convenience of Python. This is largely achieved because *ndarrays*, **numpy's** default container format, are contiguous in memory and hold a single data type, unlike Python's lists, which can contain different data types and are not necessarily contiguous. This makes millions of operations on *ndarrays* extremely fast. And I haven't even scratched the surface of **numpy's** capabilities: *ndarrays* can also be multidimensional, which is very useful in matrix computations and deep learning.
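A minimal sketch of the single-dtype, contiguous layout described above:

```python
import numpy as np

# A Python list may mix types freely; an ndarray holds one dtype, contiguously
mixed = [1, True, 2.5]                  # heterogeneous objects behind pointers
arr = np.array([1, 2, 3], dtype=np.int64)

print(arr.dtype)                        # one C-level type for every element
print(arr.flags["C_CONTIGUOUS"])        # elements sit next to each other in memory

# Vectorized arithmetic runs in C, with no Python-level loop
doubled = arr * 2
print(doubled)
```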

The fact that **pandas** uses **numpy** underneath makes the conversion between them very easy, which I personally do all the time.
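A quick sketch of going back and forth between the two formats:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# DataFrame -> ndarray (to_numpy() is the recommended accessor)
matrix = df.to_numpy()

# ndarray -> DataFrame, restoring the column labels
df_again = pd.DataFrame(matrix, columns=df.columns)

print(matrix.shape)
```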

One of the jokes data scientists make is that every workflow starts with the same two lines of code, and there is definitely a good reason for that.
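The two lines in question are, presumably, the canonical imports:

```python
import numpy as np
import pandas as pd
```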

# sklearn

**sklearn** is the library that enables machine learning in Python. It contains a variety of supervised and unsupervised ML algorithms and data preprocessing tools, organized in a very reusable and extensible architecture. You can create your own **sklearn** classes (called *estimators*) with your own algorithms very easily: all you have to do is inherit from `sklearn.base.BaseEstimator` and implement `fit`, `transform`, and `predict`, as applicable.
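A minimal sketch of such a custom estimator (the class name and behavior here are hypothetical; mixing in `TransformerMixin` additionally provides `fit_transform` for free):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical estimator: subtracts the column means learned in fit()."""

    def fit(self, X, y=None):
        self.means_ = np.asarray(X).mean(axis=0)
        return self  # sklearn convention: fit returns self

    def transform(self, X):
        return np.asarray(X) - self.means_

centerer = MeanCenterer()
X = np.array([[1.0, 10.0], [3.0, 30.0]])
centered = centerer.fit_transform(X)  # fit_transform comes from TransformerMixin
print(centered)
```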

In my work, I found that **sklearn** covers 99% of my needs when it comes to data analysis. Some things even surprised me. For one of my papers, I needed to implement a binary token matrix for a *Wikipedia* dump for a faster lookup. This matrix would mark the words that are present in each article. I was delighted to learn that the standard `CountVectorizer`, which counts the words in a dataset of texts, also provides the *binary* argument:

Another notable feature that **sklearn** provides is *pipelines*, which simplify and organize your workflow. Overall, **sklearn** is definitely worth learning and mastering.
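A pipeline chains preprocessing steps and a model behind a single fit/predict interface; a minimal sketch with made-up data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Vectorization and classification as one estimator
pipe = Pipeline([
    ("vectorize", CountVectorizer()),
    ("classify", LogisticRegression()),
])

texts = ["good app", "great app", "bad app", "terrible app"]
labels = [1, 1, 0, 0]

pipe.fit(texts, labels)           # raw text goes in; the pipeline handles the rest
prediction = pipe.predict(["good stuff"])
print(prediction)
```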

# SciPy

**SciPy** is another library that provides tools for numerical computations and linear algebra. However, the main reason why this library is particularly useful for data analysis, in my opinion, is its statistical capabilities. In particular, it is extremely easy to run various statistical tests with **SciPy**:
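The original example is not preserved; a sketch with synthetic data and two common tests from `scipy.stats`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=100)
group_b = rng.normal(loc=0.5, scale=1.0, size=100)

# Two-sample t-test: do the two groups have different means?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# Shapiro-Wilk test for normality of one sample
w_stat, p_norm = stats.shapiro(group_a)
print(f"W = {w_stat:.3f}, p = {p_norm:.4f}")
```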

Another useful feature of **SciPy** is sparse matrices. A sparse matrix is an efficient representation of a regular matrix. If we have a matrix with a lot of empty elements (for example, zeroes), instead of storing every element in memory, we can store only the non-empty elements along with their positions and save a significant amount of space.

The advantages of sparse matrices are particularly useful for binary matrices for a set of documents, since documents rarely contain all the possible words in the dataset. In my work, sparse matrices are immensely useful, saving gigabytes of RAM.
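The savings can be sketched with a mostly-zero binary document-term matrix:

```python
import numpy as np
from scipy import sparse

# 1000 "documents" x 5000 "words", with only ~0.1% of the entries set
rng = np.random.default_rng(0)
dense = (rng.random((1000, 5000)) < 0.001).astype(np.int8)

# CSR format stores only the non-zero values plus their positions
sparse_matrix = sparse.csr_matrix(dense)

sparse_bytes = (sparse_matrix.data.nbytes
                + sparse_matrix.indices.nbytes
                + sparse_matrix.indptr.nbytes)
print(f"dense:  {dense.nbytes} bytes")
print(f"sparse: {sparse_bytes} bytes")
```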

# Matplotlib

**Matplotlib** is the main library when it comes to plotting and visualizing your results. The plots can be saved to disk in various formats and quality settings. One thing I like about this library is how adjustable everything is. For one of my papers I had to combine two box-plots on the same figure. The catch was that the Y-axis had to be completely different for those plots, while both of them had to be visible. While I wouldn't call these things trivial in **matplotlib**, they are definitely achievable.
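One way to achieve that kind of figure is `twinx()`, which adds a second Y-axis sharing the same X-axis; a sketch with hypothetical data on very different scales:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; useful when no display is available
import matplotlib.pyplot as plt

# Hypothetical measurements on very different scales
small_scale = [1.1, 1.3, 0.9, 1.2, 1.0]
large_scale = [120, 340, 210, 500, 280]

fig, ax_left = plt.subplots()
ax_right = ax_left.twinx()  # second Y-axis, same X-axis

ax_left.boxplot([small_scale], positions=[1])
ax_right.boxplot([large_scale], positions=[2])

ax_left.set_ylabel("small-scale metric")
ax_right.set_ylabel("large-scale metric")
ax_left.set_xticks([1, 2])
ax_left.set_xticklabels(["metric A", "metric B"])

fig.savefig("boxplots.png")
```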

# Honorable mentions

The libraries that I list below are ones I use a lot, but they are too specialized to be useful in every circumstance.

**gensim** is a great library for general-purpose NLP tasks, such as topic modeling. It provides a multi-core LDA implementation, which is insanely fast. One potential drawback, however, is that it is not easily compatible with other libraries and requires converting your data to gensim's own format.

**seaborn** — another great library for visualization. seaborn is built on top of matplotlib, so all the low-level functionality is available from seaborn's function calls as well. The library is particularly well optimized for pandas DataFrames, producing truly beautiful graphs, with the ability to customize them using different predefined color schemes.

**spaCy** — another useful library for NLP tasks. While I don't have extensive experience with it, I found its dependency parsing capabilities more than adequate.

What are the best libraries for data analysis in Python in your opinion? What libraries do you personally prefer?