Faster Word Co-Occurrence Calculation In Large Document Corpus
For one of my research papers I had to calculate co-occurrence information between pairs of words for a large number of topics in order to measure topic coherence. I soon realized that a straightforward approach would take ages to complete. In this article I will explain how I decreased the computation time from days to just under 10 minutes. The complete code for this article can be found here.
In computational linguistics, word co-occurrence is a well-known concept. It essentially expresses the idea that if two words occur close to each other (e.g., in the same document), they are likely related. This information can then be used to draw useful conclusions about the language and its structure.
In my particular case, I used word co-occurrence in my research paper to calculate topic coherence for the topic models I derived from user review corpora. The paper ended up being accepted for publication at ICSE’22, which I am extremely proud of (for those who are interested, the code can be found here). The point of this Medium article is not to discuss that paper, but rather to compare the different optimization strategies for calculating word co-occurrence that I tried, to see what worked best.
Essentially, a topic model derives latent topics from a corpus of documents. Each topic is commonly represented by its top-10 most “significant” words, which are likely to describe that particular topic. Each document is then assigned a probability for each topic. For instance, if I try to model a corpus of user reviews for food delivery apps, I might produce a topic like this:
drive, driver, food, minute, around, house, away, street, go, car
This tells me that the topic most likely describes user reviews that complain about the location and timing of their delivery. We can then use this topic to list the reviews with the highest probability of it, dive deep, and analyze what problems related to that topic users experience.
Not all topics are useful. LDA, in particular, is known to produce garbage topics for short, semantically-poor documents. How do we know if a topic is good or not? One way to tell is to calculate topic coherence. There are different coherence metrics out there, but in this article we will focus on one of the most widely used ones: Normalized Pointwise Mutual Information (NPMI).
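For reference, NPMI for a pair of words is usually written as follows, with probabilities estimated from document counts over a corpus of N documents:

```latex
\mathrm{NPMI}(w_i, w_j) =
  \frac{\log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)},
\qquad
P(w) = \frac{\lvert \{ d : w \in d \} \rvert}{N}
```

The score ranges from −1 (the words never co-occur) through 0 (independence) to 1 (they always occur together).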
The deep meaning behind this formula is not important for now. What is important is that we have two words: wi and wj, and three pieces of information that we need to collect from a document corpus:
- How many documents wi occurs in;
- How many documents wj occurs in;
- How many documents both wi and wj occur in.
NPMI for a specific topic is calculated pairwise, that is, between each pair of words; for 10 words that makes 45 pairs in total. These pair scores are then averaged to produce the final metric. The most straightforward approach can be implemented as follows:
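A sketch of this naive approach (names are illustrative, not the paper’s exact code; documents are represented as sets of tokens):

```python
from itertools import combinations
from math import log

def npmi_naive(topic_words, documents):
    """Naive NPMI: rescan the entire corpus for every word pair."""
    n_docs = len(documents)
    scores = []
    for wi, wj in combinations(topic_words, 2):  # 45 pairs for 10 words
        count_i = count_j = count_ij = 0
        for doc in documents:  # three if-tests per document
            if wi in doc:
                count_i += 1
            if wj in doc:
                count_j += 1
            if wi in doc and wj in doc:
                count_ij += 1
        if count_ij == 0:
            scores.append(-1.0)  # convention: words that never co-occur score -1
            continue
        p_i, p_j, p_ij = count_i / n_docs, count_j / n_docs, count_ij / n_docs
        scores.append(log(p_ij / (p_i * p_j)) / -log(p_ij))
    return sum(scores) / len(scores)  # average over all pairs
```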
Here we loop over our document corpus (from now on we will use an official Wikipedia dump as our document corpus, consisting of the first 100k articles) and perform three if-tests per document. The running time of this approach grows like a triangular sequence as we increase the number of words per topic, because we recount the documents for a particular word every time it appears in a word pair. We can immediately notice that some of this work is redundant, and we can save time by using memoization, like so:
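A sketch of the memoized variant, under the same assumptions as before (the `keywords_counts` dictionary is the one the article describes; here it stores sets of document indices rather than lists):

```python
from itertools import combinations
from math import log

def npmi_memoized(topic_words, documents):
    """NPMI with memoization: index each word's documents only once."""
    n_docs = len(documents)
    keywords_counts = {}  # word -> indices of documents containing it

    def doc_ids(word):
        if word not in keywords_counts:
            keywords_counts[word] = {
                i for i, doc in enumerate(documents) if word in doc
            }
        return keywords_counts[word]

    scores = []
    for wi, wj in combinations(topic_words, 2):
        ids_i, ids_j = doc_ids(wi), doc_ids(wj)
        count_ij = len(ids_i & ids_j)  # set intersection of document indices
        if count_ij == 0:
            scores.append(-1.0)
            continue
        p_i = len(ids_i) / n_docs
        p_j = len(ids_j) / n_docs
        p_ij = count_ij / n_docs
        scores.append(log(p_ij / (p_i * p_j)) / -log(p_ij))
    return sum(scores) / len(scores)
```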
We now have a dictionary `keywords_counts` which computes and stores the indices of the documents that contain a particular word. Additionally, to count the documents that contain both words, we apply a set intersection to the two lists of indices. Compared to the previous approach, this essentially trades memory for speed.
Time is money, while memory is (relatively speaking) cheap on a modern computer. The memoized approach is clearly the better choice. But can we do even better?
Matrices are everywhere. Modern hardware is optimized for fast matrix manipulation and computation. We can exploit that to speed up our NPMI computation even more.
For starters, let’s transform our original Wikipedia document corpus into a matrix, where each row represents a document and each column represents a word. Since we don’t need to know how many times a word occurs in a document, a binary matrix suffices.
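A minimal sketch of this step using scikit-learn’s `CountVectorizer` (the toy documents below stand in for the Wikipedia articles):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the 100k Wikipedia articles.
documents = [
    "the driver left the food outside the house",
    "delivery took forty minutes to arrive",
    "the driver could not find the street",
]

# binary=True gives a 0/1 document-term matrix: we only care whether
# a word occurs in a document, not how often.
vectorizer = CountVectorizer(binary=True)
sparse_matrix = vectorizer.fit_transform(documents)  # scipy CSR (compressed)
dense_matrix = sparse_matrix.toarray()               # plain numpy array

word_index = vectorizer.vocabulary_  # maps each word to its column index
```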
Scikit-learn’s implementation automatically constructs a compressed (sparse) matrix, which saves a significant amount of space but takes somewhat longer to retrieve individual rows or columns. We will test both the compressed (sparse) and uncompressed (dense) formats. The new NPMI implementation looks like this:
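A sketch of the matrix-based version for the dense case (illustrative names; the sparse variant would extract columns from the CSR matrix and densify them first):

```python
from itertools import combinations
from math import log

import numpy as np

def npmi_matrix(topic_words, doc_term, vocabulary):
    """NPMI over a binary (n_docs, n_words) document-term matrix.

    vocabulary maps each word to its column index; cached columns play
    the role of the memoization from the previous implementation.
    """
    n_docs = doc_term.shape[0]
    columns = {}  # memoize the extracted column for each word

    def column(word):
        if word not in columns:
            columns[word] = doc_term[:, vocabulary[word]]
        return columns[word]

    scores = []
    for wi, wj in combinations(topic_words, 2):
        col_i, col_j = column(wi), column(wj)
        # Dot product of two binary columns = number of documents
        # containing both words (the "intersection" of the columns).
        count_ij = int(col_i @ col_j)
        if count_ij == 0:
            scores.append(-1.0)
            continue
        p_i = col_i.sum() / n_docs   # column sum = document frequency
        p_j = col_j.sum() / n_docs
        p_ij = count_ij / n_docs
        scores.append(log(p_ij / (p_i * p_j)) / -log(p_ij))
    return sum(scores) / len(scores)
```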
Because each column represents a word and each row represents a document, with 1s denoting whether a particular word is present, calculating how many documents a word appears in is trivial: find the index of the word, extract the column, and sum the 1s in that column. To count the documents that contain both words, I used the same idea as in the previous implementation: taking the intersection of the two columns. We also use memoization to avoid looking up the index of a word every time. Here are the results for both sparse and dense matrices:
We can see that the dense implementation takes close to 0 seconds regardless of the number of word combinations, while the sparse implementation’s time grows steadily to 3.5 seconds, since we need to convert from sparse to dense format every time. However, look at the right y-axis: while sparse takes under 1 GB of space to store its matrix, dense uses 50 GB! Clearly, the speed penalty the sparse matrix imposes is far outweighed by the enormous space required to keep a dense matrix in memory.
And here is how these approaches look in comparison to the vanilla implementation.
In this case, the obvious winner is the sparse matrix implementation, which provides excellent computation time with a small space penalty. When I computed NPMI for this article, I only used the first 100k documents from the Wikipedia dump, simply because I couldn’t fit the whole dump in memory (I have 64 GB of RAM). But thanks to the sparse implementation, I can now calculate NPMI over the whole Wikipedia dump! In fact, that’s exactly what I did in my research paper.
Even when increasing the document corpus to incorporate all of Wikipedia, the matrix grows to a modest 12 GB, while the running time remains manageable.