Faster Word Co-Occurrence Calculation In Large Document Corpus

Topic Modeling

NPMI

Not all topics are useful. LDA, in particular, is known to produce garbage topics for short, semantically poor documents. How do we know whether a topic is good or not? One way to tell is to calculate topic coherence. There are several coherence metrics, but in this article we will focus on one of the most widely used: Normalized Pointwise Mutual Information (NPMI).
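
For reference, here is the standard definition of NPMI for a word pair, written out as a sketch. The notation is an assumption made for illustration: N is the number of documents, D(w) is the number of documents containing w, and D(w_i, w_j) is the number of documents containing both words.

```latex
\mathrm{NPMI}(w_i, w_j)
  = \frac{\log \dfrac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)},
\qquad
P(w) = \frac{D(w)}{N},
\qquad
P(w_i, w_j) = \frac{D(w_i, w_j)}{N}
```

The score ranges from -1 (the words never appear together) through 0 (they appear together exactly as often as chance predicts) to 1 (they always appear together).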

  1. How many documents wj occurs in;
  2. How many documents both wi and wj occur in.

Notice how vanilla NPMI has constant space complexity, while memoization uses approximately 80KB of space for 10 words.
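
The trade-off between the vanilla and memoized versions can be sketched roughly as follows. This is a minimal illustration, not the article's original code: the toy corpus `DOCS`, the helper names, and the conventions for the two edge cases (a pair that never co-occurs, and a pair present in every document) are all assumptions.

```python
import math
from functools import lru_cache

# Hypothetical toy corpus: each document is reduced to its set of unique words.
DOCS = [
    {"cat", "dog", "pet"},
    {"cat", "pet"},
    {"dog", "bone"},
    {"stock", "market"},
]


@lru_cache(maxsize=None)
def doc_freq(word):
    """Number of documents containing `word` (cached after the first call)."""
    return sum(1 for doc in DOCS if word in doc)


@lru_cache(maxsize=None)
def _co_doc_freq(pair):
    """Number of documents containing both words of a sorted pair (cached)."""
    a, b = pair
    return sum(1 for doc in DOCS if a in doc and b in doc)


def co_doc_freq(w_i, w_j):
    # Sort the pair so (w_i, w_j) and (w_j, w_i) share one cache entry.
    return _co_doc_freq(tuple(sorted((w_i, w_j))))


def npmi(w_i, w_j):
    """NPMI of two words, with probabilities estimated from document counts."""
    n_docs = len(DOCS)
    p_ij = co_doc_freq(w_i, w_j) / n_docs
    if p_ij == 0.0:
        return -1.0  # assumed convention: the words never co-occur
    if p_ij == 1.0:
        return 1.0   # assumed convention: avoids dividing by -log(1) = 0
    p_i = doc_freq(w_i) / n_docs
    p_j = doc_freq(w_j) / n_docs
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)


if __name__ == "__main__":
    for pair in [("cat", "pet"), ("cat", "dog"), ("cat", "stock")]:
        print(pair, npmi(*pair))
```

Without the `lru_cache` decorators every call rescans the corpus, which needs no extra memory but repeats work. With them, each count is computed once and then kept in memory, which is where the additional space goes.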

Introducing Matrices

Matrices are everywhere. Modern hardware is well adapted to fast matrix manipulation and computation. We can exploit that to speed up our NPMI computation even more.
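
One way to express this with matrices, shown here as a sketch rather than the article's exact approach, is to encode the corpus as a binary document-term matrix X (rows are documents, columns are words). The product of X transposed with X then holds every pairwise co-occurrence count at once, with document frequencies on the diagonal. The function name `npmi_matrix` and the use of NumPy are assumptions, as are the edge-case conventions.

```python
import numpy as np


def npmi_matrix(doc_term):
    """Pairwise NPMI for all words from a binary document-term matrix.

    doc_term has shape (n_docs, n_words); entry [d, w] is 1 if word w
    appears in document d, else 0.
    """
    X = np.asarray(doc_term, dtype=np.float64)
    n_docs = X.shape[0]

    # co[i, j] = number of documents containing both word i and word j.
    # The diagonal co[i, i] is the document frequency of word i.
    co = X.T @ X
    p_ij = co / n_docs
    p_i = np.diag(p_ij)

    # Zero counts produce log(0) and 0/0 here; they are patched below.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / np.outer(p_i, p_i))
        result = pmi / -np.log(p_ij)

    result[p_ij == 0.0] = -1.0  # assumed convention: never co-occur
    result[p_ij == 1.0] = 1.0   # assumed convention: present in every document
    return result


if __name__ == "__main__":
    # Columns: cat, dog, pet, bone, stock, market (hypothetical vocabulary).
    X = [
        [1, 1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [0, 1, 0, 1, 0, 0],
        [0, 0, 0, 0, 1, 1],
    ]
    print(np.round(npmi_matrix(X), 3))
```

For a large vocabulary the dense product becomes memory-hungry, so a sparse representation (for example `scipy.sparse`) would be the natural next step; that is left out here to keep the sketch short.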
