Donald Trump, a former US president, is one of the most controversial presidents in modern US history. His presidency is remembered for divisive and, at times, borderline offensive rhetoric. On January 8th, 2021, his Twitter account was suspended in the aftermath of the attack on the US Capitol.
And here’s what Trump’s reaction was:
Wait a second. How did he tweet if his account was suspended? He didn’t actually. The tweet above was generated by Markov Chains! I only supplied the initial word — “twitter” — and the rest was generated automatically.
In this series of articles, I will explore state-of-the-art techniques for text generation (part 2). We will start with simple ones and gradually ramp up the complexity. In this article, I will demonstrate how you can use Markov Chains to generate pretty believable Trump tweets. The code is available here.
In plain English, you have a set of states and an agent that jumps between these states with certain probabilities. Applied to text generation, each state can be a word (or an n-gram), and each transition probability is the probability of a particular word following that state.
In this contrived example, each state represents a word and each edge represents a possible transition between states, with a certain probability. Suppose we start in the state I and jump to the state like. From there, there is a small chance (20%) that we jump to cats and then finish on the period, producing “I like cats.”. Alternatively, we can transition from like to dogs with an 80% chance and then jump through and to cats, producing “I like dogs and cats.”.
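To make the mechanics concrete, here is a minimal sketch of that toy chain in Python. The states and probabilities are the ones from the example above; the `walk` helper is purely illustrative and not part of the project code:

```python
import random

# Toy chain from the example: each state is a word, each edge carries
# a transition probability.
transitions = {
    "I":    [("like", 1.0)],
    "like": [("dogs", 0.8), ("cats", 0.2)],
    "dogs": [("and", 1.0)],
    "and":  [("cats", 1.0)],
    "cats": [(".", 1.0)],
}

def walk(state, max_steps=10):
    """Follow the chain from `state` until we reach the terminal '.'."""
    words = [state]
    for _ in range(max_steps):
        choices, probs = zip(*transitions[state])
        state = random.choices(choices, weights=probs)[0]
        if state == ".":
            break
        words.append(state)
    return " ".join(words) + "."

print(walk("I"))  # either 'I like cats.' or 'I like dogs and cats.'
```

Run it a few times: roughly one walk in five ends after “cats”, the rest detour through “dogs and”.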
The bottom line is that many different combinations are available, and the “stochastic” part of the Markov Chains definition makes all of these sentences non-deterministic and therefore possible. That makes them especially fun for text generation! So without further ado, let’s get started with tweet generation.
For this project I used a public dump of Trump’s tweets from 2008 up until his ban. The dataset is available here.
Of all those columns, we are only interested in the text column.
Before we continue, it is a good idea to perform some basic cleaning and text preprocessing. For text generation we don’t want to stem or lemmatize the tweets — we want to preserve the original words. But some things, like hyperlinks or quotation marks, are not necessary, so we remove them.
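Here is a simplified version of that cleaning step; the exact regular expressions are illustrative, not the project’s actual code:

```python
import re

def preprocess(tweet):
    """Strip hyperlinks and quotation marks, lowercase, and tokenize
    while keeping punctuation as separate tokens."""
    tweet = re.sub(r"https?://\S+", "", tweet)        # drop hyperlinks
    tweet = tweet.replace('"', "").replace("'", "")   # drop quotation marks
    tweet = tweet.lower()
    # keep words (plus @handles/#hashtags) and .,!? as their own tokens
    return re.findall(r"[\w@#]+|[.!?,]", tweet)

print(preprocess('Getting a little exercise this morning! https://t.co/abc'))
# → ['getting', 'a', 'little', 'exercise', 'this', 'morning', '!']
```

Note that the exclamation mark survives as its own token — that is deliberate, as explained below.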
Each tweet in processed_tweets is a list of tokens. One important detail — we want to preserve the punctuation; I will explain later how it helps us generate more believable sentences. By the way, my code is loosely based on this article, so feel free to check it out later.
Now let’s talk about how our algorithm works. For Markov Chains, we need to define 2 things: states and transitions. How do we define a state? By analogy with the previous example, we could define our states as single words from the tweets, and transitions as the probability of each word_i following word_i-1. However, such an approach will not generate believable tweets, because a single word_i-1 does not give word_i sufficient context. To rectify this, we will define a parameter k that denotes how many preceding words we look at before predicting the next one. As a concrete example, here is our original tweet:
'Getting a little exercise this morning!'
And here’s the same tweet with k=2:
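The idea boils down to sliding a window of k tokens over the tweet; each window becomes a state. A quick sketch (`build_states` is a hypothetical helper, not the project’s code):

```python
def build_states(tokens, k=2):
    """Slide a window of k tokens over the tweet; each window is a state."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

tokens = ['getting', 'a', 'little', 'exercise', 'this', 'morning', '!']
print(build_states(tokens))
# [('getting', 'a'), ('a', 'little'), ('little', 'exercise'),
#  ('exercise', 'this'), ('this', 'morning'), ('morning', '!')]
```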
As we can see, the states cover every possible context in the tweet. Now that we can generate every possible state for each of the tweets, how do we actually predict new words? Let’s take a look at our example again. What’s the most likely word to follow the state “a little”? Well, there is only one word possible — “exercise”! Our new state is “little exercise”. Now imagine we have another tweet like this:
'A little kitty was crossing a street. A little kitty was scared.'
Now the word “kitty” follows the state “a little” 2 times, as opposed to 1 for “exercise”. Naturally, given these 2 tweets, we would predict “kitty” after the state “a little”, making the new state “little kitty”. And that is the core of Markov Chains: you start with a state, predict the most likely word to follow it, and your new state becomes the last word of the old state plus the newly predicted word. You continue until you have no words left to predict, or until you’ve reached the required length. That’s it!
For Markov Chains to work, however, you need a table in which to look up and count which words follow which states. So we create a matrix, where rows represent states and columns represent the words that follow these states.
The code above is self-explanatory. We create a sparse matrix with n_rows = # of distinct states and n_cols = # of distinct words (across all of the tweets, of course). One little trick: if you create the dok_matrix using the default constructor, you will run out of memory when you convert it into a numpy matrix. The solution? Use a different dtype! By default, the dtype is set to float64 (8 bytes per entry); switching to uint16 (2 bytes) cuts the space requirements by a factor of 4.
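As a sketch, the matrix construction looks roughly like this; the shapes here are hypothetical placeholders for the real state and word counts:

```python
import numpy as np
from scipy.sparse import dok_matrix

# Hypothetical sizes; in the real dataset these come from the number of
# distinct states and distinct words across all tweets.
n_states, n_words = 300_000, 50_000

# Default dtype is float64 (8 bytes per entry); uint16 (2 bytes) is
# plenty for transition counts and cuts memory by a factor of 4.
next_word_matrix = dok_matrix((n_states, n_words), dtype=np.uint16)
```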
Let’s convert our lists of states and words into dictionaries, so we can address specific state-word pairs in our matrix using the words and states themselves as keys.
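A sketch with a toy vocabulary (the states, words, and dictionary names here are placeholders):

```python
# Toy vocabulary standing in for the full one built from all tweets.
states = [('getting', 'a'), ('a', 'little'), ('little', 'exercise')]
words = ['getting', 'a', 'little', 'exercise']

state_to_idx = {state: i for i, state in enumerate(states)}
word_to_idx = {word: i for i, word in enumerate(words)}

# Now a cell can be addressed by name instead of raw indices:
row = state_to_idx[('a', 'little')]   # 1
col = word_to_idx['exercise']         # 3
```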
We are now ready to fill out our matrix. What should go in there? For each state-word pair, we count the number of times that word followed that specific state. We count over all of our tweets, of course. For example, if we have the state “a little” and the word “kitty”, their respective cell will be equal to 2 (from the previous example). The code below does exactly that:
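Here is a simplified, self-contained version of that counting step, using the toy tweets from the examples above:

```python
import numpy as np
from scipy.sparse import dok_matrix

k = 2
tweets = [
    ['getting', 'a', 'little', 'exercise', 'this', 'morning', '!'],
    ['a', 'little', 'kitty', 'was', 'scared', '.'],
    ['a', 'little', 'kitty', 'was', 'crossing', 'a', 'street', '.'],
]

# Vocabularies of states (k-grams) and words, with index lookups.
states = sorted({tuple(t[i:i+k]) for t in tweets for i in range(len(t) - k)})
words = sorted({w for t in tweets for w in t})
state_to_idx = {s: i for i, s in enumerate(states)}
word_to_idx = {w: i for i, w in enumerate(words)}

counts = dok_matrix((len(states), len(words)), dtype=np.uint16)

# For every state in every tweet, count the word that follows it.
for t in tweets:
    for i in range(len(t) - k):
        state = tuple(t[i:i+k])
        next_word = t[i + k]
        counts[state_to_idx[state], word_to_idx[next_word]] += 1

# ('a', 'little') is followed by 'kitty' twice and 'exercise' once:
print(counts[state_to_idx[('a', 'little')], word_to_idx['kitty']])     # 2
print(counts[state_to_idx[('a', 'little')], word_to_idx['exercise']])  # 1
```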
We are now ready to generate tweets!
sample_next_word_after_sequence takes a state and extracts the corresponding row of word counts. Each count is normalized into a probability — higher counts mean higher probabilities. Using those probabilities, we sample the next word.
stochastic_chain starts with a state and repeatedly calls sample_next_word_after_sequence to obtain new words. Each new word is appended to the existing chain, and the current state is shifted (transitioned). When the required chain length is reached, the algorithm terminates and the string is returned. Here is a sample tweet generated by this method:
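For reference, here is a simplified, self-contained sketch of both functions over a toy corpus; the real versions operate on the full count matrix built earlier:

```python
import numpy as np
from scipy.sparse import dok_matrix

# Tiny stand-in corpus and count matrix.
k = 2
tweets = [
    ['a', 'little', 'kitty', 'was', 'scared', '.'],
    ['a', 'little', 'exercise', 'this', 'morning', '!'],
]
states = sorted({tuple(t[i:i+k]) for t in tweets for i in range(len(t) - k)})
words = sorted({w for t in tweets for w in t})
state_to_idx = {s: i for i, s in enumerate(states)}
word_to_idx = {w: i for i, w in enumerate(words)}
idx_to_word = {i: w for w, i in word_to_idx.items()}

counts = dok_matrix((len(states), len(words)), dtype=np.uint16)
for t in tweets:
    for i in range(len(t) - k):
        counts[state_to_idx[tuple(t[i:i+k])], word_to_idx[t[i + k]]] += 1

def sample_next_word_after_sequence(state):
    """Turn the state's row of counts into probabilities and sample."""
    row = counts[state_to_idx[state], :].toarray().ravel().astype(float)
    return idx_to_word[np.random.choice(len(row), p=row / row.sum())]

def stochastic_chain(state, chain_length=10):
    """Repeatedly sample the next word, shifting the state window."""
    chain = list(state)
    for _ in range(chain_length):
        next_word = sample_next_word_after_sequence(state)
        chain.append(next_word)
        state = state[1:] + (next_word,)
        if state not in state_to_idx:   # dead end: no known continuation
            break
    return ' '.join(chain)

print(stochastic_chain(('a', 'little')))
```

On this toy corpus the chain reproduces one of the two tweets, since each state has a single continuation after the first step.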
'exclusive @realdonaldtrump updates ! we can all agree the u .s.a. can give america great ! see'
We can see that the chain is exactly 17 (15 + 2 initial) words long. There are 2 problems with this tweet. First, it ends abruptly; ideally we would want it to end on a period, so that each sentence is complete. Second, it still lacks randomness. The algorithm will only reproduce existing pieces and snippets of text from the tweets. It is not truly random…
Let’s address each of these concerns and modify our algorithm accordingly. How many words should each tweet have? 10? Likely. How about 50? Maybe… 100? Not so likely. In fact, we can model the probability of each length easily — we already have all the tweets at our disposal! In what follows I will use a CDF — the Cumulative Distribution Function.
The CDF evaluates the probability that X takes on a value less than or equal to x. In other words, as x gets larger and larger, the area “behind” it grows as well, increasing the probability. In the extreme case — when x is the largest value in our dataset — the probability of taking on that value or less is 1, because there is no other value out there to the right of it! We will use the same idea to decide the length of each tweet and terminate it as it grows longer and longer.
The code above sorts all the tweet lengths in our dataset in increasing order, calculates a cumulative sum, and rescales it into the 0–1 range.
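A sketch of that computation over a handful of hypothetical lengths (the `tweet_lengths` array stands in for len(tweet) over all processed tweets):

```python
import numpy as np

# Hypothetical tweet lengths, standing in for the real dataset.
tweet_lengths = np.array([7, 25, 12, 18, 12, 40])

# Sort, cumulate, and rescale into the 0-1 range: an empirical CDF.
lengths = np.sort(tweet_lengths)
cdf = np.cumsum(np.ones_like(lengths)) / len(lengths)

# P(length <= 18) in this toy sample: 4 of 6 tweets qualify.
p = cdf[np.searchsorted(lengths, 18, side='right') - 1]
print(p)  # 0.666...
```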
Now let’s address the second concern — the lack of randomness. How can we rectify it? Well, the simplest way is to allow a tiny chance of selecting a completely random word out of all available words, instead of picking from the predefined set of words that can follow the state. We can define a parameter alpha that will “fire” with a 0.1% chance every time we append a new word, and select a random word out of the entire vocabulary.
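A sketch of the alpha trick; the vocabulary and the `maybe_random_word` helper are illustrative, not the project’s actual names:

```python
import numpy as np

alpha = 0.001   # fires 0.1% of the time
words = ['great', 'america', 'fake', 'news', '!']  # toy vocabulary

def maybe_random_word(sample_from_state):
    """With probability alpha, ignore the learned transition counts and
    pick a uniformly random word; otherwise sample as usual."""
    if np.random.random() < alpha:
        return np.random.choice(words)
    return sample_from_state()

# 99.9% of the time this returns the Markov-sampled word.
word = maybe_random_word(lambda: 'great')
```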
With both these things addressed, here’s our final algorithm to generate fake tweets:
The only major difference is the alpha parameter and this line
elif next_word in list(“.!?”):
which, together with our CDF, checks whether it’s time to terminate the sentence. In what follows, I will show some funny tweets I was able to generate with this algorithm.
Here are some examples with k=3:
As we can see, increasing k can lead to more coherent sentences. At the same time, each state is observed fewer times, which limits the possible continuations and, as a result, requires a larger dataset.
That’s it! In the next article, we will talk about more advanced text generation strategies. I hope you liked what you saw. Thank you for reading!