This is part 2 of my series of articles on generating fake Trump tweets. You can read part 1 here, where I used Markov Chains to generate fake tweets.
In this article we are going to use an LSTM for even better results. LSTM (Long Short-Term Memory) is a special kind of RNN with the ability to memorize long-term dependencies by preserving state across cells. You can read more about how it actually works in this excellent article (seriously, it’s amazing, go read it). LSTM is widely used in a variety of tasks such as translation, speech recognition, time-series prediction, and text generation, which is what we will do in this article.
The preprocessing steps are identical to those from the previous part. Namely, we remove hyperlinks, special symbols, and tokenize the tweets.
The length of preprocessed tweets ranges from 0 to 100 tokens, with the majority of them concentrated around the 20–30 token mark.
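The exact cleanup lives in part 1, but a rough sketch of this kind of preprocessing looks like the following (the regexes here are illustrative approximations, not the originals):

```python
import re

def preprocess(tweet: str) -> list[str]:
    # Drop hyperlinks and any special symbols outside a small allowed set
    tweet = re.sub(r"https?://\S+", "", tweet)
    tweet = re.sub(r"[^a-zA-Z0-9#@'.,!?\s]", "", tweet)
    # Keep punctuation as separate tokens so the model can learn sentence boundaries
    tweet = re.sub(r"([.,!?])", r" \1 ", tweet)
    return tweet.lower().split()

print(preprocess("Crooked media! https://t.co/abc123 Sad."))
# → ['crooked', 'media', '!', 'sad', '.']
```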
Before feeding the data to our NN, we need to generate sequences and labels. The core idea behind text predictions is the following: given a sequence of n words, what is the most likely word to follow? To realize that idea, similarly to what we did for Markov Chains, we are going to generate sequences of tokens and their corresponding labels.
Keras provides a variety of ways to manipulate textual data. In this case, we use its Tokenizer to first generate unique ids for each token, and then iterate over our tweets to generate all possible sequences of ids starting from the beginning of each tweet.
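As a sketch, the sequence generation can look like this (a two-tweet toy corpus stands in for the full dataset):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tweets = ["make america great again", "fake news media"]  # toy stand-in corpus

tokenizer = Tokenizer()
tokenizer.fit_on_texts(tweets)  # assigns a unique integer id to each token

sequences, labels = [], []
for ids in tokenizer.texts_to_sequences(tweets):
    # every prefix of length >= 2 becomes one training example
    for i in range(2, len(ids) + 1):
        sequences.append(ids[:i - 1])  # input: all tokens but the last
        labels.append(ids[i - 1])      # label: the last token
```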
The labels are simply the last word in a sequence (the prediction target). Here is what a slice of our data looks like so far:
[212, 6] 92
[212, 6, 92] 26
[212, 6, 92, 26] 380
[212, 6, 92, 26, 380] 1079
[212, 6, 92, 26, 380, 1079] 25
For example, given that we’ve seen the sequence [212, 6, 92], the most likely token to follow that sequence is 26.
For the algorithm to work, each sequence has to be of the same length. In practice that means we have to pad the sequences with zeros.
array([[ 212, 0, 0, ..., 0, 0, 0],
[ 212, 6, 0, ..., 0, 0, 0],
[ 212, 6, 92, ..., 0, 0, 0],
[4996, 403, 104, ..., 0, 0, 0],
[4996, 403, 104, ..., 0, 0, 0],
[4996, 403, 104, ..., 0, 0, 0]])
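This padding can be done with Keras’s pad_sequences; a minimal sketch (maxlen=99 matches the longest training sequence, and padding='post' appends the zeros at the end, as in the array above):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[212], [212, 6], [212, 6, 92]]  # toy slice of the real data
# padding='post' puts the zeros after the tokens rather than before them
X = pad_sequences(sequences, maxlen=99, padding='post')
print(X.shape)  # (3, 99)
```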
Labels have to be converted to one-hot representations too… however, the size of such a matrix would be huge: the total # of examples times the # of unique tokens.
MemoryError: Unable to allocate 255. GiB for an array with shape (1188946, 57668) and data type float32
To mitigate that, you can save a lot of space by adding an optional dtype='int8' argument. But I will take a different approach by changing the loss function.
We are now ready to define our model.
The first layer is an Embedding layer — it takes a sequence and outputs an embedding of shape (99, 256): each token is represented by a vector of 256 numerical values. mask_zero=True tells the layer that zeros were added for padding.
LSTM is a layer with 256 units that accepts a 3D tensor as input in the form (batch, timesteps, features). With the Embedding layer added, the input to the LSTM layer will be (4096, 99, 256). I added a dropout rate of 0.3 here, although it is not strictly necessary. stateful is set to False so we don’t preserve state between batches. Set it to True when your text is continuous, so that your current sequences are related to the previous ones.
The final layer outputs the probability of each word to follow a given sequence.
Note that by using loss='sparse_categorical_crossentropy' we don’t have to one-hot encode our y anymore, which saves a lot of space.
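Putting these layers together, the model can be sketched as follows (hyperparameters as described above; with a vocabulary of 57,668 tokens — the second dimension from the MemoryError earlier — this configuration yields exactly 30,108,996 trainable parameters):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 57668  # number of unique tokens
embed_dim = 256

model = Sequential([
    # mask_zero=True makes downstream layers ignore the padded zeros
    Embedding(vocab_size, embed_dim, mask_zero=True),
    # 256 units; the dropout here is optional
    LSTM(256, dropout=0.3, stateful=False),
    # probability of each token in the vocabulary being the next one
    Dense(vocab_size, activation='softmax'),
])
# the sparse loss lets us keep integer labels instead of one-hot vectors
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
```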
adam is a great optimizer for a variety of tasks. We’ll train our model for 30 epochs with a batch size of 4096. The model has 30,108,996 trainable parameters and takes about 30 minutes to train on my machine with GPU training enabled.
Once the model is trained, we can generate our predictions.
The function above takes an initial seed and a desired length, predicts the most likely token based on the current sequence, appends that token to the sequence, and runs the prediction again.
id_to_word is a reverse-lookup dictionary for tokens.
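A sketch of such a generation function might look like this (the name generate and its exact signature are illustrative; tokenizer, model, and id_to_word come from the earlier steps):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate(model, tokenizer, id_to_word, seed: str, length: int) -> str:
    # Convert the seed text into token ids
    ids = tokenizer.texts_to_sequences([seed])[0]
    for _ in range(length):
        # Pad to the training sequence length before predicting
        padded = pad_sequences([ids], maxlen=99, padding='post')
        probs = model.predict(padded, verbose=0)[0]
        next_id = int(np.argmax(probs))  # greedily pick the most likely token
        ids.append(next_id)              # extend the sequence and repeat
    return ' '.join(id_to_word[i] for i in ids)
```

Note the greedy argmax here: sampling from the probability distribution instead would make the output less repetitive.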
Let’s see what Mr. President has to say!
democrat senators are doing a great job . i am notdemocratic states , the democrats are not going to be a total disaster .republican senators have a great job for the great state of texas . he will be a great governor ! #maga #kag and , @senatorheitkamp. and , others , the people gop senators must stop the flights from the united states . obama ’s campaign is a total disaster . biden has been a total disaster . i will be back soon ! #maga #kag #tcot @foxbusiness oh well , i ’m not going to be a total mess .
Only a NN could reveal his deep admiration for Democrats. I’m glad that we kept punctuation as separate tokens — the output reads like coherent sentences because of it.
Below I trained a slightly different model to see what happens.
Here’s a new set of predictions:
republican senators are working hard to get the job done in the senate . we have a great state and , great healthcare ! we need strong borders and crime !obama is a disaster for the people . he is a disaster . he is a great guy . he is a winner . he is a winner . he is a winner . he is a winner . he is a great guy and a great guy . he will be missed !
bernie sanders is lying to the people of the united states . he is a total mess . he is a total mess . he is a total mess . he is a total mess ! he is a total mess ! he is a corrupt politician ! a total witch hunt ! no collusion , no obstruction . the dems don ’t want to do it . he is a corrupt politician ! he is a corrupt politician ! he is a corrupt politician ! he is strong on crime , borders , and , the enemy of the people !democrats stole election results . they are a disgrace to our country , and , we will win ! gop senators are working hard on the border crisis . the dems are trying to take over the border . they are now trying to take away our laws . biden will bring back our country , and we are going to win the great state of texas . we need you in a second election .
While the predictions are good enough, here are a few areas we can improve upon:
- Garbage in, garbage out: 90% of success stems from good data, and more careful preprocessing could be done. For instance, you could try removing hashtags: I found that predictions always fall into a “vicious circle” of hashtags when the model doesn’t know what to predict, simply outputting a ton of unrelated hashtags, which obviously doesn’t have much value. Another thing to try would be to drop tweets that are too short or too long.
- Model architecture. I was hoping to achieve better results with a deeper NN with fewer units, but apparently a shallower, wider NN worked better for me. You can experiment with the number of layers, number of units, and dropout rate.
- Replace the Embedding layer with actual word embeddings, either pretrained (e.g. GloVe) or trained on your own dataset (e.g. word2vec).