An Analysis of Red Hot Chili Peppers' Lyrics Using NLP

Miroslav Tushev
Apr 7, 2021

I heard Red Hot Chili Peppers for the first time about 10 years ago and instantly became obsessed with them. They were (and still are) my #1 favorite band, right after Metallica, which I loved during my teenage years. Although many people define RHCP as a typical "funk" band, their musical style has evolved significantly over the years. Early (pre-1999) RHCP songs stood out for their explosive slap-bass riffs, while the post-1999 Peppers became more melancholic. Does the same relationship hold for their lyrics, though? In this blog I will apply common NLP techniques to answer this question (full code is available on my GitHub).

For this analysis I crawled azlyrics.com and downloaded RHCP’s lyrics, organized by album. This resulted in every album from 1984 to 2016:

import os
import re

# sort albums by year (directory names end with the year in parentheses)
albums = sorted([d for d in os.listdir(os.curdir) if os.path.isdir(d) and d.endswith(')')],
                key=lambda x: re.search(r"(\d{4})", x).group(0))
albums
['The Red Hot Chili Peppers (1984)',
'Freaky Styley (1985)',
'The Uplift Mofo Party Plan (1987)',
"Mother's Milk (1989)",
'Blood Sugar Sex Magik (1991)',
'One Hot Minute (1995)',
'Californication (1999)',
'By The Way (2002)',
'Stadium Arcadium (2006)',
"I'm With You (2011)",
"I'm With You Sessions (2013)",
'The Getaway (2016)']

We’ll begin with typical text preprocessing steps. For this analysis I removed punctuation, converted to lower case, removed English stop-words and lemmatized the words. The reason why I chose lemmatization over stemming is simple: to preserve the meaning of the words so I can examine them later. In projects where the meaning doesn’t matter that much, however, we can apply stemming.
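Here is a minimal sketch of such a pipeline with NLTK (the exact preprocess function lives in the repo; treat this as an approximation):

import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# assumes nltk's 'stopwords', 'punkt', and 'wordnet' data are downloaded
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # strip punctuation and lower-case
    text = text.translate(str.maketrans('', '', string.punctuation)).lower()
    # tokenize, drop stop-words, lemmatize the rest
    return [lemmatizer.lemmatize(w) for w in word_tokenize(text) if w not in stop_words]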

To start off, let’s display the number of unique words per album:
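The counting itself boils down to something like this (a sketch; it assumes lyrics_per_album, which appears later in the post, is a list of raw song strings per album):

unique_words = {}
for album, songs in zip(albums, lyrics_per_album):
    tokens = [tok for song in songs for tok in preprocess(song)]
    unique_words[album] = len(set(tokens))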

Looks like Stadium Arcadium is the most sophisticated album, because it uses the largest number of unique words! Or is it? Let's weight them by the total word count.
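Normalizing takes one more pass over the tokens (again a sketch, building on the counts above):

totals = {album: len([tok for song in songs for tok in preprocess(song)])
          for album, songs in zip(albums, lyrics_per_album)}
ratios = {album: unique_words[album] / totals[album] for album in unique_words}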

Let's visualize them too:

Now the picture doesn't look so unambiguous anymore. In fact, the unique-to-total ratio goes down from One Hot Minute to I'm With You! This is interesting, because RHCP produced their best songs during that era, arguably of course. Take a look at their discography rankings on Wikipedia. From Californication on, they absolutely dominated the charts across different countries. Looks like they didn't need too many different words to write amazing music, though.

By The Way, if you're wondering how I plotted this: it's simply the seaborn package with the 'fivethirtyeight' matplotlib style.
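In code, that is roughly (a sketch using the ratios computed above):

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(x=list(ratios.values()), y=list(ratios.keys()), ax=ax)
ax.set(xlabel='unique / total words', ylabel='')
plt.tight_layout()
plt.show()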

Now let's take a look at the most used words, by total count, per album:
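A simple way to get these counts (a sketch with collections.Counter):

from collections import Counter

for album, songs in zip(albums, lyrics_per_album):
    counts = Counter(tok for song in songs for tok in preprocess(song))
    print(album, counts.most_common(5))  # top-5 most frequent lemmas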

Too bad. Lots of uninformative words, such as "get" or "know". Although some songs are clearly recognizable (e.g. "party", "want", "pussy" from Special Secret Song Inside). Let's add a level of sophistication and apply TF-IDF.

For TF-IDF, I concatenate the songs on each album into a single string, so each album becomes one document.
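The corpus-building step isn't shown, but it amounts to something like this (a sketch; the already-preprocessed tokens are joined back into strings, which is why the vectorizer below has its own preprocessing disabled):

tfidf_corpus = [' '.join(' '.join(preprocess(song)) for song in songs)
                for songs in lyrics_per_album]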

from sklearn.feature_extraction.text import TfidfVectorizer

# preprocessing is disabled because the corpus is already cleaned
tfidf = TfidfVectorizer(preprocessor=None, stop_words=None)
X = tfidf.fit_transform(tfidf_corpus)
X.A.shape
(12, 4007)

We have 12 rows, one per album, and about 4,000 unique words. The following code will generate the top-5 words with the highest TF-IDF scores, per album:

import pandas as pd

# invert the vocabulary (word -> column index becomes index -> word)
vocab = {v: k for k, v in tfidf.vocabulary_.items()}
vocab = sorted(vocab.items(), key=lambda x: x[0])  # order words by column index
vocab_df = pd.DataFrame(X.A, columns=[c[1] for c in vocab], index=albums)

# RHCP is the per-album summary DataFrame built earlier
for i, a in enumerate(albums):
    largest = vocab_df.iloc[i].nlargest(5)
    RHCP.loc[a, 'tfidf:word #1'] = largest.index[0]
    RHCP.loc[a, 'tfidf:word #2'] = largest.index[1]
    RHCP.loc[a, 'tfidf:word #3'] = largest.index[2]
    RHCP.loc[a, 'tfidf:word #4'] = largest.index[3]
    RHCP.loc[a, 'tfidf:word #5'] = largest.index[4]
RHCP.iloc[:, 3:]

May I say I LOVE the fact that the word "californication" finally appeared on the Californication album! The way it should be. Looks like TF-IDF picked up on at least three different songs from Stadium Arcadium: Hump de Bump, Readymade, and Torture Me. We also have the word "funky" appearing for Mother's Milk. Does that mean it's the funkiest album? Well, I don't know about you, but Mother's Milk was the first album I heard from RHCP back in the day. In fact, if you're interested, check out my video where I played all the popular bass riffs from that album:

Anyway, TF-IDF clearly helped us gain some insights about each album. Useful as it is, though, we would love something that generalizes and picks up an overarching theme from each album, and TF-IDF clearly cannot do that. So how about LDA?

The main problem with LDA (or its main feature, depending on how you look at it) is that LDA needs lots of data to separate topics well. In fact, this paper claims the following:

The number of documents plays perhaps the most important role; it is theoretically impossible to guarantee identification of topics from a small number of documents, no matter how long.

Looks like our 12 albums won’t cut it. So let’s split them back into individual songs and hope for the best.

from gensim.corpora.dictionary import Dictionary
import gensim

lda_corpus = []
for album in lyrics_per_album:
    for song in album:
        lda_corpus.append(preprocess(song))

Sometimes it's useful to generate various n-grams for LDA; it just tends to work better this way. "New_York" makes more sense than "New" and "York" separately.

# merge frequent bigrams/trigrams into single tokens ("new" + "york" -> "new_york")
bigram = gensim.models.Phrases(lda_corpus, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[lda_corpus], min_count=5, threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
lda_corpus = [trigram_mod[bigram_mod[doc]] for doc in lda_corpus]

# map tokens to ids and build the bag-of-words corpus
dictionary = Dictionary(lda_corpus)
bow_corpus = [dictionary.doc2bow(text) for text in lda_corpus]

Let’s naively assume that every album has that “overarching theme” and that LDA can separate them perfectly into topics (I know, I know). We set the num_topics to 12:

from gensim.models.ldamulticore import LdaMulticore
from gensim.models import CoherenceModel

model = LdaMulticore(bow_corpus, num_topics=12, id2word=dictionary, workers=7,
                     passes=10, per_word_topics=True, random_state=1)
coherence_model = CoherenceModel(model=model, corpus=bow_corpus, texts=lda_corpus, coherence='c_v')
print(coherence_model.get_coherence())
0.33568880799556466

Is it a lot? Well, let’s examine the topics themselves:

from pprint import pprint
pprint(model.print_topics())
[(0,
'0.020*"get" + 0.017*"love" + 0.016*"like" + 0.012*"take" + 0.011*"come" + '
'0.010*"could" + 0.009*"another" + 0.008*"make" + 0.007*"back" + 0.007*"go"'),
(1,
'0.022*"say" + 0.020*"know" + 0.018*"get" + 0.017*"go" + 0.016*"yeah" + '
'0.015*"come" + 0.014*"hey" + 0.013*"want" + 0.011*"oh" + 0.011*"like"'),
(2,
'0.046*"love" + 0.014*"make" + 0.014*"see" + 0.012*"around" + 0.010*"get" + '
'0.010*"want" + 0.009*"know" + 0.009*"way" + 0.008*"never" + 0.008*"go"'),
(3,
'0.022*"get" + 0.017*"know" + 0.014*"come" + 0.013*"want" + 0.013*"girl" + '
'0.012*"go" + 0.011*"like" + 0.009*"friend" + 0.008*"call" + 0.008*"walk"'),
(4,
'0.054*"get" + 0.013*"know" + 0.011*"give" + 0.010*"say" + 0.010*"tell" + '
'0.009*"right" + 0.009*"one" + 0.009*"love" + 0.009*"come" + 0.009*"make"'),
(5,
'0.029*"around" + 0.023*"look" + 0.018*"please" + 0.011*"know" + 0.010*"get" '
'+ 0.009*"way" + 0.009*"never" + 0.009*"torture" + 0.009*"far" + '
'0.008*"like"'),
(6,
'0.026*"get" + 0.020*"turn" + 0.014*"well" + 0.012*"yeah" + 0.009*"yes" + '
'0.009*"sale" + 0.008*"like" + 0.008*"blood" + 0.008*"true_men_kill_coyote" '
'+ 0.008*"know"'),
(7,
'0.014*"love" + 0.013*"time" + 0.012*"like" + 0.012*"oh" + 0.012*"get" + '
'0.011*"know" + 0.011*"come" + 0.010*"go" + 0.009*"sing" + 0.008*"want"'),
(8,
'0.020*"know" + 0.017*"get" + 0.015*"let" + 0.011*"time" + 0.010*"go" + '
'0.009*"man" + 0.009*"like" + 0.008*"say" + 0.008*"want" + 0.008*"girl"'),
(9,
'0.020*"get" + 0.015*"long" + 0.014*"make" + 0.013*"come" + 0.012*"gon_na" + '
'0.012*"time" + 0.009*"fall" + 0.009*"baby" + 0.009*"know" + 0.008*"wo"'),
(10,
'0.028*"want" + 0.024*"party_pussy" + 0.021*"yeah" + 0.020*"baby" + '
'0.019*"good" + 0.013*"get" + 0.011*"say" + 0.008*"make" + 0.008*"take" + '
'0.007*"god"'),
(11,
'0.029*"away" + 0.018*"take" + 0.017*"like" + 0.016*"get" + 0.015*"know" + '
'0.012*"make" + 0.011*"never" + 0.009*"thing" + 0.008*"see" + 0.008*"say"')]

I don’t know about you, but I cannot figure out which one is which. It’s definitely not separating them based on the albums. Rather, it’s separating them based on the songs again — take a look at topic 10 for example.

LDA has 3 main hyperparameters: the number of topics, alpha, and eta. Let's tune them and see if we can get something meaningful. I use skopt for hyperparameter tuning, with Bayesian optimization using Gaussian processes. We'll do 100 iterations and see if we can improve on that coherence score.

from skopt.space import Real, Integer
from skopt.utils import use_named_args
from skopt import gp_minimize

space = [
    Integer(3, 20, name='num_topics'),
    Real(0.01, 1, name='alpha'),
    Real(0.01, 1, name='eta'),
]

@use_named_args(space)
def objective(**params):
    model = LdaMulticore(bow_corpus, id2word=dictionary, workers=7, passes=30,
                         per_word_topics=True, random_state=1,
                         num_topics=params['num_topics'],
                         alpha=params['alpha'],
                         eta=params['eta'])
    coherence_model = CoherenceModel(model=model, corpus=bow_corpus, texts=lda_corpus, coherence='c_v')
    # negate: gp_minimize minimizes, but we want to maximize coherence
    return -coherence_model.get_coherence()

res_gp = gp_minimize(objective, space, n_calls=100, n_jobs=-1, verbose=True)
print("Best coherence: {:.2f}".format(res_gp.fun))
print("Best parameters: {}, {:.2f}, {:.2f}".format(res_gp.x[0], res_gp.x[1], res_gp.x[2]))
Best coherence: -0.58
Best parameters: 16, 0.85, 1.00

Well, looks like RHCP cover more than 12 topics in their songs (duh…). Also, our coherence score went up, from 0.34 to 0.58 (the printed value is negative only because the objective returns negative coherence). Let's train the model with these parameters and visualize our results.
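The retraining step isn't shown in the original snippet, but it is presumably something along these lines (reusing the name model so the rest of the code picks it up):

model = LdaMulticore(bow_corpus, id2word=dictionary, workers=7, passes=30,
                     per_word_topics=True, random_state=1,
                     num_topics=res_gp.x[0],  # 16
                     alpha=res_gp.x[1],       # 0.85
                     eta=res_gp.x[2])         # 1.00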

I use t-SNE for visualization, which compresses multi-dimensional data into the required number of dimensions (in our case, 2). Each color represents a separate topic. It looks like one category dominates a lot! Let's plot them by their frequency.
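The t-SNE input and the frequency counts can be produced along these lines (a sketch; scat is the DataFrame the plotting code at the end relies on):

import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

# per-song topic distributions -> matrix of shape (n_songs, n_topics)
topic_weights = np.zeros((len(bow_corpus), model.num_topics))
for i, doc in enumerate(bow_corpus):
    for topic_id, weight in model.get_document_topics(doc):
        topic_weights[i, topic_id] = weight

# compress to 2 dimensions; 'category' is each song's dominant topic
xy = TSNE(n_components=2, random_state=1).fit_transform(topic_weights)
scat = pd.DataFrame({'x': xy[:, 0], 'y': xy[:, 1],
                     'category': topic_weights.argmax(axis=1)})
scat['category'].value_counts().plot(kind='bar')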

Category 0 and 4 are the most common. What are their corresponding words?

(0,
 '0.016*"know" + 0.016*"get" + 0.015*"love" + 0.014*"like" + 0.011*"make" + '
 '0.011*"want" + 0.010*"say" + 0.010*"come" + 0.009*"go" + 0.009*"take"')
(4,
 '0.032*"get" + 0.009*"time" + 0.008*"love" + 0.007*"come" + 0.007*"give" + '
 '0.006*"know" + 0.006*"good" + 0.006*"tell" + 0.006*"say" + 0.005*"right"')

Hmm… it still doesn't look too different to me. Maybe ~190 documents is still too small a number for LDA to separate topics. In fact, when I don't set the random state, LDA generates a different set of topics every run, which suggests that ~190 documents is not enough for it to converge. Finally, let's plot the topics for each album and see if we can spot any trends.
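Plotting topics per album can be done roughly like this (a sketch; it assumes the songs in scat follow the same album order as lyrics_per_album):

song_counts = [len(album) for album in lyrics_per_album]
scat['album'] = np.repeat(albums, song_counts)
# share of each dominant topic within each album, as a stacked bar chart
pd.crosstab(scat['album'], scat['category'], normalize='index').plot(
    kind='bar', stacked=True, figsize=(12, 6))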

Looks like the early Peppers were pretty diverse in terms of their lyrics. We can see more categories in the early albums, and only two in The Getaway.

Here are some other topics LDA generated:

(11,
'0.003*"king" + 0.003*"minor" + 0.003*"repeat" + 0.002*"backwoods" + '
'0.002*"yertle_turtle" + 0.002*"otherside" + 0.002*"throw" + '
'0.002*"television" + 0.002*"turtle" + '
'0.002*"dream_californication_dream_californication"')
(13,
'0.004*"blood" + 0.004*"well" + 0.003*"true_men_kill_coyote" + 0.003*"yeah" + '
'0.002*"feel" + 0.002*"true_men" + 0.002*"gon_na" + 0.002*"hollywood_hill" + '
'0.002*"dig_dirt_dig_dust" + 0.002*"paisley_dragon_hollywood_hill"')
(2,
'0.002*"stretch" + 0.001*"love" + 0.001*"dirty" + 0.001*"stain" + '
'0.001*"bird" + 0.001*"thirty" + 0.001*"earthworm" + 0.001*"burping" + '
'0.001*"chirp" + 0.001*"curb"')

Finally, let's check one theory. When Anthony Kiedis was discussing the recording of the Stadium Arcadium album, he said the following about their hit "Dani California" (watch from 0:30):

Basically, when writing “Dani California”, Anthony realized that it’s actually a continuation of a story he started earlier in “Californication” (1999) and continued in “By The Way” (2002). Let’s see if these 3 songs share any similarities in terms of lyrics!
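Before eyeballing the t-SNE plot, one quick check we can run is cosine similarity between the three songs' bag-of-words vectors (a sketch; the row indices are the ones used in the plotting code below, and I'm assuming they line up with the corpus order):

from gensim.matutils import corpus2dense
from sklearn.metrics.pairwise import cosine_similarity

# densify the bag-of-words corpus: rows = songs, columns = dictionary terms
dense = corpus2dense(bow_corpus, num_terms=len(dictionary)).T
idx = [89, 109, 131]  # Californication, By The Way, Dani California
print(cosine_similarity(dense[idx]))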

scat_new = scat.copy()
scat_new.loc[:, 'Label'] = "None"
scat_new.loc[89, 'Label'] = 'Californication'
scat_new.loc[109, 'Label'] = 'By The Way'
scat_new.loc[131, 'Label'] = 'Dani California'

fig, ax = plt.subplots(figsize=(10, 10))
ax = sns.scatterplot(data=scat_new, x='x', y='y', hue='category', legend=None, s=100)
for row in scat_new.iterrows():
    if row[1]['Label'] != "None":
        ax.text(row[1]['x'] + .02, row[1]['y'], str(row[1]['Label']))
ax.set(xlabel='', ylabel='')
plt.show()
plt.clf()

They couldn't be further apart. With that, I will conclude my analysis. Please comment below on what other techniques might have given better results. Thank you for reading!
