Rap Skills Analysis: A Computational Approach, Part 2

Part 2: Analysing Lyrics

As a recap, the two metrics I'm going to target are verbosity and rhyme density: the fraction of each rapper's first 20,000 words that are unique, and the fraction that rhyme with each other. The first part is MUCH simpler than the second, as we shall see.

2.1 Unique Words

We can do a basic word count pretty easily. The only complication comes from deciding what counts as a word. While this might sound easy, it hides a surprising amount of difficulty. As a simple example, are words joined by a hyphen one word or two? The headache can be avoided entirely by using a third-party word tokeniser (an algorithm that splits text into words). The most renowned Python module for this kind of analysis is the Natural Language Toolkit (NLTK). Note that to use its tokeniser you will have to download some additional data on top of the pip install (see the NLTK website for documentation).
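
For reference, the tokeniser models can be fetched from within Python. As far as I know the resource that word_tokenize needs is called 'punkt', but check the NLTK documentation in case the name has changed in your version:

import nltk
nltk.download('punkt')  # tokeniser models required by word_tokenize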

from nltk.tokenize import word_tokenize

punc = [',','.', '/', '\'', '\"', '\\', '$', ';', ':', '!', '?', '-',  '#', '(', ')']

def word_freq_list(text_file):
    '''Take an input .txt file and perform a unique word count'''
    words = []
    counts = []
    total_words = 0
    with open(path + text_file, 'r') as thefile:  # 'path' should point at the folder of saved lyric files
        for line in thefile:
            for word in word_tokenize(line.lower()):
                if word not in words:
                    if word not in punc:  # ignore punctuation tokens entirely
                        words.append(word)
                        counts.append(1)
                        total_words += 1
                else:
                    counts[words.index(word)] += 1
                    total_words += 1

                if total_words >= 20000:
                    # stop at 20,000 words and sort (word, count) pairs by descending frequency
                    return sorted(zip(words, counts), key=lambda x: -x[1])
    return 'Not enough words'

Note that we want to exclude all punctuation, and to avoid counting capitalised and lowercase versions of the same word as two different words. Hence we check each token against the punctuation list and call .lower() on every line.
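
To turn this into the verbosity metric we simply divide the number of unique words by the 20,000-word total. A minimal sketch, with 'rapper.txt' standing in for one of the saved lyric files:

freq_list = word_freq_list('rapper.txt')

if freq_list != 'Not enough words':
    unique_words = len(freq_list)  # one (word, count) pair per unique word
    print('Verbosity:', unique_words / 20000)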

2.2 Finding the Rhyme Density

2.2.1 Identifying Word Sounds

Ok, that was simple enough! Now for the tricky part: rhyme sounds. How do we determine whether two English words rhyme? Well, that's hard. English is notoriously difficult, full of odd spellings and unintuitive sounds. So... we're not going to worry about it. We'll just use some third-party software called eSpeak, which converts words into their phonetic spelling so that we can compare the sounds of words directly.

You can go on over to their website and install the software. It's a command-line tool, so make sure to add the eSpeak directory to your system path to allow command-line calls. If you're running a Linux machine, a simple

>>> sudo apt-get install espeak

should do the trick. Once that's done, we can convert text into its phonetic form with the following call:

>>> espeak -q --ipa -v en-us <text> 

We can use the check_output() function from Python's subprocess module to access the command line. Here's an example call and its expected output:

from subprocess import check_output

line =  "I'm still not a player but you still a hater,\nElevator to the top hah see you later\n"
line_phonetic = check_output(["espeak", "-q", "--ipa", "-v", "en-us", line]).decode('utf-8')

print('Original line: \n', line)
print('Phonetic Version: \n', line_phonetic)
Original line: 
 I'm still not a player but you still a hater,
Elevator to the top hah see you later

Phonetic Version: 
  aɪm stˈɪl nˌɑːɾə plˈeɪɚ bˌʌt juː stˈɪl ɐ hˈeɪɾɚ
 ˈɛlɪvˌeɪɾɚ tə ðə tˈɑːp hˈɑː sˈiː juː lˈeɪɾɚ

Since we are interested in rhyming sounds (and are going to be lenient about half-rhymes), we really only care about the vowel sounds. So the first thing we want to do is take our lyrics and convert them into an ordered list of vowel sounds, stripped of all consonants. An exhaustive list of the possible English vowel sounds is shown below.

vowels_sounds = ['aɪ','aʊ','eɪ','i','oʊ','oː','uː','æ', 'ɑː',
                 'ɔ','ɔɪ','ɔː','ɛ','ɜː','ɪ','ʊ','ʌ', 'ɚ', 'ə']

Notice that some are two characters long, and that the first character of a two-character vowel sound sometimes also exists as its own separate one-character vowel sound. We can therefore split the possibilities into two lists: possible first characters and possible second characters:

first = ['a','e','i','o','u','ɐ','æ','ɑ','ɒ',
         'ɔ','ɛ','ɜ','ɪ','ʊ','ʌ','ɚ', 'ə']
second = ['ɪ','ʊ','ː']

We can use these two lists to extract all the vowels from a phonetic string:

def extract_vowels(phonetic_text):
    '''Pull the ordered vowel sounds out of a list of phonetic words'''
    phonetic_vowels = []

    for word in phonetic_text:
        i = 0
        while i < len(word):
            if word[i] in first:
                vowel = word[i]
                i += 1
                # two-character vowel sound: absorb the second character too
                if i < len(word) and word[i] in second:
                    vowel += word[i]
                    i += 1
                phonetic_vowels.append(vowel)
            else:
                i += 1
    return phonetic_vowels

vowel_list = extract_vowels(line_phonetic.split())
print(line, '\n', line_phonetic, '\n', vowel_list)
I'm still not a player but you still a hater,
Elevator to the top hah see you later

  aɪm stˈɪl nˌɑːɾə plˈeɪɚ bˌʌt juː stˈɪl ɐ hˈeɪɾɚ
 ˈɛlɪvˌeɪɾɚ tə ðə tˈɑːp hˈɑː sˈiː juː lˈeɪɾɚ

 ['aɪ', 'ɪ', 'ɑː', 'ə', 'eɪ', 'ɚ', 'ʌ', 'uː', 'ɪ', 'ɐ', 'eɪ', 'ɚ', 'ɛ', 'ɪ', 'eɪ', 'ɚ', 'ə', 'ə', 'ɑː', 'ɑː', 'iː', 'uː', 'eɪ', 'ɚ']

We now want a function that runs through a whole lyrics file and outputs its vowel list. However, the command line only accepts a limited number of characters per call, so the text has to be sent to eSpeak in chunks. My function is shown below:

def find_phonetic_vowels(text_file):
    '''Strip the text of consonants and output a vowel list'''
    with open(path + text_file, 'r') as f:
        original_text = f.read().replace('\n', ' ')

    n = 30000  # we're limited to roughly 30,000 characters per call, so work in chunks
    chunks = [original_text[i:i+n] for i in range(0, len(original_text), n)]
    phonetic_text = []
    for chunk in chunks:
        # extend (not append) so we build one flat list of phonetic words
        phonetic_text.extend(check_output(["espeak", "-q", "--ipa", "-v", "en-us", chunk]).decode('utf-8').split())

    return extract_vowels(phonetic_text)

2.2.2 Sorting Out Rhymes

Ok. So for every rapper we now have one huge list of every vowel sound they use in their first 20,000 words. How can we analyse this in order to say something useful about how densely they rhyme? Some things to note:

  1. Repeated words aren't really rhymes. We've tried pretty hard to get rid of repeated lines, but in the format we have so far we've lost all information about which original words were linked to each vowel sound. This is a bit of a flaw, really: if 2 Chainz just shouts 'Money Bitch, Money Bitch!' we have no way of knowing that it isn't actually a neat little double rhyme. Sorry... improvements to suggest?
  2. Multi rhymes sound better than single rhymes. 'Multis', as they are known, are words or phrases that stretch over several syllables while maintaining an identical rhyming pattern. Whatever system we design should reward lyrics that include long stretches of perfect rhymes.
  3. When do you cut off? Two words separated by 20 lines that happen to rhyme clearly do so by mere chance. How far back should we allow two vowel sounds to be linked?

The process that I have designed goes like this:

  1. Take in a new vowel sound.
  2. Search back through the 30 previous vowel sounds for a match. If there is no match, move on to the next vowel sound.
  3. If there is a match, look at the next vowel sound in the list and check whether the combination of these two vowel sounds appears anywhere in the 30 previous vowel sounds. If it does, add the next vowel and repeat. If it doesn't, record the vowel(s) without this extension and continue.
  4. When no more vowels are left, return a list of lists containing all the vowels and vowel sequences that have been determined to rhyme.

Note that part of this process involves searching for a sub-list within a list. As far as I could tell there's no built-in Python function for this, so I had to write my own. It is shown below.

def is_it_in(sub_list, main_list):
    '''Return True if sub_list appears as a contiguous slice of main_list'''
    for i in range(len(main_list)):
        if main_list[i] == sub_list[0] and main_list[i:i+len(sub_list)] == sub_list:
            return True
    return False
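
As a quick sanity check, using the 'player'/'hater' vowel pair from earlier:

print(is_it_in(['eɪ', 'ɚ'], ['ʌ', 'uː', 'eɪ', 'ɚ', 'ɪ']))  # True: the pair appears contiguously
print(is_it_in(['eɪ', 'ɪ'], ['ʌ', 'uː', 'eɪ', 'ɚ', 'ɪ']))  # False: those two sounds are never adjacent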

Putting the above algorithm into Python was a bit of a headache for me. The function I finally came up with is shown below:

def find_rhymes(phonetic_vowels, trail=30):
    '''Group the vowel stream into the rhyming vowels and vowel sequences'''
    phonetic_vowels.append(' ')  # sentinel so the final vowel can still close a match
    i = 1
    rhymes = []
    total_vowels = len(phonetic_vowels)

    while i < total_vowels:

        vowel = phonetic_vowels[i]
        vowel_trail = phonetic_vowels[:i][-trail:]  # the `trail` sounds preceding this one
        vowels = [vowel]

        if vowel in vowel_trail:
            # we have a rhyme; keep extending it for as long as the whole
            # growing sequence still appears in the trail (a multi)
            i += 1
            if i >= total_vowels:
                break
            vowel = phonetic_vowels[i]
            vowels.append(vowel)
            while is_it_in(vowels, vowel_trail):
                i += 1
                if i >= total_vowels:
                    break
                vowel = phonetic_vowels[i]
                vowels.append(vowel)
            rhymes.append(vowels[:-1])  # drop the vowel that broke the match
        else:
            i += 1
    return rhymes

If we run this function on the same line as before, we get:

rhymes = find_rhymes(vowel_list)
print(rhymes)
[['ɪ'], ['eɪ', 'ɚ'], ['ɪ'], ['eɪ', 'ɚ'], ['ə'], ['ə'], ['ɑː'], ['ɑː'], ['uː'], ['eɪ', 'ɚ']]

Finally, I want a function that awards points based on this output list of rhyme sequences. The scheme I decided on allocates more points to longer multis, growing quadratically with length:

$$ f(\text{length}) = \text{length} \times \left( 1 + \frac{\text{length} - 1}{10} \right) $$

def points(rhymes):
    score = 0
    for rhyme in rhymes:
        multi = len(rhyme)
        score += multi*(1+(multi-1)/10)
    return score

print(points(rhymes))
13.600000000000001
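
As a sanity check, the rhyme list above contains seven single rhymes (7 × 1 × 1.0 = 7.0) and three two-vowel multis (3 × 2 × 1.1 = 6.6), which sum to the 13.6 printed (floating-point noise aside).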

And there we have it. The final step is simply to loop over all the saved text files and save the scores to a pandas dataframe. See my next post for the results!
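
For completeness, here's a minimal sketch of what that final loop might look like. The filenames are placeholders, and I've left the 'rhyme density' column as the raw points score; the exact normalisation can wait for the results:

import pandas as pd

rappers = ['rapper_one.txt', 'rapper_two.txt']  # hypothetical filenames from part 1

results = []
for text_file in rappers:
    freq_list = word_freq_list(text_file)
    if freq_list == 'Not enough words':
        continue  # skip anyone without 20,000 words of lyrics
    vowels = find_phonetic_vowels(text_file)
    results.append({'rapper': text_file.replace('.txt', ''),
                    'verbosity': len(freq_list) / 20000,
                    'rhyme density': points(find_rhymes(vowels))})

df = pd.DataFrame(results)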