Rap Skills Analysis: A Computational Approach part 1

Get Rich or Die Pying

Introduction

I am a self-confessed rap fan. I say self-confessed because part of me dreads mentioning this fact to people. As an overly privileged white dude from the home counties, I understand people's initial reaction sometimes. But listen, some beats just bang, and I've always felt strangely at home with the music of hip hop, despite being very much a tourist in the culture. Maybe it was escapism from my decidedly comfortable suburban existence, but for the last 10+ years hip hop has always been a part of my life.

You might also be aware that I am a massive nerd.

In this post I want to try to marry two seemingly disconnected worlds, hip hop and computer programming, in an attempt to gain some interesting data insights. The question I set out to answer was this: would it be possible to come up with a consistent statistical way to rank rappers’ ability? Of course, art is a totally subjective experience. Even Skynet could never fully appreciate the genius of Lil Wayne's infamous line

"Real G's move in silence, like lasagne"

And what exactly constitutes ability anyway? Can it be extracted mathematically? As much as computer power is advancing at a rapid rate, a true understanding of art is still a long way off.

But screw it, let's try anyway.

I chose two relatively simple data points to get a sense of a rap artist’s skill. These are:

  1. The number of unique words used in a block of 20,000 words taken from the rap artist’s lyrics
  2. The density of rhyming sounds the rap artist uses in the same 20,000 words

These both have the benefit of being quantifiable and directly comparable across artists. I will walk you through my process of extracting and displaying this data with Python 3, so I assume some knowledge of Python syntax and functionality, as well as basic command line and system know-how. That said, I am an intermediate programmer myself, so there will doubtless be many areas for improvement and optimisation. All suggestions and comments in this regard would be greatly appreciated.

However, if, like most normal people, you neither know nor care about programming you can skip straight to the end where the results are displayed.
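For the programmers still reading, the first data point is simple enough to sketch up front. This is a minimal sketch only: the function name unique_word_count and the regular expression used to split the text into words are assumptions made for illustration, not necessarily what the later analysis uses. The idea is just to take the first 20,000 words and count the distinct ones.

```python
import re

def unique_word_count(lyrics, block_size=20000):
    '''Count the distinct words in the first block_size words of lyrics.
       Hypothetical helper: the tokenising rule here is an assumption.'''
    # Lower-case first so that "Real" and "real" count as the same word
    words = re.findall(r"[a-z0-9']+", lyrics.lower())
    block = words[:block_size]    # only the first block_size words are considered
    return len(set(block))

sample = "Real G's move in silence, like lasagne. Real real."
print(unique_word_count(sample))
```

Fixing the block size is what makes the number comparable across artists: a longer text will almost always contain more distinct words than a shorter one.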

Part 1: Finding lyrics

1.1 Creating a master list of rappers

The first task in this process is to select the list of rappers you want to consider and download their lyrics en masse. I wanted to have a fairly comprehensive list of well-known rappers without having to recall every one by name. So, the first place I looked was the Wikipedia page ‘List of hip hop musicians’[link]. This has a pretty exhaustive list of over 1000 rap artists with Wikipedia pages. The first function I wrote simply goes to this page, downloads this data and outputs a list of all the rap artists’ names, using the requests and BeautifulSoup modules to parse the HTML webpage. (Note: all packages I use are either built in to Python or can be added easily with a simple pip install.)

We start with the Wikipedia URL, use requests to grab the HTML, then use BeautifulSoup to create a BS object and search it for every "a" (link) tag, whose title attributes hold the names we want. If you're not familiar with BeautifulSoup don't worry - neither was I. It basically takes a lump of raw HTML data and converts it into a nice custom object that allows you to access things quickly within it. They have some nice docs online which I suggest you read if you are unsure. Next, we loop through the links, extracting all the artist names and appending them to a list. The code I have was just a bit of trial and error. It's not the prettiest, but it gets the job done. Our function outputs a nice Python list of strings containing all the artist names.

from bs4 import BeautifulSoup
import requests

def master_list():
    ''' Call this function to return a python list containing a large 
        master list of rap artists names'''

    wiki_url = 'https://en.wikipedia.org/wiki/List_of_hip_hop_musicians'
    html = requests.get(wiki_url)
    html_bs = BeautifulSoup(html.text, "html.parser") 
    artists_raw = html_bs.find_all("a")
    artists_started = False
    artists = []

    for artist in artists_raw:
        try:
            title = artist['title']
            if title == '100 Kila':    #First rapper on the list
                artists_started = True
            if artists_started:
                if title[:4] != 'Edit' and title != 'Enlarge':    #Messy
                    if '(' in title:
                        title = title[:title.index('(')-1]
                    artists.append(title)
            if title == 'Zico':    #This is the last rapper on the list
                break
        except KeyError:    #Skip links that have no title attribute
            pass

    return artists

From this list, I chose a selection of rappers that I knew personally and thought were worth including. Please forgive me if I have omitted any big names! I did this manually, copying and pasting the master list, then removing rappers I didn't want to include. Laborious maybe, but I wanted to have control over each rapper in the list, making sure all my personal favourites made it on. This left the following:

my_artist_list = ['2 Chainz','21 Savage','50 Cent','Andre 3000','Big Boi','Ab-Soul','Abstract Rude','Ace Hood','Action Bronson',
'Aesop Rock','Afrika Bambaataa','Afroman','Akala','Akon','Anderson .Paak','Andre 3000','Angel Haze','A$AP Ferg','A$AP Rocky',
'Asher Roth','Astronautalis','Awol One','Awkwafina','Azealia Banks','Big Daddy Kane','B-Real','B.o.B','Benzino','Beastie Boys',
'Big Boi','Big K.R.I.T.','Big L','Big Pun','Big Scoob','Big Sean','Billy Woods','Birdman','Bishop Nehru','Bizarre','Bizzy Bone',
'BJ the Chicago Kid','Bobby Shmurda','Bone Crusher','Boosie Badazz','Brotha Lynch Hung','Brother Ali','Bubba Sparxxx','Bun B',
'Busdriver','Busta Rhymes','Cage','Canibus','CeeLo Green','Ceschi','Chamillionaire','Chance the Rapper','Chiddy Bang','Chief Keef',
'Childish Gambino','Chino XL','Chris Brown','Chris Webby','Classified','Common','Coolio','Count Bass D','Crooked I','D\'Angelo',
'Dan Bull','Danny Brown','Dappy','Daveed Diggs','Del the Funky Homosapien','Denzel Curry','Desiigner','Devin the Dude','Devlin',
'Dizzee Rascal','Dizzy Wright','DMX','Dr. Dre','Drake','E-40','Earl Sweatshirt','Eazy-E','El-P','Elephant Man','Eminem','Eve',
'Eyedea','Fetty Wap','Flavor Flav','Flo Rida','Flying Lotus','Fort Minor','Foxy Brown','Frank Ocean','Freddie Gibbs',
'French Montana','Funkmaster Flex','Future','The Game','Gangsta Boo','Ghostface Killah','Giggs','Grandmaster Caz',
'Grandmaster Flash','Greydon Square','Grieves','Gucci Mane','Gudda Gudda','Guilty Simpson','GZA','Hemlock Ernst','Heems',
'Hodgy Beats','Hopsin','Ice Cube','Ice-T','Iggy Azalea','Immortal Technique','Isaiah Rashad','J. Cole','Ja Rule','Jadakiss',
'Jaden Smith','Jam Master Jay','Jam Baxter','Jarren Benton','Jay Electronica','Jay Rock','Jay Z','Jean Grae','Jeremiah Jae',
'Jeremih','Jme','Joe Budden','Joell Ortiz','Joey Badass','Juelz Santana','Juicy J','Lauryn Hill','Kanye West','Kendrick Lamar',
'Kevin Gates','Kid Cudi','Kid Ink','Kid Rock','Killer Mike','Kodak Black','Victor Vazquez','Kool G Rap','Kool A.D.','Kool Keith',
'Krayzie Bone','Krizz Kaliko','KRS-One','Lil\' Kim','Lil Jon','Lil Uzi Vert','Lil Wayne','Lil Yachty','Little Simz','LL Cool J',
'Lloyd Banks','Logic','Louis Logic','Lowkey','Ludacris','Lupe Fiasco','M.I.A.','Mac Lethal','Mac Miller','Machine Gun Kelly',
'Macklemore','Masta Ace','Masta Killa','Master P','MC Ren','Meek Mill','Method Man','MF Doom','MF Grimm','Mick Jenkins','Milo',
'Mims','Missy Elliott','Mos Def','Mr. Muthafuckin\' eXquire','Mr. Porter','Murs','Nas','Nate Dogg','Nelly','Nicki Minaj',
'Nitty Scott, MC','Noname','Nocando','The Notorious B.I.G.','Obie Trice','Oddisee','OG Maco','Ol\' Dirty Bastard',
'Open Mike Eagle','Q-Tip','Pharoahe Monch','Pharrell Williams','Phife Dawg','Pimp C','Pitbull','P.O.S.','Prince Paul',
'Professor Green','Proof','Prozak','Pusha T','Quasimoto','R. Kelly','R.A. the Rugged Man','Raekwon','Rakim','Ras Kass','Redman',
'Rich Homie Quan','Rick Ross','Riff Raff','Rittz','Roc Marciano','Roots Manuva','Royce da 5\'9\"','RZA','Snoop Dogg','Sadistik',
'Sage Francis','Scarface','Schoolboy Q','Sean Paul','Sean Price','Sho Baraka','Sir Mix-a-Lot','Slick Rick','Slug','Snoop Dogg',
'Snow Tha Product','Soulja Boy','SpaceGhostPurrp','Sticky Fingaz','Swizz Beatz','SwizZz','SZA','T.I.','T-Pain','T.O.P',
'Taio Cruz','Talib Kweli','Tech N9ne','Timbaland','Tinie Tempah','Too Short','Tory Lanez','Travis Scott','Traxamillion',
'Trey Songz','Trick-Trick','Tupac Shakur','Twista','Twisted Insane','Ty Dolla Sign','Tyga','Tyler, The Creator','U-God',
'Vast Aire','Vic Mensa','Vince Staples','Violent J','Xzibit','Waka Flocka Flame','Wale','Watsky','Will Smith','Will.i.am',
'Wiz Khalifa','Wrekonize','XXXTENTACION','Young Jeezy','Yelawolf','Young Buck','Young Thug','Your Old Droog','Yukmouth',
'Yung Lean','Zebra Katz']
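Because the list was put together by hand, a few names appear in it twice ('Andre 3000', 'Big Boi' and 'Snoop Dogg'), which would mean downloading the same lyrics more than once. A minimal sketch of one way to tidy this up while keeping the order; the helper name dedupe and the short sample list are just for illustration.

```python
def dedupe(names):
    '''Remove repeated names while keeping the original order.'''
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(names))

# A short sample standing in for the full list above
sample = ['50 Cent', 'Andre 3000', 'Big Boi', 'Akon', 'Andre 3000', 'Big Boi']
print(dedupe(sample))
```

Running my_artist_list = dedupe(my_artist_list) before downloading anything would apply the same clean-up to the full list.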

1.2 Accessing lyrics

Now comes the slightly more complex task of downloading lyrics for every artist listed. To do this we're going to use the website Genius.com, by far the largest lyric database online. They actually have an API which you can use; however, it looked like more trouble than it was worth, and I couldn't get hold of a developer's authentication token. The benefit of Genius is that it is very extensive and fairly standardised. The drawback is that it's a flashy website with a lot of JavaScript, which increases load times and makes it a little difficult to navigate in Python. It is a bit slow, but using just requests and BeautifulSoup it's possible to access each artist's page and download their lyrics.

The first step then is to find the correct URL for each artist. With Genius, this is relatively simple since the artist page URL tends to follow a very simple formula: https://genius.com/artists/Artist-name where the first letter of the artist's name is capitalised and the rest is lower-case, multiple words are joined with a hyphen, and any apostrophes and full stops are removed. Examples would be:

  • https://genius.com/artists/Meek-mill
  • https://genius.com/artists/Method-man
  • https://genius.com/artists/Mf-doom

Our first function therefore converts the names as written in the above list into the correct artist URLs on Genius.

def name_to_url(name):
    '''Take an artist's name and output the link to their Genius page'''

    base_url = 'https://genius.com/artists/'
    name = name.replace('.', '')    #remove apostrophes and full stops
    name = name.replace('\'', '')
    split_name = name.split()
    name_new = split_name[0]
    name_new = name_new[0].upper() + name_new[1:].lower()    #ensure first letter of first word is capitalised
    for word in split_name[1:]:
        name_new= name_new + '-' + word.lower()    #add words together with hyphen

    return base_url+name_new

for name in ['KRS-One', 'LL Cool J', 'MF Doom']:
    print(name_to_url(name))
https://genius.com/artists/Krs-one
https://genius.com/artists/Ll-cool-j
https://genius.com/artists/Mf-doom

I found the best way to access individual songs from here was through the albums linked from this page. Unfortunately, without messing around with the JavaScript, you only have access to a few songs directly from the artist page, whereas you can reach six albums, each of which links directly to all of its songs. So my next step was to find links to as many albums as possible.

from bs4 import BeautifulSoup
import requests

def get_albums(artist_url):
    html = requests.get(artist_url)
    html_bs = BeautifulSoup(html.text, "html.parser")
    [h.extract() for h in html_bs('script')]
    top_albums = html_bs.find_all("a", class_="vertical_album_card")
    urls = [album['href'] for album in top_albums]   
    return urls

# Test: Find the links to the 6 most recent MF Doom albums on Genius 
MF_Doom_albums = get_albums(name_to_url('MF Doom'))
for album_link in MF_Doom_albums:
    print(album_link)
https://genius.com/albums/Mf-doom/Born-like-this
https://genius.com/albums/Mf-doom/Metalfingers-presents-special-herbs-the-box-set-vol-0-9-disc-2
https://genius.com/albums/Mf-doom/Metal-fingers-presents-special-herbs-the-box-set-vol-0-9-disc-1
https://genius.com/albums/Mf-doom/Mm-food
https://genius.com/albums/Mf-doom/Special-blends-volume-1-2
https://genius.com/albums/Mf-doom/Dead-bent-doomsday

Next, given an album URL, we want to extract the URLs and titles of all the songs on that album.

def get_songs(album_url):
    '''Take an album URL link and find all the song titles and their related URLs'''
    html = requests.get(album_url)
    html_bs = BeautifulSoup(html.text, "html.parser")
    songs = html_bs.find_all("a", class_="u-display_block")
    song_urls = [song['href'] for song in songs]  #href to song
    titles = [song.get_text() for song in html_bs.find_all("h3", class_="chart_row-content-title")]
    return song_urls, titles

# Test: Print all the titles and URLs for the fourth MF Doom album listed above (MM.. FOOD)
doom_urls, doom_titles = get_songs(MF_Doom_albums[3])
for url, title in zip(doom_urls, doom_titles):
    print(url, title[:title.index('Lyrics')])
https://genius.com/Mf-doom-beef-rapp-lyrics 
              Beef Rapp

https://genius.com/Mf-doom-hoe-cakes-lyrics 
              Hoe Cakes

https://genius.com/Mf-doom-potholderz-lyrics 
              Potholderz (Ft. Count Bass D)

https://genius.com/Mf-doom-one-beer-lyrics 
              One Beer

https://genius.com/Mf-doom-deep-fried-frenz-lyrics 
              Deep Fried Frenz

https://genius.com/Mf-doom-poo-putt-platter-lyrics 
              Poo Putt Platter

https://genius.com/Mf-doom-fillet-o-rapper-lyrics 
              Fillet-O-Rapper

https://genius.com/Mf-doom-gumbo-lyrics 
              Gumbo

https://genius.com/Mf-doom-fig-leaf-bi-carbonate-lyrics 
              Fig Leaf Bi-Carbonate

https://genius.com/Mf-doom-kon-karne-lyrics 
              Kon Karne

https://genius.com/Mf-doom-guinnesses-lyrics 
              Guinnesses (Ft. 4ize & Stahhr)

https://genius.com/Mf-doom-kon-queso-lyrics 
              Kon Queso

https://genius.com/Mf-doom-rapp-snitch-knishes-lyrics 
              Rapp Snitch Knishes (Ft. Mr. Fantastik)

https://genius.com/Mf-doom-vomitspit-lyrics 
              Vomitspit

https://genius.com/Mf-doom-kookies-lyrics 
              Kookies

https://genius.com/Mf-doom-mm-food-tracklist-album-artwork-annotated 
              MM.. FOOD [Tracklist + Album Artwork]

The next step is to get the raw text of the lyrics for each song. One thing to note here is that you must be running Python 3 for the following to work as expected. This is because of the way Python 3 handles unicode strings. It's possible to do in Python 2, but you will have to play around with converting the text from unicode to ASCII.

def get_lyrics(song_url):
    '''Take a song URL and gather the raw text containing the lyrics for that song'''
    html = requests.get(song_url)
    html_bs = BeautifulSoup(html.text, "html.parser")
    [h.extract() for h in html_bs('script')]
    lyrics = html_bs.find("div", class_="lyrics").get_text() 
    return lyrics

# Test: Find the lyrics to the 13th track on that album (Rapp Snitch Knishes)
doom_lyrics = get_lyrics(doom_urls[12])
print(doom_lyrics)
[Hook 2X: Mr. Fantastik ]
Rap snitches, telling all their business
Sit in the court and be their own star witness
Do you see the perpetrator? - Yeah, I'm right here
F*** around, get the whole label sent up for years

[Verse 1: Mr. Fantastik]
Type profile low, like A in Paid in Full
Attract heavy cash 'cause the game's centrifugal
Mister Fantastik, long dough like elastic
Guard my life with twin Glocks that's made out of plastic
Can't stand a brown nosing n**** fake a** b******
Admiring my style, tour bus through Manhattan
Plotting, plan the quickest, my flow's the sickest
My h*** be the thickest, my dro the stickiest
Street n****, stamped and bonafide
When beef jump n****s come get me 'cause they know I ride
True to the ski mask, New York's my origin
Play a fake gangsta like a old accordion
According to him, when the D's rushed in
Complication from the wire testimony was thin
Caused his man to go up north, the ball hit 'em again
Lame rap snitch n**** even told on the Mexican

[Hook 2X: Mr. Fantastik]
Rap snitches, telling all their business
Sit in the court and be their own star witness
Do you see the perpetrator? - Yeah, I'm right here
F*** around, get the whole label sent up for years

[Verse 2: MF DOOM]
True, there's rules to this shit, fools dare care
Everybody wanna rule the world with tears for fear
Yeah, yeah, tell 'em tell it on the mountain hill
Running up they mouth bill, everybody doubting still
Informer, keep it up and get tested
Pop through your bubble vest or double-breasted
He keep a lab down south in the little beast
So much heat you woulda thought it was the Middle East
A little grease always keeps the wheels a spinning
Like sitting on twenty threes to get the squealers grinning
Hitting on many trees, feel real linen
Spitting on enemies, get the steel for tin men
Where no brains but gum flap
He said his gun clap, then he fled after one slap (Pap!)
Son, shut your trap, save it for the b******
Mmm, delicious, rapp snitch knishes

[Outro: Mr. Fantastik & (MF DOOM)]
You know what I'm saying?
(It's terrible)
Crazy, man, I'm just analyzing this whole game
This is bugged out, man, n****s is snitching
Telling on they own self
(It's a horror, yo)
Fuck around and get everybody bagged, man
(Atrocities)
F*** around and get yo mama bagged, n****
You know your grandmama used to be bootlegging...
Fake hustling n****

Perpetrator? Yeah, I'm right here...

1.3 Cleaning Lyrics

One thing you might notice from the above example is that a number of songs on the album feature other artists. This is problematic as we want to analyse the lyrics of a single artist. One simple solution would be to discard any song that has another artist featuring on it. However, this seems unnecessarily wasteful, and for artists that typically have a lot of guests on their albums it may prevent the artist from reaching the 20,000 word threshold.

Luckily, Genius neatly formats featured verses in its lyrics: a verse by a featuring artist almost always starts with a header in square brackets that names them, like [Verse 1: Mr. Fantastik] above. This means that, if we are expecting a featuring artist, we can write code to eliminate their verses from the song. The problem is this: the guest rapper may have one or more sections, each starting with a header containing their name and ending with a double newline character. We want to return the lyrics text with these sections removed.

My function splits the lyrics into verses, checks whether the featuring artist's name appears at the start of each verse, and removes every verse identified as not the main artist's. This is by no means perfect. Although Genius is relatively standardised, there will be exceptions, and bits of lyrical content not belonging to the main artist may sneak through. However, the simple rule in general gives relatively good results.

def remove_guest_verse(lyrics, name):
    lyrics += '\n\n'
    verses_listed = lyrics.split('\n\n')
    new_verses = []
    for verse in verses_listed:
        verse_start = verse[:len(name)+50]
        if (name in verse_start):
            pass
        else:
            new_verses.append(verse)
    lyrics = ''
    for verse in new_verses:
        lyrics += verse + '\n\n'
    return(lyrics)

song_doom_only = remove_guest_verse(doom_lyrics, 'Mr. Fantastik')
print(song_doom_only)
[Verse 2: MF DOOM]
True, there's rules to this shit, fools dare care
Everybody wanna rule the world with tears for fear
Yeah, yeah, tell 'em tell it on the mountain hill
Running up they mouth bill, everybody doubting still
Informer, keep it up and get tested
Pop through your bubble vest or double-breasted
He keep a lab down south in the little beast
So much heat you woulda thought it was the Middle East
A little grease always keeps the wheels a spinning
Like sitting on twenty threes to get the squealers grinning
Hitting on many trees, feel real linen
Spitting on enemies, get the steel for tin men
Where no brains but gum flap
He said his gun clap, then he fled after one slap (Pap!)
Son, shut your trap, save it for the bitches
Mmm, delicious, rapp snitch knishes

Perpetrator? Yeah, I'm right here...
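The guest names passed to remove_guest_verse have to come from somewhere, and the song titles printed earlier already carry them in the form "(Ft. Name)". A minimal sketch of pulling them out, assuming Genius keeps that "(Ft. A, B & C)" format; the helper name featured_artists is hypothetical.

```python
import re

def featured_artists(title):
    '''Return the guest names found in a "(Ft. ...)" part of a song title.
       Hypothetical helper; assumes the "(Ft. A, B & C)" title format.'''
    match = re.search(r'\(Ft\. (.*?)\)', title)
    if match is None:
        return []    # no guests listed in the title
    # Names are separated by commas, with "&" before the last one
    return match.group(1).replace(', ', ' & ').split(' & ')

for title in ['One Beer', 'Rapp Snitch Knishes (Ft. Mr. Fantastik)', 'Guinnesses (Ft. 4ize & Stahhr)']:
    print(title, featured_artists(title))
```

Each name in the returned list can then be passed to remove_guest_verse in turn.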

We need two last helper functions before creating the main function that puts this all together. The first is a cleaner function. It removes all lines that start with a section marker like [Chorus] or with an opening bracket, along with every line that appears more than once. The second is a small (perhaps unnecessary, but I like it) function to save our lyrics to a .txt file.

def clean(lyrics):
    line_list = lyrics.split('\n')
    line_list = [line + '\n' for line in line_list if (('[' not in line[:2]) and ('(' not in line[:2] ) ) and line_list.count(line)==1]
    return "".join(line_list)

cleaned = clean(song_doom_only)
print(cleaned)
True, there's rules to this shit, fools dare care
Everybody wanna rule the world with tears for fear
Yeah, yeah, tell 'em tell it on the mountain hill
Running up they mouth bill, everybody doubting still
Informer, keep it up and get tested
Pop through your bubble vest or double-breasted
He keep a lab down south in the little beast
So much heat you woulda thought it was the Middle East
A little grease always keeps the wheels a spinning
Like sitting on twenty threes to get the squealers grinning
Hitting on many trees, feel real linen
Spitting on enemies, get the steel for tin men
Where no brains but gum flap
He said his gun clap, then he fled after one slap (Pap!)
Son, shut your trap, save it for the bitches
Mmm, delicious, rapp snitch knishes
Perpetrator? Yeah, I'm right here...
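One detail worth knowing about clean: a line that appears more than once is dropped entirely, not reduced to a single copy. It also calls line_list.count for every line, which rescans the whole song each time. That is fine for one track, but a sketch of the same rule using collections.Counter counts everything in one pass. The name clean_fast is hypothetical, and this is an optional variation rather than the version used here.

```python
from collections import Counter

def clean_fast(lyrics):
    '''Same filtering rule as clean, but every line is counted once up front.'''
    line_list = lyrics.split('\n')
    counts = Counter(line_list)    # one pass instead of a count() call per line
    kept = [line + '\n' for line in line_list
            if '[' not in line[:2] and '(' not in line[:2] and counts[line] == 1]
    return ''.join(kept)

sample = "[Hook]\nRap snitches\nRap snitches\n(It's terrible)\nMmm, delicious"
print(clean_fast(sample))
```

In the sample, the header, the bracketed ad-lib and both copies of the repeated line are removed, leaving only the last line.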

If you're running this as a script on Windows, I recommend un-commenting the two lines at the top of the code below, creating a folder called text next to the script, and passing path to the function so that all .txt files are saved to that folder.

# import os
# path = os.path.dirname(os.path.realpath(__file__)).replace('\\', '/')+'/text/'

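def lyrics_to_txt(lyrics, artist, path):
    '''Write the lyrics to a file called <artist>.txt inside path'''
    #the with block closes the file even if the write fails
    with open(path+artist+'.txt', 'w', encoding='utf-8') as thefile:
        thefile.write(lyrics)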

The following code puts all previous functions together into one coherent subroutine.

def save_lyrics(artist):
    url = name_to_url(artist)
    album_urls =  get_albums(url)
    word_count = 0
    all_lyrics = ''

    if len(album_urls)>= 4:
        #for each album
        for album_url in album_urls:
            song_urls, song_titles = get_songs(album_url)

            #for each song
            for song_url, title in zip(song_urls, song_titles):
                feat=[]
                #find featuring artists
                if 'Ft' in title:
                    feat = title[title.index('(')+5:title.index(')')].replace(', ', ' & ').split(' & ')

                #get lyrics
                lyrics = get_lyrics(song_url)

                #remove guest verses and clean up
                for guest in feat:
                    lyrics = remove_guest_verse(lyrics, guest)
                lyrics = clean(lyrics)

                all_lyrics = all_lyrics + lyrics + '\n\n'

                #check the word count
                word_count += len(lyrics.replace('\n', ' ').split(' ')) 
                if word_count>= 20000:
                    print(artist + ' complete')

                    print(all_lyrics, artist)
                    return 

        print(artist + ' Couldn\'t get to 20,000 words')

    else:
        if len(album_urls)==0:
            print(artist + ' Problem loading page')
        else:
            print(artist + ' Not enough albums')

#save_lyrics('2 Chainz')

I've added in some debugging print lines to help identify the source of the problem if the download fails. This may be because the code failed to find the artist's page (for example, Tupac Shakur fails since the Genius page has the artist written as 2Pac). There are also a number of artists that fail to meet the 20,000 word threshold. Sometimes, this is because there really aren't enough words in their entire discography. Other times it's because the six listed albums don't have enough words. In this case, the only way you can solve the problem is by manually adding more lyrics.
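For artists whose Genius page uses a different name, one option is a small lookup table that is checked before the usual URL rule is applied. A minimal sketch: the SLUG_OVERRIDES table and the artist_url function are hypothetical names, and the '2pac' entry is an assumption derived from the 2Pac spelling mentioned above, so check it in a browser before relying on it.

```python
# Hypothetical override table: name as written in my list -> Genius URL ending.
# The '2pac' value is an assumption, not a verified URL.
SLUG_OVERRIDES = {'Tupac Shakur': '2pac'}

def artist_url(name, overrides=SLUG_OVERRIDES):
    '''Build the Genius artist URL, using an override where one exists.'''
    base_url = 'https://genius.com/artists/'
    if name in overrides:
        return base_url + overrides[name]
    # Otherwise apply the usual rule: strip full stops and apostrophes,
    # capitalise the first word, lower-case the rest, join with hyphens
    words = name.replace('.', '').replace("'", '').split()
    slug = '-'.join([words[0].capitalize()] + [word.lower() for word in words[1:]])
    return base_url + slug

for name in ['Tupac Shakur', 'MF Doom']:
    print(artist_url(name))
```

New entries can be added to the table whenever the 'Problem loading page' message turns out to be a naming mismatch.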

Hooray! We finally have a complete function for gathering lyrics. Part one complete. Now for the fun stuff: lyric analysis.