import requests
from bs4 import BeautifulSoup
28 Web Scraping — Part 2
In this lesson, we’re going to learn more about scraping data with the Python libraries requests and BeautifulSoup. We’re also going to introduce regular expressions, which will help us extract and clean data in a more fine-grained way.
We will cover how to:
- Programmatically access the text of a web page
- Extract certain HTML elements
- Extract and clean data with regular expressions
Our goal for this lesson is to build a web scraping function called get_all_songs_from_album
which will get all the song titles from any album on Genius.com.
Import Requests and BeautifulSoup
We’re again going to use the requests
library and the BeautifulSoup
library to scrape our list of album songs.
The first album that we’re going to scrape is Missy Elliott’s “Under Construction” (2002), which debuted at No. 3 on The Billboard Top 200 charts.
from IPython.display import IFrame
"https://www.youtube.com/embed/cjIvu7e6Wq8?start=31", width='500', height='400') IFrame(
Get HTML Data and Extract Text
= requests.get("https://genius.com/albums/Missy-elliott/Under-construction")
response = response.text html_string
Transform into BeautifulSoup Document
= BeautifulSoup(html_string, "html.parser") document
Your Turn!
We want to extract just the song titles from Missy Elliott’s album “Under Construction.” Turn on your web browser’s “Inspect” function and find the HTML tag associated with each song title.
https://genius.com/albums/Missy-elliott/Under-construction
Select “Click to show” to see the answer.
= document.find_all("h3")
song_title_tags song_title_tags
[<h3 class="chart_row-content-title">
Intro/Go To The Floor
<span class="chart_row-content-title-subtitle">Lyrics</span>
</h3>,
<h3 class="chart_row-content-title">
Bring the Pain (Ft. Method Man)
<span class="chart_row-content-title-subtitle">Lyrics</span>
</h3>,
<h3 class="chart_row-content-title">
Gossip Folks (Ft. Ludacris)
<span class="chart_row-content-title-subtitle">Lyrics</span>
</h3>,
<h3 class="chart_row-content-title">
Work It
<span class="chart_row-content-title-subtitle">Lyrics</span>
</h3>,
<h3 class="chart_row-content-title">
Back in the Day (Ft. JAY-Z)
<span class="chart_row-content-title-subtitle">Lyrics</span>
</h3>,
<h3 class="chart_row-content-title">
Funky Fresh Dressed (Ft. Ms. Jade)
<span class="chart_row-content-title-subtitle">Lyrics</span>
</h3>,
<h3 class="chart_row-content-title">
Pussycat
<span class="chart_row-content-title-subtitle">Lyrics</span>
</h3>,
<h3 class="chart_row-content-title">
Nothing Out There for Me (Ft. Beyoncé)
<span class="chart_row-content-title-subtitle">Lyrics</span>
</h3>,
<h3 class="chart_row-content-title">
Slide
<span class="chart_row-content-title-subtitle">Lyrics</span>
</h3>,
<h3 class="chart_row-content-title">
Play That Beat
<span class="chart_row-content-title-subtitle">Lyrics</span>
</h3>,
<h3 class="chart_row-content-title">
Ain't That Funny
<span class="chart_row-content-title-subtitle">Lyrics</span>
</h3>,
<h3 class="chart_row-content-title">
Hot
<span class="chart_row-content-title-subtitle">Lyrics</span>
</h3>,
<h3 class="chart_row-content-title">
Can You Hear Me (Ft. TLC)
<span class="chart_row-content-title-subtitle">Lyrics</span>
</h3>,
<h3 class="chart_row-content-title">
Work It (Remix) (Ft. 50 Cent)
<span class="chart_row-content-title-subtitle">Lyrics</span>
</h3>,
<h3 class="annotation_label">Does this album have any certifications?</h3>,
<h3 class="annotation_label">How did this album chart?</h3>,
<h3>ICONOLOGY</h3>,
<h3>Misdemeanor</h3>]
Now write a for
loop that extracts the text from each song title tag and .appends()
it to a list called song_titles
:
= []
song_titles for song in song_title_tags:
= song.text
song_title_missy song_titles.append(song_title_missy)
song_titles
['\n Intro/Go To The Floor\n Lyrics\n',
'\n Bring the Pain (Ft.\xa0Method\xa0Man)\n Lyrics\n',
'\n Gossip Folks (Ft.\xa0Ludacris)\n Lyrics\n',
'\n Work It\n Lyrics\n',
'\n Back in the Day (Ft.\xa0JAY-Z)\n Lyrics\n',
'\n Funky Fresh Dressed (Ft.\xa0Ms.\xa0Jade)\n Lyrics\n',
'\n Pussycat\n Lyrics\n',
'\n Nothing Out There for Me (Ft.\xa0Beyoncé)\n Lyrics\n',
'\n Slide\n Lyrics\n',
'\n Play That Beat\n Lyrics\n',
"\n Ain't That Funny\n Lyrics\n",
'\n Hot\n Lyrics\n',
'\n Can You Hear Me (Ft.\xa0TLC)\n Lyrics\n',
'\n Work It (Remix) (Ft.\xa050\xa0Cent)\n Lyrics\n',
'Does this album have any certifications?',
'How did this album chart?',
'ICONOLOGY',
'Misdemeanor']
Now transform that same for
loop into a list comprehension.
= [song.text for song in song_title_tags] missy_song_titles
missy_song_titles
['\n Intro/Go To The Floor\n Lyrics\n',
'\n Bring the Pain (Ft.\xa0Method\xa0Man)\n Lyrics\n',
'\n Gossip Folks (Ft.\xa0Ludacris)\n Lyrics\n',
'\n Work It\n Lyrics\n',
'\n Back in the Day (Ft.\xa0JAY-Z)\n Lyrics\n',
'\n Funky Fresh Dressed (Ft.\xa0Ms.\xa0Jade)\n Lyrics\n',
'\n Pussycat\n Lyrics\n',
'\n Nothing Out There for Me (Ft.\xa0Beyoncé)\n Lyrics\n',
'\n Slide\n Lyrics\n',
'\n Play That Beat\n Lyrics\n',
"\n Ain't That Funny\n Lyrics\n",
'\n Hot\n Lyrics\n',
'\n Can You Hear Me (Ft.\xa0TLC)\n Lyrics\n',
'\n Work It (Remix) (Ft.\xa050\xa0Cent)\n Lyrics\n',
'Does this album have any certifications?',
'How did this album chart?',
'ICONOLOGY',
'Misdemeanor']
Are there things in your list that aren’t song titles? If so, use .find_all()
with more specific HTML attributes attrs={}
= document.find_all("h3", attrs={"class": "chart_row-content-title"}) song_title_tags
missy_song_titles
['\n Intro/Go To The Floor\n Lyrics\n',
'\n Bring the Pain (Ft.\xa0Method\xa0Man)\n Lyrics\n',
'\n Gossip Folks (Ft.\xa0Ludacris)\n Lyrics\n',
'\n Work It\n Lyrics\n',
'\n Back in the Day (Ft.\xa0JAY-Z)\n Lyrics\n',
'\n Funky Fresh Dressed (Ft.\xa0Ms.\xa0Jade)\n Lyrics\n',
'\n Pussycat\n Lyrics\n',
'\n Nothing Out There for Me (Ft.\xa0Beyoncé)\n Lyrics\n',
'\n Slide\n Lyrics\n',
'\n Play That Beat\n Lyrics\n',
"\n Ain't That Funny\n Lyrics\n",
'\n Hot\n Lyrics\n',
'\n Can You Hear Me (Ft.\xa0TLC)\n Lyrics\n',
'\n Work It (Remix) (Ft.\xa050\xa0Cent)\n Lyrics\n',
'\n Pepsi Super Bowl XLIX Halftime Show by\xa0NFL (Ft.\xa0Katy\xa0Perry, Lenny\xa0Kravitz & Missy\xa0Elliott)\n Lyrics\n']
28.1 Regular Expressions
Great! Now we have have our list of song titles from Missy Elliot’s album “Under Construction.” But if you notice, these song titles are pretty messy, and we need to clean them up.
To do so, we’re going to use built-in string methods and a Python library called re
, short for regular expressions.
Regular expressions are basically like a very sophisticated find-and-replace. Regular expression are not exclusive to Python and are used in many programming languages as well as in search engines, text editors, and word processors.
28.1.1 Import Regular Expressions Library
import re
To practice with regular expressions, we’re going to use a sample messy song title from our messy song titles list.
= "\n Back in the Day (Ft.\xa0JAY-Z)\n Lyrics\n" sample_song
Replace with Python String Method
Remember the string method .replace()
? With this built-in string method, we can easily get rid of the new line characters \n
or the word “Lyrics” from our sample_song
, which is very useful.
"\n", "") sample_song.replace(
' Back in the Day (Ft.\xa0JAY-Z) Lyrics'
"Lyrics", "") sample_song.replace(
'\n Back in the Day (Ft.\xa0JAY-Z)\n \n'
28.1.2 Replace with Regex
However, with regular expressions, we can replace strings with even more power and flexibility.
To replace a string with regular expressions, we use re.sub(old_pattern, new_pattern, text_that_contains_pattern)
. We can do exactly the same thing that we did with the built-in string method .replace()
.
sample_song
'\n Back in the Day (Ft.\xa0JAY-Z)\n Lyrics\n'
"\n", "", sample_song) re.sub(
' Back in the Day (Ft.\xa0JAY-Z) Lyrics'
28.2 Special RegEx Characters
But regular expressions have certain characters with special pattern-matching powers, which is what allows us to do more cleaning, manipulating, and searching than with basic string methods. Below are some of the special regular expression characters.
Regular Expression Pattern | Matches |
---|---|
. |
any character |
\w |
word |
\W |
NOT word |
\d |
digit |
\D |
NOT digit |
\s |
whitespace |
\S |
NOT whitespace |
[abc] |
Any of abc |
[^abc] |
Not any of abc |
(abc) |
Specific capture of “abc” |
+ |
1 or more instances |
* |
0 or more instances |
? |
0 or 1 instance |
You can explore and experiment with regular expression characters and combinations at Regexr.com.
We can replace anything that is not a word \W
with ” “:
"\W", " ", sample_song) re.sub(
' Back in the Day Ft JAY Z Lyrics '
Replace anything that is a word \w
with ” “:
"\w", " ", sample_song) re.sub(
'\n ( .\xa0 - )\n \n'
The character +
means “match one or more instance” of the pattern, which allows us to remove multiple not word patterns in a row.
"\W+", " ", sample_song) re.sub(
' Back in the Day Ft JAY Z Lyrics '
28.3 Compile Pattern
An efficient way to build and save a regular expression pattern is with re.compile()
= re.compile("\W+")
not_word_pattern " ", sample_song) re.sub(not_word_pattern,
' Back in the Day Ft JAY Z Lyrics '
28.4 Search for Pattern
In addition to replacing text, we can also find and return text. With re.search()
, we can find and return any particular pattern. The re.search()
function returns something called a “match object,” which we can access with .group()
.
For example, searching with the pattern \w+
will return the very first word in sample_song
:
= re.compile("\w+")
word_pattern word_pattern.search(sample_song)
<re.Match object; span=(15, 19), match='Back'>
0) word_pattern.search(sample_song).group(
'Back'
28.5 Find All Instances of Pattern
The function re.findall()
will return a list of every instance of a particular pattern.
= re.compile("\w+")
word_pattern word_pattern.findall(sample_song)
['Back', 'in', 'the', 'Day', 'Ft', 'JAY', 'Z', 'Lyrics']
When you combine special regular expression characters, you can make your pattern matching very specific and very powerful. If we had a document that contains a bunch of email addresses, we could use the pattern [\w.]+@[\w.]+
to find and extract the words that appear on other side of the @
character, aka find and extract all the email addresses.
= "The important email addresses are important@cool.com, signficant@sweet.org"
text_with_emails = re.findall('[\w.]+@[\w.]+', text_with_emails) extracted_emails
extracted_emails
['important@cool.com', 'signficant@sweet.org']
28.6 Match Before a Certain String
For our song titles, we might want to extract everything that comes before “(Ft.)” because we don’t care as much about the featured artists, and because the featured artists makes the song titles really long. To match everything that comes before a certain string, we can use the pattern .*(?=desired_pattern)
which matches 0 or more *
of any character .
that comes before (?=)
the string “desired_pattern.”
= re.compile(".*(?=Ft)")
before_ft_pattern 0) before_ft_pattern.search(sample_song).group(
' Back in the Day ('
28.7 Backslash Escape Characters
Nice! We got everything before the featured artist. Well, almost. We still have a weird, lingering open parentheses. That’s because we were matching “Ft” not “(Ft”. Let’s match everything before “(Ft” instead.
To do so, we’re going to have to make a slight adjustment. Remember that parentheses ()
are special regular expression characters. To make clear that we mean a literal parentheses and not a special regular expression character, we have to use an escape backslash \
before the character.
= re.compile(".*(?=\(Ft)")
before_ft_pattern = before_ft_pattern.search(sample_song).group(0)
clean_sample_song_title clean_sample_song_title
' Back in the Day '
28.8 Strip Leading and Trailing Whitespace
The last thing we’ll do to clean up our song title is to use the built-in string method .strip()
which strips leading and trailing whitespace.
clean_sample_song_title.strip()
'Back in the Day'
28.9 Build Functions and Put It All Together
Let’s put all this cleaning together in a function called clean_up
. It will match and strip everything before “(Ft.)” if the song title contains a featured artist, and it will remove the word “Lyrics” and strip whitespace if the song title does not contain “(Ft.)”.
def clean_up(song_title):
if "Ft" in song_title:
= re.compile(".*(?=\(Ft)")
before_ft_pattern = before_ft_pattern.search(song_title).group(0)
song_title_before_ft = song_title_before_ft.strip()
clean_song_title
else:
= song_title.replace("Lyrics", "")
song_title_no_lyrics = song_title_no_lyrics.strip()
clean_song_title
return clean_song_title
for song in missy_song_titles] [clean_up(song)
['Intro/Go To The Floor',
'Bring the Pain',
'Gossip Folks',
'Work It',
'Back in the Day',
'Funky Fresh Dressed',
'Pussycat',
'Nothing Out There for Me',
'Slide',
'Play That Beat',
"Ain't That Funny",
'Hot',
'Can You Hear Me',
'Work It (Remix)',
'Pepsi Super Bowl XLIX Halftime Show by\xa0NFL']
We were able to extract the song titles for Missy Elliott’s album “Under Construction.” Success! But now we want to make a function that can do the same thing for any artist and album title.
Take a look at Beyonce’s album “Lemonade” on Genius.com and see how the web page compares to Missy Elliott’s “Under Construction.” They look extremely similar, right? Because all Genius album pages are identical, we can use the same code that we did for Missy Elliott and just substitute in different artist and album names with an f-string for the Genius URL:
f"https://genius.com/albums/{artist}/{album_name}"
def get_all_songs_from_album(artist, album_name):
= artist.replace(" ", "-")
artist = album_name.replace(" ", "-")
album_name
= requests.get(f"https://genius.com/albums/{artist}/{album_name}")
response = response.text
html_string = BeautifulSoup(html_string, "html.parser")
document = document.find_all("h3", attrs={"class": "chart_row-content-title"})
song_title_tags = [song_title.text for song_title in song_title_tags]
song_titles
= []
clean_songs for song_title in song_titles:
= clean_up(song_title)
clean_song
clean_songs.append(clean_song)
return clean_songs
'Beyonce', 'Lemonade') get_all_songs_from_album(
['Pray You Catch Me',
'Hold Up',
"Don't Hurt Yourself",
'Sorry',
'6 Inch',
'Daddy Lessons',
'Love Drought',
'Sandcastles',
'Forward',
'Freedom',
'All Night',
'Formation',
'Sorry (Original Demo)',
'Lemonade Film (Script)']
'Taylor Swift', 'Red') get_all_songs_from_album(
['State of Grace',
'Red',
'Treacherous',
'I Knew You Were Trouble',
'All Too Well',
'22',
'I Almost Do',
'We Are Never Ever Getting Back Together',
'Stay Stay Stay',
'The Last Time',
'Holy Ground',
'Sad Beautiful Tragic',
'The Lucky One',
'Everything Has Changed',
'Starlight',
'Begin Again',
'I Knew You Were Trouble (Intro)']
'Mitski', 'Be The Cowboy') get_all_songs_from_album(
['Geyser',
"Why Didn't You Stop Me?",
'Old Friend',
'A Pearl',
'Lonesome Love',
'Remember My Name',
'Me and My Husband',
'Come Into the Water',
'Nobody',
'Pink in the Night',
'A Horse Named Cold Air',
'Washing Machine Heart',
'Blue Light',
'Two Slow Dancers']
28.10 Your Turn!
'#Your Choice of Artist', '#Your Choice of Album') get_all_songs_from_album(
If there is anything wrong, please open an issue on GitHub or email f.pianzola@rug.nl