Day 10: Thursday, February 9, 2017 - Web Scraping 2¶
Understanding the web and the web inspector¶
Whoops, didn’t get to really talk about this much...
Pretend we had a big quiz in class today. It would look a lot like this:
Hello Data De-Serialization with JSON and CSV
Pretend we had a take-home midterm that required you to study all weekend for:
(once again, I don’t have every assignment prepared – so just do the hello-data warmup, and then the existing serialization exercises. I will tell you when the other exercises are ready)
Why web-scraping (and real-world data gathering) is hard¶
In reference to the Texas execution list homework: Web-scraping the Texas Executed Offenders List
There was one assignment that asked you to go through every webpage of last words to determine how many inmates said something religious-sounding before dying.
The main point of that exercise was to to give you practice in being able to write a program to handle a very boring, tedious, and miniscule part of that ambition: reading through the 500+ webpages and seeing if “God” existed in the text.
So, if you could do this, you’re in decent standing:
Because you gathered up all the things we need to parse/read in a single Python list. Now we just have to write functions that deal with one piece of that list at a time. And that’s the powerful automation that makes the difference between an impossible and an easy project.
So what were the hard parts?
Basically, whatever function/test you want to use to determined whether someone said something religious. First, you have to come up with a heuristic of what makes someone sound religiously faithful. Is it anyone who mentions “God”, or does it have to be a specific phrase? And how many words for God are there, or devoute phrases for that matter?
So that’s hard enough of a problem. Then you have the problem of the last words pages not being the literal last words of each inmate.
For example, some people tried to target the specific paragraph for last words:
def last_words_els(rawhtml): soup = BeautifulSoup(rawhtml, 'lxml') lastwords_text = el_soup.select('p').text return lastwords_text
Unfortunately, not all pages have the same structure. Some Last Statements are longer than 2 paragraphs. Others have fewer paragraphs.
from bs4 import BeautifulSoup from urllib.parse import urljoin import requests SRC_URL = 'http://wgetsnaps.github.io/tdcj-executed-offenders/death_row/dr_executed_offenders.html' def extract_last_words_url(inmate_tr): tds = inmate_tr.select('td') lastword_td = tds link = lastword_td.select('a') href = link.attrs['href'] return urljoin(SRC_URL, href) def has_religious_words(txtstr): RELIGIOUSWORDS = ['God', 'Lord', 'Savior', 'Soul', 'Allah', 'Prophet', 'Heaven', 'Hell'] for word in RELIGIOUSWORDS: if word in txtstr: return True else: pass # keep on going # if for loop ends, that means all # religiouswords were tested return False def fetch_inmate_rows(): html = requests.get(SRC_URL).text soup = BeautifulSoup(html, 'lxml') inmate_rows = soup.find_all('tr')[1:] return inmate_rows for row in fetch_inmate_rows(): lastwords_url = extract_last_words_url(row) cols = row.find_all('td') if 'no_last_statement' not in lastwords_url: # fetch page lastwordsresp = requests.get(lastwords_url) txt = lastwordsresp.text if has_religious_words(txt): print(cols.text, cols.text, 'is religious:', lastwords_url)