Day 10: Thursday, February 9, 2017 - Web Scraping 2

Understanding the web and the web inspector

Whoops, didn’t get to really talk about this much...


Pretend we had a big quiz in class today. It would look a lot like this:

Hello Data De-Serialization with JSON and CSV

Pretend we had a take-home midterm that required you to study all weekend for:

Solid Serialization Skills

(once again, I don’t have every assignment prepared – so just do the hello-data warmup, and then the existing serialization exercises. I will tell you when the other exercises are ready)

Why web-scraping (and real-world data gathering) is hard

In reference to the Texas execution list homework: Web-scraping the Texas Executed Offenders List

There was one assignment that asked you to go through every webpage of last words to determine how many inmates said something religious-sounding before dying.

The main point of that exercise was to give you practice in writing a program to handle a very boring, tedious, and minuscule part of that ambition: reading through the 500+ webpages and seeing if “God” existed in the text.

So, if you could do this, you’re in decent standing:

Because you gathered up all the things we need to parse/read in a single Python list. Now we just have to write functions that deal with one piece of that list at a time. And that’s the powerful automation that makes the difference between an impossible and an easy project.
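Here is a minimal sketch of that pattern with made-up data. The list `pages` and the function `mentions_god` are hypothetical stand-ins for the downloaded last-words pages and for whatever test you end up writing:

```python
# Hypothetical stand-in for the text of the downloaded pages;
# in the real exercise this list would hold 500+ strings.
pages = [
    "Last Statement: I thank God for my family.",
    "Last Statement: I am sorry for what I did.",
    "Last Statement: God bless you all.",
]

def mentions_god(text):
    """Deal with one piece of the list: does this text contain 'God'?"""
    return "God" in text

# Apply the function to every item in the list, one at a time.
results = [mentions_god(p) for p in pages]
god_count = sum(results)
print(god_count, "of", len(pages), "pages mention God")
```

Once everything is in a list, swapping in a smarter test only means changing the one function.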

So what were the hard parts?

Basically, whatever function/test you want to use to determine whether someone said something religious. First, you have to come up with a heuristic of what makes someone sound religiously faithful. Is it anyone who mentions “God”, or does it have to be a specific phrase? And how many words for God are there, or devout phrases for that matter?
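To make one piece of that concrete: even a simple word-list heuristic has to decide about capitalization and about words that merely contain a religious word. The sketch below is one possible approach, not the assignment's answer. The word list and the function name `find_religious_terms` are assumptions chosen for illustration:

```python
import re

# Assumed word list for illustration; choosing it is the hard, subjective part.
RELIGIOUS_TERMS = ["god", "lord", "savior", "allah", "heaven"]

# \b marks a word boundary, so "god" matches "God" but not "Godfrey".
PATTERN = re.compile(r"\b(" + "|".join(RELIGIOUS_TERMS) + r")\b", re.IGNORECASE)

def find_religious_terms(text):
    """Return the religious terms found in text, lowercased, in order of appearance."""
    return [m.group(1).lower() for m in PATTERN.finditer(text)]

print(find_religious_terms("Thank you, Lord. GOD bless."))  # ['lord', 'god']
print(find_religious_terms("Mr. Godfrey said hello."))      # []
```

Returning the matched terms, rather than just True or False, makes it easier to eyeball whether the heuristic is catching what you meant it to catch.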

So that’s hard enough of a problem. Then you have the problem of the last words pages not being the literal last words of each inmate.

For example, some people tried to target the specific paragraph for last words:

def last_words_els(rawhtml):
    soup = BeautifulSoup(rawhtml, 'lxml')
    # assumes the last statement is always the 7th <p> element on the page
    lastwords_text = soup.find_all('p')[6].text
    return lastwords_text

Unfortunately, not all pages have the same structure, so a fixed index does not always land on the last statement. Some Last Statements are longer than 2 paragraphs. Other pages have fewer paragraphs overall.
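One way around a hard-coded index is to look for a label paragraph and take everything after it. The sketch below assumes each page has a paragraph containing the text "Last Statement" immediately before the statement itself; that is an assumption about the page layout, and the function name `extract_last_statement` is made up for this example. It works on a plain list of paragraph strings, such as `[p.text for p in soup.find_all('p')]`:

```python
def extract_last_statement(paragraphs, label="Last Statement"):
    """Return the text of every paragraph after the one containing label.

    paragraphs: list of strings, one per <p> element on the page.
    Returns None if no paragraph contains the label.
    """
    for i, para in enumerate(paragraphs):
        if label.lower() in para.lower():
            # Keep all non-empty paragraphs that follow the label,
            # however many there happen to be on this page.
            rest = [p.strip() for p in paragraphs[i + 1:] if p.strip()]
            return "\n".join(rest)
    return None

# Made-up paragraph lists standing in for two differently shaped pages.
short_page = ["Date of Execution:", "Offender:", "Last Statement:", "I love you all."]
long_page = ["Offender:", "Last Statement:", "First paragraph.", "", "Second paragraph."]

print(extract_last_statement(short_page))
print(extract_last_statement(long_page))
```

Returning None when the label is missing lets the calling code notice and log the pages that do not fit the assumed layout instead of silently reading the wrong paragraph.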

from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

SRC_URL = ''

def extract_last_words_url(inmate_tr):
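    tds = inmate_tr.find_all('td')
    # the third column is assumed to hold the link to the last-words page
    lastword_td = tds[2]
    link = lastword_td.find_all('a')[0]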
    href = link.attrs['href']

    return urljoin(SRC_URL, href)

def has_religious_words(txtstr):
    RELIGIOUSWORDS = ['God', 'Lord', 'Savior', 'Soul', 'Allah', 'Prophet', 'Heaven', 'Hell']
    for word in RELIGIOUSWORDS:
        if word in txtstr:
            return True
        # otherwise, keep on going to the next word
    # if the for loop ends, that means every word in
    # RELIGIOUSWORDS was tested and none matched
    return False

def fetch_inmate_rows():
    html = requests.get(SRC_URL).text
    soup = BeautifulSoup(html, 'lxml')
    inmate_rows = soup.find_all('tr')[1:]
    return inmate_rows

for row in fetch_inmate_rows():
    lastwords_url = extract_last_words_url(row)
    cols = row.find_all('td')
    if 'no_last_statement' not in lastwords_url:
        # fetch page
        lastwordsresp = requests.get(lastwords_url)
        txt = lastwordsresp.text
        if has_religious_words(txt):
            print(cols[4].text, cols[3].text, 'is religious:', lastwords_url)