Day 17: Tuesday, March 7, 2017 Bot Talk¶
Totally endorse this kind of programming/bot:
Take a look at ProPublica’s ElectionBot to see how their “bot” brings in different datasets to tell a bunch of stories:
First Bot Due¶
Let’s aim to have a bot done by next Thursday. If you have many ideas, pick your most “fun” one, or at least the one that doesn’t require a ton of research.
Prepare the data¶
This Thursday, we’ll work in class on data crunching. I’ll have a cheatsheet reviewing the ways we’ve worked with data, plus a few other methods. Basically, I want everyone to be as familiar with their data as possible. Ideally, you’ll be able to write the code that wrangles the data into the insight you need for the final output.
So these 2 things should be your priority:
- If it is a dataset, be able to download it and, ideally, write a script that does the download for you.
- If it is an API, try to authenticate with the API and successfully log on.
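For the dataset case, a download script can be very short. This is only a minimal sketch using Python’s standard library; the function name `download` and the destination path are my own placeholders, and you would pass in the direct link for your own dataset.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, dest):
    """Save the contents of `url` to the local path `dest` and return that path."""
    dest = Path(dest)
    # Create the destination folder if it doesn't exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk so large files don't have to fit in memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as outfile:
        shutil.copyfileobj(response, outfile)
    return dest
```

You would call it as something like `download(direct_link, "data/my-dataset.csv")`, where `direct_link` is whatever direct URL your dataset’s landing page provides.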
This won’t be a graded assignment, but before next class, please send me an email with the following information regarding the data that you plan to use.
For datasets, i.e. files that you download in bulk, such as the [Earthquake Archive](https://earthquake.usgs.gov/earthquakes/search/) or [Washington Post’s Police Shootings](https://github.com/washingtonpost/data-police-shootings), tell me:
- Name of the dataset
- URL for the landing page of the dataset
- URL (if applicable) for the direct link of the dataset
- Attach the script you’ve used to download the data OR tell me what issues you’ve run into.
Unless you’ve run into problems, attach another script you’ve used to calculate all of the applicable metrics:
- How many records total
- Number of columns/attributes per record/row
- The earliest record (if the records have date/times)
- The latest record
- A group count by a column
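Here is one rough way the metrics above could be calculated with just the standard library. The sample data and the column names (`date`, `city`, `category`) are made up for illustration, and the earliest/latest logic assumes dates are stored as `YYYY-MM-DD` text, which sorts correctly as plain strings. Your dataset will likely need its own tweaks.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample standing in for a real downloaded CSV file.
SAMPLE_CSV = """date,city,category
2017-01-15,Oakland,A
2016-11-02,San Francisco,B
2017-03-01,Oakland,A
2016-12-25,San Jose,C
"""


def summarize(csv_file, date_column, group_column):
    """Return basic metrics for an open CSV file that has a header row."""
    rows = list(csv.DictReader(csv_file))
    # Skip blank dates; ISO-formatted dates compare correctly as strings.
    dates = [row[date_column] for row in rows if row[date_column]]
    return {
        "total_records": len(rows),
        "column_count": len(rows[0]) if rows else 0,
        "earliest": min(dates) if dates else None,
        "latest": max(dates) if dates else None,
        "group_counts": Counter(row[group_column] for row in rows),
    }


summary = summarize(io.StringIO(SAMPLE_CSV), "date", "city")
for key, value in summary.items():
    print(key, value)
```

For a real file, you would replace `io.StringIO(SAMPLE_CSV)` with something like `open("data/my-dataset.csv", newline="")`.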
If the data source is an API, tell me:
- Name of API/service
- Landing page for the API
- Endpoints that you’re using, e.g. the NYT API has an Archive and an Article Search endpoint. Make sure you know the difference.
- Let me know if you’ve been able to authenticate
For each endpoint you plan to use, get a copy of the sample data and attach it to the email. For example, if you were going to fetch a user’s timeline, you would provide this endpoint URL:
And you would copy the example JSON they have and save it as a file and send it as an attachment.
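Once you have saved a sample response, it helps to poke at it with a few lines of code before writing anything bigger. The sketch below uses a made-up JSON sample (not any real API’s output) and a hypothetical helper named `describe` to count the records and list the keys they contain.

```python
import json

# Made-up stand-in for a sample response copied from an API's documentation.
SAMPLE_JSON = """
[
  {"id": 1, "text": "hello", "user": {"name": "a"}},
  {"id": 2, "text": "bye", "user": {"name": "b"}}
]
"""


def describe(records):
    """Return the number of records and the sorted keys seen across them."""
    keys = set()
    for record in records:
        keys.update(record.keys())
    return len(records), sorted(keys)


count, keys = describe(json.loads(SAMPLE_JSON))
print(count, "records with keys:", keys)
```

With a saved file, you would load it via `json.load(open("sample.json"))` instead of `json.loads(SAMPLE_JSON)`. Not every endpoint returns a list at the top level, so check the shape of your sample first.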
This sounds like a lot of work but I just want to make sure you know the details behind what you’re trying to do. If you’re trying to do something with Twitter, then you need to know that there is more than one way to fetch data from Twitter.
If you’re going to do something with NYPD stop-and-frisk, then it probably helps to know the files come as ZIP files, so you (with my help if needed) can unzip the data.
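Unzipping can be scripted too. This is a minimal sketch using Python’s built-in `zipfile` module; the function name `list_and_extract` and the folder name in the usage note are placeholders I made up, not anything specific to the NYPD files.

```python
import zipfile


def list_and_extract(zip_source, dest_dir):
    """Extract every file in a ZIP archive into dest_dir and return their names.

    zip_source can be a path to a .zip file or an open binary file object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        names = archive.namelist()
        archive.extractall(dest_dir)
    return names
```

For example, `list_and_extract("downloads/some-file.zip", "data/unzipped")` would unpack the archive and tell you which files were inside, so you know what to open next.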
Sign up for Github, make a repo¶
If you haven’t used Github before and you don’t know where your ~/.ssh directory is, please download the Github Desktop app.
See if you can go through the Hello World guide and make a repo named
Email me at email@example.com with a link to your repo.
OpenDataNetwork is the best search portal for U.S. city data: https://www.opendatanetwork.com
SF OpenData has datasets close to home: https://data.sfgov.org/browse?limitTo=datasets
As does Menlo Park, though not as much: https://data.menlopark.org/browse
You might find something interesting at https://www.reddit.com/r/datasets/
The BuzzFeedNews team has open-sourced the code behind their investigations and their data: https://github.com/BuzzFeedNews/everything
FiveThirtyEight has a massive Github data repo; you might find it helpful to read the stories that accompany the data: https://github.com/fivethirtyeight/data
Immediately interesting datasets¶
If you’re relatively new to data-crunching, I do not recommend picking a dataset that seems important but for which you have neither domain expertise nor first-hand knowledge of what the data represents. Do building permits or lobbying disclosures affect you personally? Not if you’re not a developer or politician, so maybe that’s not the right dataset for this relatively quick project.
But things like earthquakes, getting punched in the face, seeing a mouse in your food – those are all things you might have experienced. A lot easier to deal with the data when you know how it connects to the real world.
Here’s a quick list:
- SF restaurant scores
- Yahoo Finance (play around with the URLs)
- SFPD Crime Incidents
- Worldwide Starbucks locations
- California public salaries
- SF 311 data
- Real Estate prices from Zillow
- Washington Post Police Shootings
- SF Eviction Notices
- Death Penalty Execution Database
- Popular Baby Names
- congress-legislators: useful data for each Congressmember, including social media and contact info.
These datasets are definitely important, but they might be too much to grok in a single week. You might want to save them for the next bot (or some other class):
- Federal Campaign Finance http://www.fec.gov/finance/disclosure/ftp_download.shtml
- Tons of Congress bill and action data: https://github.com/unitedstates/congress/wiki
- Census https://www.census.gov/developers/ - (Use https://censusreporter.org/ to explore what the Census contains)
- [IRS Taxes by Zip Code](https://www.irs.gov/uac/SOI-Tax-Stats-Individual-Income-Tax-Statistics-ZIP-Code-Data-(SOI))
- Fatality Analysis Reporting System (FARS) https://www.nhtsa.gov/research-data/fatality-analysis-reporting-system-fars
- FDA Pillbox data https://pillbox.nlm.nih.gov/developer.html
- FDA Adverse Event Reporting System https://open.fda.gov/data/faers/
- OpenPayments (to Doctors) https://openpaymentsdata.cms.gov/search
- OSHA enforcement data: http://ogesdw.dol.gov/views/data_summary.php
- Ohio voter registration https://www6.sos.state.oh.us/ords/f?p=111:1
Need an idea? Feel free to use these really cool ideas from me. Or adapt them to your own liking:
- Name popularity: using the Social Security Administration baby name data, and given the year of a user’s birth and the user’s name, present data about how popular their name is.
- Which baby name: the user specifies gender, “edginess”, and which letters to include; the bot returns the most relevant name.
- Stanford free food, using at least 2 different department calendars. User input is a date, bot returns food for that week.
- The next Texas/Florida inmate to die (see the Marshall Project)
- Some kind of Starbucks index, e.g. number of Starbucks in a 10 mile radius from a location, and how good the schools are.
- For a given location in SF, the number of reported violent crimes in the past 2 weeks.
- Given a Congressmember’s twitter handle, a histogram analysis of their tweeting activity.
- Using the NYT article search API, return some kind of summation of today’s news (number of articles, number of Trump articles, words used, gender ratio of authors)
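To give a sense of how small the core of one of these ideas can be, here is a rough sketch of the name popularity lookup. It assumes the Social Security Administration’s per-year layout of `name,sex,count` on each line with no header row; the counts in the sample are made up, and `name_rank` is just a name I picked.

```python
import csv
import io

# Made-up counts in the SSA per-year layout: name, sex, number of births.
SAMPLE_YEAR = """Emma,F,300
Olivia,F,250
Ava,F,250
Noah,M,400
Emma,M,5
"""


def name_rank(lines, name, sex):
    """Return (rank, count) for a name among babies of one sex, or None."""
    counts = {}
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        row_name, row_sex, row_count = row
        if row_sex == sex:
            counts[row_name] = int(row_count)
    if name not in counts:
        return None
    # Rank 1 is the most common name; names with equal counts share a rank.
    rank = 1 + sum(1 for c in counts.values() if c > counts[name])
    return rank, counts[name]


print(name_rank(io.StringIO(SAMPLE_YEAR), "Olivia", "F"))
```

A real bot would open the file for the user’s birth year instead of `SAMPLE_YEAR`, and would probably turn the result into a sentence rather than printing a tuple.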
Of course, ask for my help if you’re unsure how to authenticate with an API. But if you’re still a little shaky with data, you might want to skip the big data players for now:
- Twitter https://dev.twitter.com/rest/public
- NYTimes https://developer.nytimes.com/
- Instagram https://www.instagram.com/developer/
- Reddit https://www.reddit.com/dev/api/
- ProPublica’s Congress API: https://propublica.github.io/congress-api-docs/
- Google Static Maps: https://developers.google.com/maps/documentation/static-maps/intro
- Google Street View https://developers.google.com/maps/documentation/streetview/