Command Line Earthquake Magnitude Frequency Count¶
An example real-world data wrangling problem that can be solved by chaining together Unix text-processing tools using the Unix pipe.
Also, there’s some practice with trying to do things from a remote computer, i.e. Stanford Farmshare.
Rubric¶
- Due date:
- 1:00 PM, Day 4: Thursday, January 19, 2017 - Theory of the Pipe
Points | Metric |
---|---|
1 | Having a correct subject headline |
1 | Your email contains an attachment named answer.sh |
1 | Running your answer.sh script produces the correct output |
Note
To do this assignment, and assuming you don’t want to go through the work of customizing your own personal laptop at the moment – you’ll need to use either the McClatchy Lab iMacs. Or, the Stanford Farmshare computers:
SSHing into Stanford Farmshare (Cardinal/Corn)
If you’re feeling like you don’t know where to start, that’s OK: you’re being thrown into a real-world data problem, with the hope that just doing the work will make the Unix/shell/command-line concepts clearer than just reading about them.
Here’s a short ad-hoc guide (including a section on how to create a “Bash script”):
The abridged guide to shell scripting and the command line
The rest of the instructions on how to do this problem is covered in a walkthrough at the end of this page. Read on!
Delivery format¶
Send an email to dun@stanford.edu with the subject:
compciv-2017::your_sunet_id::quake-mags
Where your_sunet_id
is your Stanford student ID, all-lowercase.
The body of the email should have a single line of text:
Hello Mutt!
And the email should contain an attachment named answers.zip
And you should use the program Mutt, which is described later in this section
The Work¶
The data and expected output¶
The USGS has a list of real-time data feeds:
http://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php
For this assignment, I want you to summarize the very big month-long earthquakes data file (8000+ quakes) into a group count of earthquakes by magnitude, truncated to the integer; i.e. an earthquake of M5.8 is in the “5” group.
Given this sample feed file, which you can download
– write a script of commands that downloads the data and processes it to produce output that looks like this:
mag,count
1,3863
2,924
3,160
4,408
5,116
6,7
7,2
This problem can be solved with these tools:
- ack
- sort
- uniq
- csvcut
- curl
Name the script answer.sh
and mail it to me as an attachment using mutt.
You should produce a script named answer.sh
which, when I run it on my own computer, does the downloading of data and the processing of it.
How to make a Bash/shell script¶
See further reading here: The abridged guide to shell scripting and the command line
But what is a “Bash/shell script”? It’s a sequence of commands that, instead of running one-by-one manually, you list in a text file (the “script”) so that you can execute the commands in a nice, automated batch.
If you named your shell script, myfunscript.sh
, here’s how you would run it:
$ bash myfunscript.sh
What does that do? It depends what is in myfunscript.sh
.
For example, if it contains these commands:
echo 'hello'
echo 'world'
Running the script produces this output:
hello
world
For this assignment, one of the requirements in the output is that the first line is, literally, this text:
mag,count
So, create a script file (again, just a plaintext file) and name it answer.sh
.
Then include this command as the first line:
echo 'mag,count'
The walkthrough at the end of this page describes the steps you’ll want to include...
Remote operation: sending email via mutt¶
Let’s pretend you finish answer.sh
and you’re ready to send it to me as an attachment.
This section contains the details of how to send an email from the command line, with an attachment. Yes, I know you know how to use an email client. Bear with me on learning how to do it from the command-line.
Use mutt to send the email¶
I want you to use the mutt
program, which is available on Stanford Farmshare (corn and cardinal machines). And, mutt
can be operated like a standard Unix tool.
This StackOverflow question has the gist of things: How do I send a file as an email attachment using Linux command line?
To send an email that says Hello
in the subject, Que?
in the body, to an email address (replace YOUR_ID@stanford.edu with your own Stanford email address for now):
$ echo 'Que?' | mutt -s 'Hello' YOUR_ID@stanford.edu
To send a file:
$ echo 'A file is attached to this email' | mutt -s 'Test 2' -a filename.etc -- YOUR_ID@stanford.edu
Note the double-hyphens which separate the name of the attached file from the recipient. (I dunno why mutt
uses that convention)
If you don’t have a file on Farmshare to send, let’s download one (the URL shown below goes to a copy of a kitten photo originally posted on Wikipedia):
$ curl http://i.imgur.com/8Jr7bLX.jpg > kitten.jpg
$ echo 'This email does not contain a dog' | mutt -s 'Do you like dogs?' -a kitten.jpg -- YOUR_ID@stanford.edu
Do NOT do this for this assignment, but if you wanted to send this same email to multiple people...
(the code snippet below uses the backslash convention to split up a long command into multiple lines)
$ echo 'This email does not contain a dog' \
| mutt -s 'Do you like dogs?' -a kitten.jpg \
-- person1@email.com person2@email.com
The answer¶
Note
Note
In the original description of the problem, the expected output excluded earthquakes of 0
magnitude; the example answers here don’t filter those out.
Many ways to go about this, but here’s what I have:
curl http://2017.compciv.org/_downloads/usgs_earthquake_feed_all_month.csv \
> quakes.csv
echo 'mag,count'
csvcut -c 5 quakes.csv \
| ack -o '\d+\.' \
| ack -o '\d+' \
| sort | uniq -c \
| ack '(\d+)\s+(\d+)' --output '$2,$1'
There are ways to be slicker about it and do it in fewer steps. The following example skips the saving of the data to an intermediary quakes.csv
and just feeds the output of curl
right into csvcut
. I also use a lookahead in the regex just to be fancy:
echo 'mag,count'
curl -s http://2017.compciv.org/_downloads/usgs_earthquake_feed_all_month.csv \
| csvcut -c 5 \
| ack -o '\d+(?=\.)' \
| sort | uniq -c \
| ack '(\d+)\s+(\d+)' --output '$2,$1'
Sample problem: doing a time-series of Trump/Clinton tweets¶
Let’s do a problem that is nearly the same thing, just different dataset: Let’s do a count of Trump and/or Clinton tweets by day, month, and hour.
The data¶
Note
Downloads
The data layout has 4 columns. The first 3 are the metadata of each tweet:
ID Posted at Screen name 815449933453127681 2017-01-01 06:49:49 +0000 realDonaldTrump 815449868739211265 2017-01-01 06:49:33 +0000 realDonaldTrump 815433444591304704 2017-01-01 05:44:17 +0000 realDonaldTrump 815433217595547648 2017-01-01 05:43:23 +0000 realDonaldTrump 815432169464197121 2017-01-01 05:39:13 +0000 realDonaldTrump
The fourth column, Text
, is the actual text of the tweet (which can include emoji):
Text RT @IvankaTrump: 2016 has been one of the most eventful and exciting years of my life. I wish you peace, joy, love and laughter. Hap… https://t.co/A1I3tvTySZ RT @DonaldJTrumpJr: Happy new year everyone. #newyear #family #vacation #familytime https://t.co/u9fJIKNoZq RT @EricTrump: 2016 was such an incredible year for our entire family! My beautiful wife @LaraLeaTrump made it even better! 🇺🇸🇺🇸 https://t.co/M0SuRGn0il RT @Reince: Happy New Year + God’s blessings to you all. Looking forward to incredible things in 2017! @realDonaldTrump will Make America Great Again! RT @DanScavino: On behalf of our next #POTUS & @TeamTrump-
#HappyNewYear AMERICA🇺🇸
Note: To do a time-series analysis, we don’t really care about the Text
column.
Walkthrough¶
Note: these steps should be done from either the McClatchy iMacs. Or from Stanford Farmshare.
Step 1. Download the file¶
Let’s just do Trump for now. The direct URL to download his Trump data is:
https://2017.compciv.org/_downloads/realdonaldtrump-tweets.csv
You should already know how to download that data with the browser. Here’s how to do it via the command-line:
$ curl https://2017.compciv.org/_downloads/realdonaldtrump-tweets.csv
Step 2. Actually save the data to a local file¶
That should have dumped a bunch of text onto your screen, i.e. “standard out”. We don’t want that. We would rather save the contents of that URL into a local text file.
Let’s call it trumptweets.csv
$ curl https://2017.compciv.org/_downloads/realdonaldtrump-tweets.csv > trumptweets.csv
What’s in that file? If you downloaded it to your own computer, try opening it in Excel.
Or, if you want to print the first 10 lines to standard output (i.e. your screen), try this:
$ head -n 10 trumptweets.csv
Step 3. Use csvcut to filter the data file by a single column¶
I apologize for not having a thorough writeup on the csvcut
tool, though it is pretty easy to figure out (which we will do in class). But here’s a nice example from dataquest.io.
Here are the 2 options we want to try:
$ csvcut -n trumptweets.csv
And then:
$ csvcut -c 'Posted at' trumptweets.csv
Step 4. Filter for patterns¶
Now, use ack
to apply regex filtering.
Guide: ack - better than grep
Think of the pattern needed to match the “year-month” of the timestamp. Or, just the hour.
Step 5. Do a group count¶
Guide: sort - sort lines in a stream
How does this apply to the homework. Here’s what the shell script for counting Trump tweets per day might look like:
echo 'day,tweet_count'
curl https://2017.compciv.org/_downloads/realdonaldtrump-tweets.csv > trumptweets.csv
csvcut -c 'Posted at' trumptweets.csv \
| ack -o '\d{4}-\d{2}-\d{2}' \
| sort \
| uniq -c \
| sort -rn