Command Line Earthquake Magnitude Frequency Count

An example real-world data wrangling problem that can be solved by chaining together Unix text-processing tools using the Unix pipe.

Also, there’s some practice with doing work from a remote computer, i.e. Stanford Farmshare.


Due date:
1:00 PM, Day 4: Thursday, January 19, 2017 - Theory of the Pipe
Points  Metric
1       Having a correct subject headline
1       Your email contains an attachment named
1       Running your script produces the correct output


To do this assignment – assuming you don’t want to go through the work of customizing your own personal laptop at the moment – you’ll need to use either the McClatchy Lab iMacs or the Stanford Farmshare computers:

SSHing into Stanford Farmshare (Cardinal/Corn)

If you’re feeling like you don’t know where to start, that’s OK: you’re being thrown into a real-world data problem, with the hope that just doing the work will make the Unix/shell/command-line concepts clearer than just reading about them.

Here’s a short ad-hoc guide (including a section on how to create a “Bash script”):

The abridged guide to shell scripting and the command line

The rest of the instructions for this problem are covered in a walkthrough at the end of this page. Read on!

Delivery format

Send an email to with the subject:


Where your_sunet_id is your Stanford student ID, all-lowercase.

The body of the email should have a single line of text:

Hello Mutt!

And the email should contain an attachment named

And you should use the program Mutt, which is described later in this section.

The Work

The data and expected output

The USGS has a list of real-time data feeds:

For this assignment, I want you to summarize the very big month-long earthquakes data file (8000+ quakes) into a group count of earthquakes by magnitude, truncated to the integer; e.g. an earthquake of M5.8 falls into the “5” group.

Given this sample feed file, which you can download – write a script of commands that downloads the data and processes it to produce output that looks like this:


This problem can be solved with these tools:

  • ack
  • sort
  • uniq
  • csvcut
  • curl
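Before tackling the real feed, it can help to see the grouping idea on toy data. Here’s a hedged sketch using printf for a few made-up magnitudes, with grep -E standing in for ack (the logic is the same):

```shell
# Five fabricated magnitudes; grep -oE keeps just the integer part of each,
# then sort + uniq -c produces the group count
printf '5.8\n5.1\n4.2\n4.9\n1.0\n' | grep -oE '^[0-9]+' | sort | uniq -c
```

The sort matters: uniq -c only collapses *adjacent* duplicate lines, so unsorted input would give fragmented counts.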

Name the script and mail it to me as an attachment using mutt.

You should produce a script named which, when I run it on my own computer, downloads the data and processes it.

How to make a Bash/shell script

See further reading here: The abridged guide to shell scripting and the command line

But what is a “Bash/shell script”? It’s a sequence of commands that, instead of running one-by-one manually, you list in a text file (the “script”) so that you can execute the commands in a nice, automated batch.

If you named your shell script, here’s how you would run it:

$ bash

What does that do? It depends on what is in

For example, if it contains these commands:

echo 'hello'
echo 'world'

Running the script produces this output:

hello
world

For this assignment, one of the requirements in the output is that the first line is, literally, this text:


So, create a script file (again, just a plaintext file) and name it

Then include this command as the first line:

echo 'mag,count'

The walkthrough at the end of this page describes the steps you’ll want to include...

Remote operation: sending email via mutt

Let’s pretend you’ve finished and you’re ready to send it to me as an attachment.

This section contains the details of how to send an email from the command line, with an attachment. Yes, I know you know how to use an email client. Bear with me on learning how to do it from the command-line.

Use mutt to send the email

I want you to use the mutt program, which is available on Stanford Farmshare (corn and cardinal machines). And, mutt can be operated like a standard Unix tool.

This StackOverflow question has the gist of things: How do I send a file as an email attachment using Linux command line?

To send an email that says Hello in the subject, Que? in the body, to an email address (replace with your own Stanford email address for now):

$ echo 'Que?' | mutt -s 'Hello'

To send a file:

$ echo 'A file is attached to this email' |  mutt -s 'Test 2' -a filename.etc --

Note the double-hyphens, which separate the name of the attached file from the recipient. (This is the standard Unix convention for marking the end of a command’s options, so mutt knows the attachment list has ended.)

If you don’t have a file on Farmshare to send, let’s download one (the URL shown below goes to a copy of a kitten photo originally posted on Wikipedia):

$ curl > kitten.jpg

$ echo 'This email does not contain a dog' | mutt -s 'Do you like dogs?' -a kitten.jpg --

Do NOT do this for this assignment, but if you wanted to send this same email to multiple people...

(the code snippet below uses the backslash convention to split up a long command into multiple lines)

$ echo 'This email does not contain a dog' \
    | mutt -s 'Do you like dogs?' -a kitten.jpg \

The answer



In the original description of the problem, the expected output excluded earthquakes of 0 magnitude; the example answers here don’t filter those out.

Many ways to go about this, but here’s what I have:

curl \
    > quakes.csv

echo 'mag,count'
csvcut -c 5 quakes.csv \
    | ack -o '\d+\.' \
    | ack -o '\d+'    \
    | sort | uniq -c \
    | ack '(\d+)\s+(\d+)' --output '$2,$1'
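To trace what each stage of that pipeline does without hitting the network, here’s a hedged sketch on a few fabricated magnitude values, using grep -E and awk as portable stand-ins for ack and its --output flag:

```shell
echo 'mag,count'
# Fabricated magnitude values, as csvcut -c 5 might emit them (header included)
printf 'mag\n5.8\n5.1\n4.2\n4.9\n' \
    | grep -oE '[0-9]+\.' \
    | grep -oE '[0-9]+' \
    | sort | uniq -c \
    | awk '{print $2 "," $1}'
# prints:
# mag,count
# 4,2
# 5,2
```

The first grep keeps digits-plus-dot (which also drops the `mag` header line, since it has no digits), the second strips the dot, and awk swaps uniq’s `count value` order into `value,count`.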

There are ways to be slicker about it and do it in fewer steps. The following example skips saving the data to an intermediate quakes.csv file and just feeds the output of curl right into csvcut. I also use a lookahead in the regex just to be fancy:

echo 'mag,count'

curl -s \
    | csvcut -c 5 \
    | ack -o '\d+(?=\.)' \
    | sort | uniq -c \
    | ack '(\d+)\s+(\d+)' --output '$2,$1'
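If ack isn’t handy and you just want to poke at the lookahead itself, GNU grep’s -P mode accepts the same Perl-style regex (this assumes a GNU system, such as Farmshare; the stock macOS grep lacks -P):

```shell
# '\d+(?=\.)' matches digits only when a literal dot follows,
# without including the dot in the match itself
printf '5.8\n12.0\n' | grep -oP '\d+(?=\.)'
# prints:
# 5
# 12
```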

Sample problem: doing a time-series of Trump/Clinton tweets

Let’s do a problem that is nearly the same, just with a different dataset: a count of Trump and/or Clinton tweets by day, by month, and by hour.

The data

The data layout has 4 columns. The first 3 are the metadata of each tweet:

ID                  Posted at                  Screen name
815449933453127681  2017-01-01 06:49:49 +0000  realDonaldTrump
815449868739211265  2017-01-01 06:49:33 +0000  realDonaldTrump
815433444591304704  2017-01-01 05:44:17 +0000  realDonaldTrump
815433217595547648  2017-01-01 05:43:23 +0000  realDonaldTrump
815432169464197121  2017-01-01 05:39:13 +0000  realDonaldTrump

The fourth column, Text, is the actual text of the tweet (which can include emoji):

RT @IvankaTrump: 2016 has been one of the most eventful and exciting years of my life. I wish you peace, joy, love and laughter. Hap…
RT @DonaldJTrumpJr: Happy new year everyone. #newyear #family #vacation #familytime
RT @EricTrump: 2016 was such an incredible year for our entire family! My beautiful wife @LaraLeaTrump made it even better! 🇺🇸🇺🇸
RT @Reince: Happy New Year + God’s blessings to you all. Looking forward to incredible things in 2017! @realDonaldTrump will Make America Great Again!

RT @DanScavino: On behalf of our next #POTUS & @TeamTrump-

#HappyNewYear AMERICA🇺🇸

Note: To do a time-series analysis, we don’t really care about the Text column.


Note: these steps should be done from either the McClatchy iMacs or from Stanford Farmshare.

Step 1. Download the file

Let’s just do Trump for now. The direct URL to download his tweet data is:

You should already know how to download that data with the browser. Here’s how to do it via the command-line:

$ curl

Step 2. Actually save the data to a local file

That should have dumped a bunch of text onto your screen, i.e. “standard out”. We don’t want that. We would rather save the contents of that URL into a local text file.

Let’s call it trumptweets.csv

$ curl > trumptweets.csv

What’s in that file? If you downloaded it to your own computer, try opening it in Excel.

Or, if you want to print the first 10 lines to standard output (i.e. your screen), try this:

$ head -n 10 trumptweets.csv
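If you’re not sure what head is doing, here’s a hedged sketch with seq standing in for the CSV file:

```shell
# head -n N passes through only the first N lines of its input
seq 1 100 | head -n 3
# prints:
# 1
# 2
# 3
```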

Step 3. Use csvcut to filter the data file by a single column

I apologize for not having a thorough writeup on the csvcut tool, though it is pretty easy to figure out (which we will do in class). But here’s a nice example from

Here are the 2 options we want to try:

$ csvcut -n trumptweets.csv

And then:

$ csvcut -c 'Posted at' trumptweets.csv

Step 4. Filter for patterns

Now, use ack to apply regex filtering.

Guide: ack - better than grep

Think of the pattern needed to match the “year-month” of the timestamp. Or, just the hour.
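As a hedged sketch of what those patterns might look like (using grep -E in place of ack, on fabricated timestamps):

```shell
# 'year-month' is four digits, a dash, two digits, anchored at line start
printf '2017-01-01 06:49:49 +0000\n2016-12-31 23:59:01 +0000\n' \
    | grep -oE '^[0-9]{4}-[0-9]{2}'
# prints:
# 2017-01
# 2016-12

# ...or just the hour: two digits sandwiched between a space and a colon
printf '2017-01-01 06:49:49 +0000\n' | grep -oE ' [0-9]{2}:' | grep -oE '[0-9]{2}'
# prints:
# 06
```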

Step 5. Do a group count

Guide: sort - sort lines in a stream

How does this apply to the homework? Here’s what the shell script for counting Trump tweets per day might look like:

echo 'day,tweet_count'
curl > trumptweets.csv
csvcut -c 'Posted at' trumptweets.csv \
    | ack -o '\d{4}-\d{2}-\d{2}' \
    | sort \
    | uniq -c \
    | sort -rn
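To see that script’s counting stages in isolation, here’s a hedged sketch on three fabricated “Posted at” values (grep -E standing in for ack, no download needed):

```shell
# Two tweets on one day, one on another; sort -rn puts the busiest day first
printf '2017-01-01 06:49:49\n2017-01-01 05:44:17\n2016-12-31 23:59:01\n' \
    | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' \
    | sort | uniq -c | sort -rn
# the top line is the busiest day: count 2 for 2017-01-01
```

The first sort groups identical dates together so uniq -c can count them; the final sort -rn re-orders those counts numerically, highest first.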