Hello Data De-Serialization with JSON and CSV
Just in case you need more practice with the concept of converting a string of text into Python data objects such as dictionaries and lists, here are 16 exercises involving a very trivial, nonsensical dataset that has been serialized into JSON and CSV.
Contents
- Hello Data De-Serialization with JSON and CSV
- What you should know
- The questions
- 1. In the metadata, what is the value for the status key?
- 2. How many inventory records are there?
- 3. Get an alphabetized list of the names of each inventory item
- 4. Get a sum of all the inventory item counts
- 5. Filter inventory for the “animal” items, and create ordered list of [‘animalname’, animalcount] pairs
- 6. Filter inventory for “fruit” items, create ordered list of trimmed-dictionary objects
- 7. Do a group count of inventory items by their “type”
- 8. Do a summation of item counts, grouped by “type”
- Information and Hints
- Solutions
What you should know
About the data
Don’t try to make real-world sense of the data: it’s too dumb and simple to describe in real-world terms. Just care about the structure of these two files:
http://stash.compciv.org/2017/helloworld.json
http://stash.compciv.org/2017/helloworld.csv
Note: in order to make it so both text files conveyed roughly the same information, I deliberately made the CSV file, well, not a CSV file by throwing in unstructured text at the top of the file. This is actually something you’ll see in real-world datasets, where a dataset owner will insert text meant as metadata, such as a copyright notice or contact address, which will cause CSV-parsing programs such as Excel to think that the actual “data” is messed up.
So how to get around this? Remember that a CSV text file, when opened and read, is just a plain Python string. Are there parts of that string that are irrelevant to what you want to send to the CSV parser, i.e. csv.reader()? Then don’t send those parts of the string.
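Here’s a minimal sketch of that idea. It assumes the text has the same shape as helloworld.csv (shown later on this page), and it treats the line of --- as the boundary between the free-form text and the real CSV, which is my assumption about this one file. The variable names csvtext, separator_index, and datalines are made up for illustration. The text is pasted in as a string so the example runs without downloading anything.

```python
import csv

# A copy of the helloworld.csv text, embedded so no download is needed
csvtext = """status:SUPER!
hello:world
---
name,type,count
dogs,animal,42
apples,fruit,300
kiwis,fruit,76
zebras,animal,180
cats,animal,9
"""

lines = csvtext.splitlines()

# Assumption: everything up to and including the '---' line is not CSV
separator_index = lines.index('---')
datalines = lines[separator_index + 1:]

# Only the real CSV lines (header row plus records) reach the parser
records = list(csv.DictReader(datalines))
first_record = dict(records[0])
print(len(records), first_record)
```

Slicing by a fixed position, e.g. lines[3:], works just as well for this file; searching for the separator only saves you from counting lines by hand.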
About Python and data formats
You should be familiar enough with the csv and json built-in libraries and their methods for deserializing text strings into data objects:
import csv
import json
data = csv.reader(csvtext.splitlines())
# or...
data = csv.DictReader(csvtext.splitlines())
# or...
data = json.loads(jsontext)
# or...
data = json.load(jsonfile)  # jsonfile must be an open file object, not a filename string
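To make the differences concrete, here’s a small sketch using made-up text (csvtext and jsontext below are stand-ins, not the real data files). Note what type each parser gives back, and that the CSV parsers leave every value as a string. io.StringIO is used only to imitate an open file without touching the disk.

```python
import csv
import io
import json

# Made-up sample text for illustration only
csvtext = "name,count\ndogs,42\ncats,9\n"
jsontext = '[{"name": "dogs", "count": 42}, {"name": "cats", "count": 9}]'

# csv.reader yields each row as a list of strings, header row included
rows = list(csv.reader(csvtext.splitlines()))

# csv.DictReader uses the header row as keys; the values are still strings
records = list(csv.DictReader(csvtext.splitlines()))

# json.loads takes a string; JSON numbers come back as Python numbers
from_string = json.loads(jsontext)

# json.load takes a file-like object rather than a string or a filename
from_file = json.load(io.StringIO(jsontext))

print(rows[1])                    # ['dogs', '42']
print(records[0]['count'])        # the string '42'
print(from_string[0]['count'])    # the integer 42
```

This is why the CSV versions of the exercises need an int() call before doing any arithmetic on the count values.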
And you should know how to tell the difference between a Python list and a Python dict, and how to get around the internals of their respective structures.
Particularly:
- That lists are “zero-indexed”
- What KeyError and IndexError mean.
- The difference between mydict['somekey'] and mydict.get('somekey') (assuming mydict is a dictionary)
- The very important difference between mydict[5] and mydict['5'].
- The difference between mydict.keys(), mydict.values(), and mydict.items()
- The difference between mylist.append(5) and mylist.append([5]), assuming mylist is a list
- The difference between mylist.append([5]) and mylist.extend([5]), assuming mylist is a list
And, of course, how to create a sorted copy of a dictionary or list, sorted by any key/field you want.
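If any of those points feel shaky, here’s a quick sketch you can paste into ipython. The values in mydict, mylist, and records are invented for illustration.

```python
mydict = {'a': 1, '5': 'five'}
mylist = [10, 20, 30]

first = mylist[0]                 # 10, because lists are zero-indexed
missing = mydict.get('nope')      # None; mydict['nope'] would raise KeyError

out_of_range = False
try:
    mylist[3]                     # only indexes 0, 1, and 2 exist
except IndexError:
    out_of_range = True

string_key = mydict['5']          # 'five'
has_int_key = 5 in mydict         # False, so mydict[5] would raise KeyError

keys = list(mydict.keys())        # just the keys
values = list(mydict.values())    # just the values
items = list(mydict.items())      # (key, value) pairs

appended = [1, 2]
appended.append([5])              # adds the list itself: [1, 2, [5]]
extended = [1, 2]
extended.extend([5])              # adds each element: [1, 2, 5]

# sorted() returns a new list and leaves the original alone
records = [{'name': 'kiwis', 'count': 76}, {'name': 'apples', 'count': 300}]
by_name = sorted(records, key=lambda r: r['name'])
```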
Relevant readings
- “Automate” chapter on Lists: https://automatetheboringstuff.com/chapter4/
- “Automate” chapter on Dictionaries and Structuring Data: https://automatetheboringstuff.com/chapter5/
- “Automate” chapter on Working with CSV files and JSON data: https://automatetheboringstuff.com/chapter14/
- An introduction to data serialization and Python Requests http://www.compjour.org/tutorials/intro-to-python-requests-and-json/
- A quiz on dicts and lists: http://www.compjour.org/homework/json-quiz-part-1/
- csv - reading and writing delimited text data
- Python’s documentation for the csv library: https://docs.python.org/3/library/csv.html
- Python’s documentation for the json library: https://docs.python.org/3/library/json.html
- Sorting in Python with the sorted() function: http://www.compciv.org/guides/python/fundamentals/sorting-collections-with-sorted/
Bored of loops and conditionals?
If you have programming experience from CS106 and want to practice something other than the programming fundamentals of loops, conditionals, and basic data structures, then attempt to solve these data exercises the “Pythonic” way.
Use a list comprehension instead of a for-loop to build a new list
http://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html
Instead of:
newlist = []
for x in oldlist:
    myval = x['mykey']
    newlist.append(myval)
Try:
newlist = [x['mykey'] for x in oldlist]
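A comprehension can also filter as it builds. Here’s a sketch, with a made-up oldlist and a hypothetical 'keep' key, showing a loop with an if-statement next to the equivalent comprehension:

```python
# Made-up data; 'keep' is a hypothetical key used only for this example
oldlist = [
    {'mykey': 'a', 'keep': True},
    {'mykey': 'b', 'keep': False},
    {'mykey': 'c', 'keep': True},
]

# The loop-and-conditional version
newlist = []
for x in oldlist:
    if x['keep']:
        newlist.append(x['mykey'])

# The same thing as a comprehension; the if-clause goes at the end
comprehended = [x['mykey'] for x in oldlist if x['keep']]

print(newlist == comprehended)
```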
Use specialized data structures from the collections module
Python’s collections module has a few variations on the standard Python list and dict types:
https://docs.python.org/3/library/collections.html
For example, given this list of lists:
vote_results = [
    ['Trump', 48],
    ['Clinton', 46],
    ['Trump', 24],
    ['Clinton', 23],
    ['Gary', 2]
]
We want to do a group count by grouping the lists by their “name” values (e.g. 'Trump', 'Clinton') and summing up their “count” values, e.g. 48 and 23. The result should be a dictionary like this:
{'Clinton': 69, 'Gary': 2, 'Trump': 72}
To do this using just standard loops and a dictionary:
tally = {}
for row in vote_results:
    candidate = row[0]
    votes = row[1]
    if tally.get(candidate):
        tally[candidate] += votes
    else:
        tally[candidate] = votes
But here’s one alternative simplification using the defaultdict type:
from collections import defaultdict
tally = defaultdict(int)
for row in vote_results:
    candidate = row[0]
    votes = row[1]
    tally[candidate] += votes
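The collections module also has a Counter type, which can do the same tally. This is just another option, sketched with the same vote_results list; whether it reads better than defaultdict is a matter of taste.

```python
from collections import Counter

vote_results = [
    ['Trump', 48],
    ['Clinton', 46],
    ['Trump', 24],
    ['Clinton', 23],
    ['Gary', 2]
]

# A Counter treats missing keys as 0, so no if/else is needed
tally = Counter()
for row in vote_results:
    candidate = row[0]
    votes = row[1]
    tally[candidate] += votes

# Counter is a dict subclass; dict() gives back a plain dictionary
plain_tally = dict(tally)

# most_common(1) returns the single highest (key, count) pair in a list
leader = tally.most_common(1)
print(plain_tally, leader)
```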
Unpacking argument lists
https://docs.python.org/3/tutorial/controlflow.html#unpacking-argument-lists
Instead of this:
results = [['Trump', 48, 'Iowa'], ['Clinton', 46, 'Michigan']]
for row in results:
    candidate = row[0]
    votes = row[1]
    state = row[2]
Do:
results = [['Trump', 48, 'Iowa'], ['Clinton', 46, 'Michigan']]
for row in results:
    candidate, votes, state = row
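Two related shortcuts, sketched with the same results list: you can unpack right in the for statement, and you can use the * operator to unpack a list into a function’s arguments (which is what the linked tutorial section describes). The describe() function is hypothetical, made up for this example.

```python
results = [['Trump', 48, 'Iowa'], ['Clinton', 46, 'Michigan']]

# Unpack each row directly in the for statement
summaries = []
for candidate, votes, state in results:
    summaries.append(f'{candidate}: {votes} ({state})')

# Hypothetical helper that expects three separate arguments
def describe(candidate, votes, state):
    return f'{candidate}: {votes} ({state})'

# The * operator spreads each 3-item row across the 3 parameters
described = [describe(*row) for row in results]

print(summaries)
print(described)
```

Both forms raise an error if a row doesn’t have exactly three items, which is often a useful early warning that the data isn’t shaped the way you assumed.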
The questions
All of these questions are meant to be answered by writing functions that return the desired value. The questions apply to both the CSV and JSON versions of the data, and the answers should be the same.
Basically, create a single Python script file. And for each question, create an appropriately-numbered function, e.g. foo_1 through foo_8. And each function should have a return statement.
Also, each function should be self-contained in that it downloads from the relevant data URL and deserializes the downloaded text into a Python object.
3. Get an alphabetized list of the names of each inventory item
Expected result:
['apples', 'cats', 'dogs', 'kiwis', 'zebras']
5. Filter inventory for the “animal” items, and create ordered list of [‘animalname’, animalcount] pairs
Expected result:
[['zebras', 180], ['dogs', 42], ['cats', 9]]
6. Filter inventory for “fruit” items, create ordered list of trimmed-dictionary objects
Expected result:
[{'count': 76, 'name': 'kiwis'}, {'count': 300, 'name': 'apples'}]
Information and Hints
Even though the solutions to each question should be packaged as a standalone function, you should be using the interactive ipython shell to walk through the steps. After you’ve confirmed that each line of your program works, then you re-write it in your text editor as a function inside a Python script.
(More hints to come, maybe)
Organizing these examples as a project
I ask that you solve these exercises by creating a file and writing a function for each problem. So if you want to solve just the JSON exercises, create a Python script named json_fun.py. Then define a function named foo_x():
def foo_x():
    return 'just testing'
To interactively test out your script, you can jump into the ipython shell, relative to your working directory (i.e. in the same directory as your Python scripts). And then you import your script as if it were any other Python module:
>>> import json_fun
>>> json_fun.foo_x()
'just testing'
For certain homework assignments, I also include a foo_assertions() function, which contains a long list of assert statements. Basically, it is automated testing of what your foo_1(), etc. functions should be returning.
Here’s an example of a script that has the first 3 JSON problems done and also contains a foo_assertions() function that runs a couple of tests on each foo_x() function. If you alter any of the sample functions to return something that you know is wrong, you’ll see what foo_assertions() does.
import json
import requests
from os.path import basename, exists, join
from os import makedirs

DEST_DIR = 'data-files'
DATA_URL = 'http://stash.compciv.org/2017/helloworld.json'

def fetch_and_save_url(url):
    """
    For a given URL, creates a filename to save to
    Checks to see if filename already exists; if not, download and save to that file name
    Either way, return the filename as a string
    """
    makedirs(DEST_DIR, exist_ok=True)
    dest_filename = join(DEST_DIR, basename(url))
    if not exists(dest_filename):
        resp = requests.get(url)
        with open(dest_filename, 'wb') as f:
            f.write(resp.content)
    return dest_filename

def parse_data():
    fname = fetch_and_save_url(DATA_URL)
    # the with-block closes the file automatically
    with open(fname, 'r') as thefile:
        rawtxt = thefile.read()
    return json.loads(rawtxt)

def foo_1():
    jdata = parse_data()
    return jdata['status']

def foo_2():
    jdata = parse_data()
    items = jdata['inventory']
    return len(items)

def foo_3():
    jdata = parse_data()
    items = jdata['inventory']
    y = []
    for i in items:
        y.append(i['name'])
    return sorted(y)

def foo_assertions():
    x = foo_1()
    assert type(x) is str, 'Expect foo_1() to return a str'
    assert x == 'SUPER!', 'Expect the "status" key of the data to have a value of "SUPER!"'
    x = foo_2()
    assert type(x) is int, 'Expect foo_2() to return an int'
    assert x == 5, 'Expect foo_2() to find that there are 5 items'
    x = foo_3()
    assert type(x) is list, 'Expect foo_3() to return a list'
    assert x == ['apples', 'cats', 'dogs', 'kiwis', 'zebras'], 'Expect the list of item names to be in this sorted order'

if __name__ == '__main__':
    foo_assertions()
    print("Done with assertions!")
The data as text
Here’s what the JSON looks like:
http://stash.compciv.org/2017/helloworld.json
{
  "status": "SUPER!",
  "hello": "world",
  "inventory": [
    {
      "name": "dogs",
      "type": "animal",
      "count": 42
    },
    {
      "name": "apples",
      "type": "fruit",
      "count": 300
    },
    {
      "name": "kiwis",
      "type": "fruit",
      "count": 76
    },
    {
      "name": "zebras",
      "type": "animal",
      "count": 180
    },
    {
      "name": "cats",
      "type": "animal",
      "count": 9
    }
  ]
}
And here’s what the CSV version looks like:
http://stash.compciv.org/2017/helloworld.csv
status:SUPER!
hello:world
---
name,type,count
dogs,animal,42
apples,fruit,300
kiwis,fruit,76
zebras,animal,180
cats,animal,9
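If you want the lines above the --- separator as data too, here’s one possible sketch that turns them into a dictionary. Treating each of those lines as a key:value pair is my assumption based on this one file, and the names metalines and metadata are made up for illustration.

```python
# Just the top of the helloworld.csv text, embedded for illustration
csvtext = """status:SUPER!
hello:world
---
name,type,count
dogs,animal,42
"""

lines = csvtext.splitlines()

# Assumption: the metadata is everything before the '---' line
metalines = lines[:lines.index('---')]

metadata = {}
for line in metalines:
    # partition splits on the first ':' only, so a value may contain colons
    key, _, value = line.partition(':')
    metadata[key] = value

print(metadata)
```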
Solutions
Complete solution for the JSON-formatted data
(doesn’t include assertions – you can write those yourself.)
import json
import requests
SRC_URL = 'http://stash.compciv.org/2017/helloworld.json'
def foo_1():
    """
    in helloworld.json
    return the value of the 'status' key/attribute
    expected:
    'SUPER!'
    """
    resp = requests.get(SRC_URL)
    txt = resp.text
    jdata = json.loads(txt)
    return jdata['status']
def foo_2():
    """
    in helloworld.json
    return the number of items in the 'inventory'
    expected:
    5
    """
    jdata = json.loads(requests.get(SRC_URL).text)
    inventory = jdata['inventory']
    return len(inventory)
def foo_3():
    """
    in helloworld.json
    return an alphabetized list of 'inventory' item names
    expected:
    ['apples', 'cats', 'dogs', 'kiwis', 'zebras']
    """
    inventory = json.loads(requests.get(SRC_URL).text)['inventory']
    nameslist = []
    for item in inventory:
        itemname = item['name']
        nameslist.append(itemname)
    sortednames = sorted(nameslist)
    return sortednames
def foo_4():
    """
    in helloworld.json
    return the sum of inventory counts
    expected:
    607
    """
    thesum = 0
    for item in json.loads(requests.get(SRC_URL).text)['inventory']:
        c = item['count']
        thesum += c
    return thesum
def foo_5():
    """
    from helloworld.json
    filter inventory for just animals
    return a list of lists, with each sublist containing animal name and count (as integer)
    and sorted in descending order of count
    expected:
    [['zebras', 180], ['dogs', 42], ['cats', 9]]
    """
    def sorter(thing):
        return thing[1]
    thelist = []
    for item in json.loads(requests.get(SRC_URL).text)['inventory']:
        if item['type'] == 'animal':
            n = item['name']
            c = item['count']
            thelist.append([n, c])
    return sorted(thelist, key=sorter, reverse=True)
def foo_6():
    """
    from helloworld.json
    filter inventory for just fruits
    return a list of dicts, with each dict containing fruit name and count (as integer)
    and sorted in ascending order of count
    expected:
    [{'count': 76, 'name': 'kiwis'}, {'count': 300, 'name': 'apples'}]
    """
    def sorter(thing):
        return thing['count']
    thelist = []
    for item in json.loads(requests.get(SRC_URL).text)['inventory']:
        if item['type'] == 'fruit':
            d = {}
            d['name'] = item['name']
            d['count'] = item['count']
            thelist.append(d)
    return sorted(thelist, key=sorter, reverse=False)
def foo_7():
    """
    from helloworld.json
    do a group count of the 'inventory' by 'type', and get a count of how many unique items there are by name
    return a dictionary with each key-value pair consisting of the 'type' and count of unique item names
    expected:
    {'animal': 3, 'fruit': 2}
    """
    thedict = {}
    inventory = json.loads(requests.get(SRC_URL).text)['inventory']
    for item in inventory:
        itype = item['type']
        if thedict.get(itype):
            thedict[itype] += 1
        else:
            thedict[itype] = 1
    return thedict
def foo_8():
    """
    from helloworld.json
    do a group count of the 'inventory' by 'type', summing up the counts of each item for every given type
    return a dictionary with each key-value pair consisting of the 'type' and sum of item counts
    expected:
    {'animal': 231, 'fruit': 376}
    """
    thedict = {}
    inventory = json.loads(requests.get(SRC_URL).text)['inventory']
    for item in inventory:
        itype = item['type']
        if thedict.get(itype):
            thedict[itype] += item['count']
        else:
            thedict[itype] = item['count']
    return thedict
Complete solution for the CSV-formatted data
import csv
import requests
SRC_URL = 'http://stash.compciv.org/2017/helloworld.csv'
def foo_1():
    """
    in helloworld.csv
    return the 'status' value in the file's "metadata"
    expected:
    'SUPER!'
    """
    resp = requests.get(SRC_URL)
    txt = resp.text
    lines = txt.splitlines()
    # look through each line
    for line in lines:
        if 'status:' in line:
            keyvalpair = line.split(':')
            return keyvalpair[1]
def foo_2():
    """
    in helloworld.csv
    return the number of records
    expected:
    5
    """
    resp = requests.get(SRC_URL)
    lines = resp.text.splitlines()
    # headers are on line 4, i.e. index 3
    datalines = lines[3:]
    records = list(csv.DictReader(datalines))
    return len(records)
def foo_3():
    """
    in helloworld.csv
    return an alphabetized list of 'inventory' item names
    expected:
    ['apples', 'cats', 'dogs', 'kiwis', 'zebras']
    """
    resp = requests.get(SRC_URL)
    datalines = resp.text.splitlines()[3:]
    records = list(csv.DictReader(datalines))
    nameslist = []
    for item in records:
        itemname = item['name']
        nameslist.append(itemname)
    sortednames = sorted(nameslist)
    return sortednames
def foo_4():
    """
    in helloworld.csv
    return the sum of inventory counts
    expected:
    607
    """
    resp = requests.get(SRC_URL)
    datalines = resp.text.splitlines()[3:]
    records = list(csv.DictReader(datalines))
    thesum = 0
    for item in records:
        c = int(item['count'])
        thesum += c
    return thesum
def foo_5():
    """
    from helloworld.csv
    filter inventory for just animals
    return a list of lists, with each sublist containing animal name and count (as integer)
    and sorted in descending order of count
    expected:
    [['zebras', 180], ['dogs', 42], ['cats', 9]]
    """
    resp = requests.get(SRC_URL)
    records = list(csv.DictReader(resp.text.splitlines()[3:]))
    def sorter(thing):
        return thing[1]
    thelist = []
    for item in records:
        if item['type'] == 'animal':
            n = item['name']
            c = int(item['count'])
            thelist.append([n, c])
    return sorted(thelist, key=sorter, reverse=True)
def foo_6():
    """
    from helloworld.csv
    filter inventory for just fruits
    return a list of dicts, with each dict containing fruit name and count (as integer)
    and sorted in ascending order of count
    expected:
    [{'count': 76, 'name': 'kiwis'}, {'count': 300, 'name': 'apples'}]
    """
    resp = requests.get(SRC_URL)
    records = list(csv.DictReader(resp.text.splitlines()[3:]))
    def sorter(thing):
        return thing['count']
    thelist = []
    for item in records:
        if item['type'] == 'fruit':
            d = {}
            d['name'] = item['name']
            d['count'] = int(item['count'])
            thelist.append(d)
    return sorted(thelist, key=sorter, reverse=False)
def foo_7():
    """
    from helloworld.csv
    do a group count of the 'inventory' by 'type', and get a count of how many unique items there are by name
    return a dictionary with each key-value pair consisting of the 'type' and count of unique item names
    expected:
    {'animal': 3, 'fruit': 2}
    """
    resp = requests.get(SRC_URL)
    records = list(csv.DictReader(resp.text.splitlines()[3:]))
    thedict = {}
    for item in records:
        itype = item['type']
        if thedict.get(itype):
            thedict[itype] += 1
        else:
            thedict[itype] = 1
    return thedict
def foo_8():
    """
    from helloworld.csv
    do a group count of the 'inventory' by 'type', summing up the counts of each item for every given type
    return a dictionary with each key-value pair consisting of the 'type' and sum of item counts
    expected:
    {'animal': 231, 'fruit': 376}
    """
    resp = requests.get(SRC_URL)
    records = list(csv.DictReader(resp.text.splitlines()[3:]))
    thedict = {}
    for item in records:
        itype = item['type']
        if thedict.get(itype):
            thedict[itype] += int(item['count'])
        else:
            thedict[itype] = int(item['count'])
    return thedict