Hello Data De-Serialization with JSON and CSV

Just in case you need more practice with the concept of converting a string of text into Python data objects such as dictionaries and lists, here are 16 exercises involving a very trivial, nonsensical dataset that has been serialized into JSON and CSV.

What you should know

About the data

Don’t try to make real-world sense of the data: it’s too dumb and simple to describe in real-world terms. Just care about the structure.

http://stash.compciv.org/2017/helloworld.json

http://stash.compciv.org/2017/helloworld.csv

Note: in order to make it so both text files conveyed roughly the same information, I deliberately made the CSV file, well, not a CSV file by throwing in unstructured text at the top of the file. This is actually something you’ll see in real-world datasets, where a dataset owner will insert text meant as metadata, such as a copyright notice or contact address, which will cause CSV-parsing programs such as Excel to think that the actual “data” is messed up.

So how to get around this? Remember that a CSV text file, when opened and read, is just a plain Python string. Are there parts of that string that are irrelevant to what you want to send to the CSV parser, i.e. csv.reader()? Then don’t send those parts of the string.
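As a minimal sketch of that idea, the snippet below hard-codes a copy of the CSV text shown later on this page (so that it runs without a download) and assumes that everything up to and including the `---` line is metadata. The variable names are made up for illustration.

```python
import csv

# A hard-coded copy of helloworld.csv, so this sketch runs without a download
csvtext = """status:SUPER!
hello:world
---
name,type,count
dogs,animal,42
apples,fruit,300
kiwis,fruit,76
zebras,animal,180
cats,animal,9
"""

lines = csvtext.splitlines()

# Assumption: the '---' line marks the end of the metadata,
# so the header row is the line right after it
start = lines.index('---') + 1
datalines = lines[start:]

# Only the real CSV portion is handed to the parser
records = list(csv.DictReader(datalines))
print(len(records))          # 5
print(records[0]['name'])    # dogs
print(records[0]['count'])   # 42 (as a string; csv does not convert numbers)
```

Note that the csv module gives you every value as a string, which is why the CSV version of the exercises needs int() where the JSON version does not.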

About Python and data formats

You should be familiar enough with the csv and json built-in libraries and methods for deserializing text strings into data objects:

import csv
import json

data = csv.reader(csvtext.splitlines())
# or...
data = csv.DictReader(csvtext.splitlines())

# or...
data = json.loads(jsontext)

# or, given an open file object rather than a string...
data = json.load(open(jsonfilename))

And you should know how to tell the difference between a Python list and a Python dict, and how to get around the internals of their respective structures.

Particularly:

  • That lists are “zero-indexed”
  • What KeyError and IndexError mean.
  • The difference between mydict['somekey'] and mydict.get('somekey') (assuming mydict is a dictionary)
  • The very important difference between mydict[5] and mydict['5'].
  • The difference between mydict.keys(), mydict.values(), and mydict.items()
  • The difference between mylist.append(5) and mylist.append([5]), assuming mylist is a list
  • The difference between mylist.append([5]) and mylist.extend([5]), assuming mylist is a list
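If any of the items above feel fuzzy, here is a small sketch you can paste into the interactive shell. The values in mydict and mylist are invented purely for illustration.

```python
mydict = {5: 'integer key', '5': 'string key'}

# Square brackets raise KeyError for a missing key...
try:
    mydict['somekey']
except KeyError:
    print('KeyError: no such key')

# ...while get() quietly returns None, or a default that you supply
print(mydict.get('somekey'))       # None
print(mydict.get('somekey', 0))    # 0

# The integer 5 and the string '5' are two different keys
print(mydict[5])      # integer key
print(mydict['5'])    # string key

# keys(), values(), and items() give three different views of the same dict
print(list(mydict.keys()))     # [5, '5']
print(list(mydict.values()))   # ['integer key', 'string key']
print(list(mydict.items()))    # [(5, 'integer key'), ('5', 'string key')]

mylist = [1, 2]
mylist.append(5)      # adds the integer 5         -> [1, 2, 5]
mylist.append([5])    # adds a one-item list       -> [1, 2, 5, [5]]
mylist.extend([5])    # adds each item in the list -> [1, 2, 5, [5], 5]
print(mylist)

# Lists are zero-indexed, so the last valid index is len(mylist) - 1
try:
    mylist[len(mylist)]
except IndexError:
    print('IndexError: index out of range')
```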

And, of course, how to create a sorted copy of a dictionary or list, sorted by any key/field you want.
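As a quick refresher, here is a sketch of sorting with the built-in sorted() function and a key function. The sample rows and the helper names by_count and by_value are made up for this example.

```python
rows = [
    {'name': 'dogs', 'count': 42},
    {'name': 'apples', 'count': 300},
    {'name': 'cats', 'count': 9},
]

def by_count(row):
    # sorted() calls this once per item and sorts by what it returns
    return row['count']

# sorted() returns a new list; the original rows list is left unchanged
rows_desc = sorted(rows, key=by_count, reverse=True)
print(rows_desc[0]['name'])   # apples
print(rows[0]['name'])        # dogs

# Sorting a dictionary: sort its items, which gives a list of (key, value) pairs
tally = {'fruit': 376, 'animal': 231}
print(sorted(tally.items()))  # [('animal', 231), ('fruit', 376)]

def by_value(pair):
    return pair[1]

# Passing the sorted pairs to dict() builds a new dict in that order
# (dicts preserve insertion order in Python 3.7 and later)
tally_by_value = dict(sorted(tally.items(), key=by_value))
print(list(tally_by_value))   # ['animal', 'fruit']
```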

Relevant readings

Bored of loops and conditionals?

If you have programming experience from CS106 and want to practice something other than the programming fundamentals of loops, conditionals, and basic data structures, then attempt to solve these data exercises the “Pythonic” way.

Use a list comprehension instead of a for-loop to build a new list

http://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html

Instead of:

newlist = []
for x in oldlist:
    myval = x['mykey']
    newlist.append(myval)

Try:

newlist = [x['mykey'] for x in oldlist]
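A comprehension can also filter as it goes, by adding an if clause at the end. The sketch below uses a shortened, hard-coded version of the inventory list; the variable names are only for illustration.

```python
inventory = [
    {'name': 'dogs', 'type': 'animal', 'count': 42},
    {'name': 'apples', 'type': 'fruit', 'count': 300},
    {'name': 'kiwis', 'type': 'fruit', 'count': 76},
]

# Build a new list, keeping only the items that pass the condition
fruit_names = [item['name'] for item in inventory if item['type'] == 'fruit']
print(fruit_names)   # ['apples', 'kiwis']

# The same syntax without the brackets can be passed straight to sum()
total = sum(item['count'] for item in inventory)
print(total)         # 418
```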

Use specialized data structures from the collections module

Python’s collections module has a few variations on the standard Python list and dict types:

https://docs.python.org/3/library/collections.html

For example, given this list of lists:

vote_results = [
    ['Trump', 48],
    ['Clinton', 46],
    ['Trump', 24],
    ['Clinton', 23],
    ['Gary', 2]
]

We want to do a group count: group the lists by their “name” values (e.g. 'Trump', 'Clinton') and sum up their “count” values, e.g. 46 and 23 for 'Clinton'. The result should be a dictionary like this:

{'Clinton': 69, 'Gary': 2, 'Trump': 72}

To do this using just standard loops and a dictionary:

tally = {}
for row in vote_results:
    candidate = row[0]
    votes = row[1]
    if tally.get(candidate):
        tally[candidate] += votes
    else:
        tally[candidate] = votes

But here’s one alternative simplification using the defaultdict type:

from collections import defaultdict
tally = defaultdict(int)
for row in vote_results:
    candidate = row[0]
    votes = row[1]
    tally[candidate] += votes
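Another option from the same module is Counter, a dict subclass in which missing keys count as zero. This is just one more possible sketch, not a required approach:

```python
from collections import Counter

vote_results = [
    ['Trump', 48],
    ['Clinton', 46],
    ['Trump', 24],
    ['Clinton', 23],
    ['Gary', 2]
]

tally = Counter()
for row in vote_results:
    candidate = row[0]
    votes = row[1]
    # A missing key counts as 0, so no if/else is needed
    tally[candidate] += votes

print(dict(tally))            # {'Trump': 72, 'Clinton': 69, 'Gary': 2}
# most_common() returns (key, count) pairs, largest count first
print(tally.most_common(1))   # [('Trump', 72)]
```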

Unpacking argument lists

https://docs.python.org/3/tutorial/controlflow.html#unpacking-argument-lists

Instead of this:

results = [['Trump', 48, 'Iowa'], ['Clinton', 46, 'Michigan']]
for row in results:
    candidate = row[0]
    votes = row[1]
    state = row[2]

Do:

results = [['Trump', 48, 'Iowa'], ['Clinton', 46, 'Michigan']]
for row in results:
    candidate, votes, state = row
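You can go one step further and unpack right in the for statement. And the * operator, which is what the linked tutorial section describes, unpacks a list into separate arguments for a function call. In this sketch, describe() is a made-up helper that exists only to show the call.

```python
def describe(candidate, votes, state):
    # Hypothetical helper: takes three separate arguments
    return f'{candidate}: {votes} votes in {state}'

results = [['Trump', 48, 'Iowa'], ['Clinton', 46, 'Michigan']]

# Unpack each row directly in the for statement
descriptions = []
for candidate, votes, state in results:
    descriptions.append(describe(candidate, votes, state))

# describe(*row) is the same as describe(row[0], row[1], row[2])
same_descriptions = [describe(*row) for row in results]

print(descriptions[0])                     # Trump: 48 votes in Iowa
print(descriptions == same_descriptions)   # True
```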

The questions

All of these questions are meant to be answered by writing functions that return the desired value. The questions apply to both the CSV and JSON versions of the data, and the answers should be the same.

Basically, create a single Python script file. And for each question, create an appropriately-numbered function, e.g. foo_1 through foo_8. And each function should have a return statement.

Also, each function should be self-contained, in that it downloads from the relevant data URL and deserializes the downloaded text into a Python object.

3. Get an alphabetized list of the names of each inventory item

Expected result:

['apples', 'cats', 'dogs', 'kiwis', 'zebras']

6. Filter inventory for “fruit” items, create ordered list of trimmed-dictionary objects

Expected result:

[{'count': 76, 'name': 'kiwis'}, {'count': 300, 'name': 'apples'}]

7. Do a group count of inventory items by their “type”

Expected result:

{'animal': 3, 'fruit': 2}

8. Do a summation of item counts, grouped by “type”

Expected result:

{'animal': 231, 'fruit': 376}

Information and Hints

Even though the solutions to each question should be packaged as a standalone function, you should be using the interactive ipython shell to walk through the steps. After you’ve confirmed that each line of your program works, then you re-write it in your text editor as a function inside a Python script.

(More hints to come, maybe)

Organizing these examples as a project

I ask that you solve these exercises by creating a file and writing a function for each problem. So if you want to solve just the JSON exercises, create a Python script named json_fun.py.

Then define a function named foo_x():

def foo_x():
    return 'just testing'

To interactively test out your script, you can jump into the ipython shell, relative to your working directory (i.e. in the same directory as your python scripts). And then you include your script as if it were any other Python module:

>>> import json_fun
>>> json_fun.foo_x()
'just testing'

For certain homework assignments, I also include a foo_assertions() function, which contains a long list of assert statements. Basically, it is automated testing of what your foo_1(), etc. functions should be returning.

Here’s an example of a script that has the first 3 JSON problems done and also contains a foo_assertions() function that runs a couple of tests on each of the foo_1(), foo_2(), and foo_3() functions. If you alter any of the sample functions to return something that you know is wrong, you’ll see what foo_assertions() does.

import json
import requests
from os.path import basename, exists, join
from os import makedirs

DEST_DIR = 'data-files'

DATA_URL = 'http://stash.compciv.org/2017/helloworld.json'

def fetch_and_save_url(url):
    """
    For a given URL, creates a filename to save to
    Checks to see if filename already exists; if not, download and save to that file name

    Either way, return the filename as a string
    """
    makedirs(DEST_DIR, exist_ok=True)
    dest_filename = join(DEST_DIR, basename(url))
    if not exists(dest_filename):
        resp = requests.get(url)
        with open(dest_filename, 'wb') as f:
            f.write(resp.content)
    return dest_filename


def parse_data():
    fname = fetch_and_save_url(DATA_URL)
    with open(fname, 'r') as thefile:
        rawtxt = thefile.read()

    return json.loads(rawtxt)


def foo_1():
    jdata = parse_data()
    return jdata['status']


def foo_2():
    jdata = parse_data()
    items = jdata['inventory']
    return len(items)


def foo_3():
    jdata = parse_data()
    items = jdata['inventory']
    y = []
    for i in items:
        y.append(i['name'])

    return sorted(y)




def foo_assertions():
    x = foo_1()
    assert type(x) is str, 'Expect foo_1() to return a str'
    assert x == 'SUPER!', 'Expect the "status" key of the data to have a value of "SUPER!"'

    x = foo_2()
    assert type(x) is int, 'Expect foo_2() to return an int'
    assert x == 5, 'Expect foo_2() to find that there are 5 items'

    x = foo_3()
    assert type(x) is list, 'Expect foo_3() to return a list'
    assert x == ['apples', 'cats', 'dogs', 'kiwis', 'zebras'], 'Expect the list of item names to be in this sorted order'



if __name__ == '__main__':
    foo_assertions()
    print("Done with assertions!")

The data as text

Here’s what the JSON looks like:

http://stash.compciv.org/2017/helloworld.json

{
  "status": "SUPER!",
  "hello": "world",
  "inventory": [
    {
      "name": "dogs",
      "type": "animal",
      "count": 42
    },
    {
      "name": "apples",
      "type": "fruit",
      "count": 300
    },
    {
      "name": "kiwis",
      "type": "fruit",
      "count": 76
    },
    {
      "name": "zebras",
      "type": "animal",
      "count": 180
    },
    {
      "name": "cats",
      "type": "animal",
      "count": 9
    }
  ]
}

And here’s what the CSV version looks like:

http://stash.compciv.org/2017/helloworld.csv

status:SUPER!
hello:world
---
name,type,count
dogs,animal,42
apples,fruit,300
kiwis,fruit,76
zebras,animal,180
cats,animal,9

Solutions

Complete solution for the JSON-formatted data

(doesn’t include assertions – you can write those yourself.)

import json
import requests

SRC_URL = 'http://stash.compciv.org/2017/helloworld.json'

def foo_1():
    """
    in helloworld.json
    return the value of the 'status' key/attribute

    expected:
    'SUPER!'
    """
    resp = requests.get(SRC_URL)
    txt = resp.text
    jdata = json.loads(txt)
    return jdata['status']


def foo_2():
    """
    in helloworld.json
    return the number of items in the 'inventory'

    expected:
    5
    """
    jdata = json.loads(requests.get(SRC_URL).text)
    inventory = jdata['inventory']
    return len(inventory)



def foo_3():
    """
    in helloworld.json
    return an alphabetized list of 'inventory' item names

    expected:
    ['apples', 'cats', 'dogs', 'kiwis', 'zebras']
    """
    inventory = json.loads(requests.get(SRC_URL).text)['inventory']
    nameslist = []
    for item in inventory:
        itemname = item['name']
        nameslist.append(itemname)
    sortednames = sorted(nameslist)
    return sortednames



def foo_4():
    """
    in helloworld.json
    return the sum of inventory counts

    expected:
    607
    """
    thesum = 0
    for item in json.loads(requests.get(SRC_URL).text)['inventory']:
        c = item['count']
        thesum += c
    return thesum


def foo_5():
    """
    from helloworld.json
    filter inventory for just animals
    return a list of lists, with each sublist containing animal name and count (as integer)
        and sorted in descending order of count

    expected:
    [['zebras', 180], ['dogs', 42], ['cats', 9]]

    """

    def sorter(thing):
        return thing[1]


    thelist = []
    for item in json.loads(requests.get(SRC_URL).text)['inventory']:
        if item['type'] == 'animal':
            n = item['name']
            c = item['count']
            thelist.append([n, c])

    return sorted(thelist, key=sorter, reverse=True)



def foo_6():
    """
    from helloworld.json
    filter inventory for just fruits
    return a list of dicts, with each dict containing fruit name and count (as integer)
        and sorted in ascending order of count

    expected:
    [{'count': 76, 'name': 'kiwis'}, {'count': 300, 'name': 'apples'}]
    """
    def sorter(thing):
        return thing['count']

    thelist = []
    for item in json.loads(requests.get(SRC_URL).text)['inventory']:
        if item['type'] == 'fruit':
            d = {}
            d['name'] = item['name']
            d['count'] = item['count']
            thelist.append(d)

    return sorted(thelist, key=sorter, reverse=False)



def foo_7():
    """
    from helloworld.json
    do a group count of the 'inventory' by 'type', and get a count of how many unique items there are by name
    return a dictionary with each key-value pair consisting of the 'type' and count of unique item names

    expected:
    {'animal': 3, 'fruit': 2}
    """
    thedict = {}
    inventory = json.loads(requests.get(SRC_URL).text)['inventory']
    for item in inventory:
        itype = item['type']
        if thedict.get(itype):
            thedict[itype] += 1
        else:
            thedict[itype] = 1
    return thedict



def foo_8():
    """
    from helloworld.json
    do a group count of the 'inventory' by 'type', summing up the counts of each item for every given type
    return a dictionary with each key-value pair consisting of the 'type' and sum of item counts

    expected:
    {'animal': 231, 'fruit': 376}

    """
    thedict = {}
    inventory = json.loads(requests.get(SRC_URL).text)['inventory']
    for item in inventory:
        itype = item['type']
        if thedict.get(itype):
            thedict[itype] += item['count']
        else:
            thedict[itype] = item['count']
    return thedict

Complete solution for the CSV-formatted data

import csv
import requests

SRC_URL = 'http://stash.compciv.org/2017/helloworld.csv'

def foo_1():
    """
    in helloworld.csv
    return the 'status' value in the file's "metadata"

    expected:
    'SUPER!'

    """
    resp = requests.get(SRC_URL)
    txt = resp.text
    lines = txt.splitlines()
    # look through each line
    for line in lines:
        if 'status:' in line:
            keyvalpair = line.split(':')
            return keyvalpair[1]


def foo_2():
    """
    in helloworld.csv
    return the number of records

    expected:
    5
    """
    resp = requests.get(SRC_URL)
    lines = resp.text.splitlines()
    # headers are on line 4, i.e. index 3
    datalines = lines[3:]
    records = list(csv.DictReader(datalines))

    return len(records)



def foo_3():
    """
    in helloworld.csv
    return an alphabetized list of 'inventory' item names

    expected:
    ['apples', 'cats', 'dogs', 'kiwis', 'zebras']
    """
    resp = requests.get(SRC_URL)
    datalines = resp.text.splitlines()[3:]
    records = list(csv.DictReader(datalines))

    nameslist = []
    for item in records:
        itemname = item['name']
        nameslist.append(itemname)
    sortednames = sorted(nameslist)

    return sortednames



def foo_4():
    """
    in helloworld.csv
    return the sum of inventory counts

    expected:
    607
    """
    resp = requests.get(SRC_URL)
    datalines = resp.text.splitlines()[3:]
    records = list(csv.DictReader(datalines))

    thesum = 0
    for item in records:
        c = int(item['count'])
        thesum += c
    return thesum


def foo_5():
    """
    from helloworld.csv
    filter inventory for just animals
    return a list of lists, with each sublist containing animal name and count (as integer)
        and sorted in descending order of count

    expected:
    [['zebras', 180], ['dogs', 42], ['cats', 9]]

    """
    resp = requests.get(SRC_URL)
    records = list(csv.DictReader(resp.text.splitlines()[3:]))

    def sorter(thing):
        return thing[1]

    thelist = []
    for item in records:
        if item['type'] == 'animal':
            n = item['name']
            c = int(item['count'])
            thelist.append([n, c])

    return sorted(thelist, key=sorter, reverse=True)



def foo_6():
    """
    from helloworld.csv
    filter inventory for just fruits
    return a list of dicts, with each dict containing fruit name and count (as integer)
        and sorted in ascending order of count

    expected:
    [{'count': 76, 'name': 'kiwis'}, {'count': 300, 'name': 'apples'}]
    """
    resp = requests.get(SRC_URL)
    records = list(csv.DictReader(resp.text.splitlines()[3:]))

    def sorter(thing):
        return thing['count']

    thelist = []
    for item in records:
        if item['type'] == 'fruit':
            d = {}
            d['name'] = item['name']
            d['count'] = int(item['count'])
            thelist.append(d)

    return sorted(thelist, key=sorter, reverse=False)



def foo_7():
    """
    from helloworld.csv
    do a group count of the 'inventory' by 'type', and get a count of how many unique items there are by name
    return a dictionary with each key-value pair consisting of the 'type' and count of unique item names

    expected:
    {'animal': 3, 'fruit': 2}
    """
    resp = requests.get(SRC_URL)
    records = list(csv.DictReader(resp.text.splitlines()[3:]))

    thedict = {}
    for item in records:
        itype = item['type']
        if thedict.get(itype):
            thedict[itype] += 1
        else:
            thedict[itype] = 1
    return thedict



def foo_8():
    """
    from helloworld.csv
    do a group count of the 'inventory' by 'type', summing up the counts of each item for every given type
    return a dictionary with each key-value pair consisting of the 'type' and sum of item counts

    expected:
    {'animal': 231, 'fruit': 376}
    """
    resp = requests.get(SRC_URL)
    records = list(csv.DictReader(resp.text.splitlines()[3:]))

    thedict = {}
    for item in records:
        itype = item['type']
        if thedict.get(itype):
            thedict[itype] += int(item['count'])
        else:
            thedict[itype] = int(item['count'])
    return thedict