Generating probabilistic coherence with a simple Markov-based bot

Your bot doesn’t have to be limited to analyzing the one piece of data that triggered it, as when following in the footsteps of the @stealthmountain bot: A Peak of Grammar Correction with Twython

By now, you should be able to authenticate a Twitter application and make it accessible via Twython:

Creating a Twitter Application for Programmatic Access with Twython

You should also be familiar with Twython, the Twitter API, and Twitter in general:

Exploring the basics of the Twitter API with Twython

Markov and horses

One of the most famous Twitter bots is @horse_ebooks:

https://twitter.com/horse_ebooks

Assumed to be a bot whose purpose was to advertise schlocky ebooks, @horse_ebooks generated spam that, in its attempt to ostensibly seem “human,” brought the kind of intellectual delight normally reserved for highbrow art.

Soon the metaness of @horse_ebooks outgrew its output, as journalists became obsessed with finding @horse_ebooks’s creator:

http://gawker.com/5887697/how-i-found-the-human-being-behind-horseebooks-the-internets-favorite-spambot

But it was tricky. A human being behind Horse_ebooks could either intensify or diminish its myth. Horse_ebooks itself would be elevated from a dumb spam bot that had chanced into greatness to a brilliant viral marketing tool. But Horse_ebooks fans would be debased, transformed from connoisseurs of sophisticated anti-humor to the unwitting cash cows of some Russian mastermind.

Because even today, Horse_ebooks has the cold heart of a spammer. The links Horse_ebooks tweets in between its beautiful nonsense lead to pages of bullshit products—”Divorce Secrets Every Woman Should Know” was the latest—plugged into the Clickbank affiliate marketing network. Someone is making money from the sales and clicks generated by Horse_ebooks. Social media consultants obsess over cultivating “engagement” with their audience, and Horse_ebooks’ audience must be the most engaged on the web. It’s worth remembering the American programmer who bragged on his blog in 2010 about how he used a Twitter spam bot not unlike Horse_ebooks to milk Twitter users for cash through Amazon’s affiliate program.

The mystery of @horse_ebooks ended abruptly when a BuzzFeed creative director told the world that he had been manually operating @horse_ebooks, and that what everyone thought was the product of probability was in fact his attempt at performance art:

https://www.theguardian.com/media/2013/sep/24/10-reasons-buzzfeed-ruins-everything

People usually think that any art created by an algorithm loses the quality that humans bring to art. In the case of @horse_ebooks, people saw “art” in what they assumed was probability, and were angrily disappointed when it was all just someone pretending to be probability.

About Markov chains

Start with this great visual explainer of Markov Chains by Victor Powell and Lewis Lehe:

http://setosa.io/ev/markov-chains/

Markov chains, named after Andrey Markov, are mathematical systems that hop from one “state” (a situation or set of values) to another. For example, if you made a Markov chain model of a baby’s behavior, you might include “playing,” “eating,” “sleeping,” and “crying” as states, which together with other behaviors could form a “state space”: a list of all possible states. In addition, on top of the state space, a Markov chain tells you the probability of hopping, or “transitioning,” from one state to any other state—e.g., the chance that a baby currently playing will fall asleep in the next five minutes without crying first.

Here’s my lame summation:

A Markov chain produces a new “state”, or “step”, based on what its current and past states were. In terms of making human-sounding random sentences, imagine a Markov chain with the current state of “puppy”. To generate the next state, the Markov chain, based on a body of text it has trained on, has a 70% chance of choosing “barks”, a 20% chance of choosing “sleeps”, a 9% chance of choosing “chow”, and a 1% chance of choosing “dies”.
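To make that concrete, here’s a minimal sketch of that single transition step in Python. This is not how markovify implements it; the probabilities below are the hypothetical ones from the puppy example, hardcoded rather than learned from a corpus:

import random

# Hypothetical transition probabilities for the state "puppy",
# as if they had been tallied from a training corpus
TRANSITIONS = {
    'puppy': {'barks': 0.70, 'sleeps': 0.20, 'chow': 0.09, 'dies': 0.01}
}

def next_state(current_state):
    outcomes = TRANSITIONS[current_state]
    # random.choices picks one outcome, weighted by its probability
    return random.choices(list(outcomes.keys()),
                          weights=list(outcomes.values()))[0]

print(next_state('puppy'))   # 'barks', about 70% of the time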

In other words, the quality of your randomly generated phrases depends hugely on the training data you collect.

The markovify library

In fact, we’ll delegate the work of implementing the Markov model to Jeremy Singer-Vine’s markovify: https://github.com/jsvine/markovify

You can install it at the command-line via pip:

$ pip install markovify

Here’s the documented “basic usage” case:

import markovify

# Get raw text as string.
with open("/path/to/my/corpus.txt") as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

# Print five randomly-generated sentences
for i in range(5):
    print(text_model.make_sentence())

# Print three randomly-generated sentences of no more than 140 characters
for i in range(3):
    print(text_model.make_short_sentence(140))
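One caveat worth knowing before moving on: per markovify’s documentation, make_sentence() (and make_short_sentence()) will return None if the library fails to generate a sentence that passes its checks within its allotted number of tries. So if your bot must always produce something, a retry loop (my own addition, not part of markovify) is prudent:

sentence = None
# keep asking until markovify returns an actual sentence
while sentence is None:
    sentence = text_model.make_sentence()
print(sentence)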

And check out the examples of markovify in the wild:

  • SubredditSimulator, which generates random Reddit submissions and comments based on a subreddit’s previous activity: https://www.reddit.com/r/SubredditSimulator/
  • @MarkovPicard: a Twitter bot based on Star Trek: The Next Generation transcripts.
  • @RealTrumpTalk, “A bot that uses the things that @realDonaldTrump tweets to create it’s own tweets.”

Testing Markovify with Shakespeare’s sonnets

No need to think about tweeting yet; let’s just generate random text from a known corpus.

Project Gutenberg is a great place to find text. For example, here is a copy of Shakespeare’s Sonnets:

http://www.gutenberg.org/cache/epub/1041/pg1041.txt

Which I’ve mirrored here:

http://stash.compciv.org/2017/shakespeare-sonnets.txt

You should visit the file in your browser, and/or view it in your text editor. It’s not pristine text, i.e. pure prose. Here’s an excerpt from the beginning:

Posting Date: April 7, 2014 [EBook #1041]
Release Date: September, 1997
Last Updated: March 10, 2010

Language: English

*** START OF THIS PROJECT GUTENBERG EBOOK SHAKESPEARE'S SONNETS ***




Produced by Joseph S. Miller and Embry-Riddle Aeronautical
University Library. HTML version by Al Haines.


THE SONNETS

by William Shakespeare



  I

  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir might bear his memory:
  But thou, contracted to thine own bright eyes,
  Feed'st thy light's flame with self-substantial fuel,

I’m reasonably sure that the markovify library can handle whitespace. But what about the “metadata,” such as the chapter/verse numbering, or the Gutenberg disclaimer text? All text is data to the Markov bot; it’s up to us to do any additional filtering.

In the script below, I do the most rudimentary of data cleaning. I start by downloading the text with good ol’ requests.get and using the splitlines() method to get a list of lines. Then I trim that list so it starts at index 45 (i.e., line 46) and ends just before index 2670, per Python’s slice semantics.

Then I apply a simple string test, using the len() function, to keep only the lines that are longer than 30 characters.

According to the basic Markovify example, I just need to pass in a giant string, so I join the list of lines with a whitespace character. And then I let markovify do the rest:

import requests
import markovify

SRC_URL = 'http://stash.compciv.org/2017/shakespeare-sonnets.txt'

resp = requests.get(SRC_URL)
lines = resp.text.splitlines()[45:2670]

# filter out non-prose lines
proselines = []
for line in lines:
    if len(line) > 30: # ignore lines of 30 or fewer characters
        proselines.append(line.strip())

# create a string that is all lines joined together
prose = ' '.join(proselines)

# now make the model
shakespeare_model = markovify.Text(prose)

And now we generate random sentences:

>>> shakespeare_model.make_sentence()
'Take heed, dear heart, of this madding fever!'
# actual sentence:
# Take heed, dear heart, of this large privilege;

>>> shakespeare_model.make_sentence()
'When in the carcanet.'
# actual sentence:
# Or captain jewels in the carcanet.

And that’s a Markov generator in a nutshell, thanks to the hard work of Jeremy Singer-Vine and his markovify library.

Of course, thinking about the model more mathematically, and thinking more about text as a linguist does, will yield even better results (see examples in markovify’s advanced usage docs; https://github.com/jsvine/markovify#advanced-usage)...but let’s get right into the fun of generating interesting text by finding interesting text sources.
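To give just one taste of those advanced options: the Text constructor takes a state_size argument (its default is 2), which sets how many consecutive words make up a single state. A bigger state size conditions each transition on more context, which tends to produce more coherent but less novel sentences:

# each state is now a run of 3 words instead of the default 2
shakespeare_model3 = markovify.Text(prose, state_size=3)
print(shakespeare_model3.make_sentence())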

Markovify and Twitter

Trump is a popular target for Markov chains, because his form of speech is so well known thanks to his Twitter account.

The description for the @RealTrumpTalk bot is:

https://twitter.com/RealTrumpTalk

A bot that uses the things that @realDonaldTrump tweets to create it’s own tweets. (This account is NOT affiliated with Donald J. Trump for President, Inc.)

Here’s a fun tweet:

https://twitter.com/RealTrumpTalk/status/677549812900868096

Order yours now–makes a great guy & assures me that “Trump” will be authentic!

So, the first problem is to get the text of Trump tweets. We don’t even have to rely on Twitter’s API, though. We can use the stashed copy of Trump tweets I have in CSV form:

http://stash.compciv.org/2017/realdonaldtrump-tweets.csv

import csv
import markovify
import requests

SRC_URL = 'http://stash.compciv.org/2017/realdonaldtrump-tweets.csv'

resp = requests.get(SRC_URL)
tweets = list(csv.DictReader(resp.text.splitlines()))
tweettext = ' '.join([t['Text'] for t in tweets])

trumpmodel = markovify.Text(tweettext)

And run the make_short_sentence method to see what kind of fun Markov is making:

>>> trumpmodel.make_short_sentence(140)
'He is being treated very badly by the antics of Crooked Hillary Clinton!'

>>> trumpmodel.make_short_sentence(140)
'It is so dishonest.'

Since the Trump bot has already been done, let’s add multiple sources of tweets to our model.

Remember this exercise?

Solid Serialization of: Multiple User Tweets (CSV)

That tweet-parsing exercise relied on 4 sources of tweets:

http://stash.compciv.org/2017/realdonaldtrump-tweets.csv
http://stash.compciv.org/2017/hillaryclinton-tweets.csv
http://stash.compciv.org/2017/jk_rowling-tweets.csv
http://stash.compciv.org/2017/darrellissa-tweets.csv

See if you can figure this out for yourself:

import csv
import markovify
import requests

SCREENNAMES = ['realdonaldtrump', 'hillaryclinton', 'jk_rowling',
                'darrellissa']
BASE_URL = 'http://stash.compciv.org/2017/{}-tweets.csv'

bigtweettext = ""

for name in SCREENNAMES:
    url = BASE_URL.format(name)
    resp = requests.get(url)
    tweets = list(csv.DictReader(resp.text.splitlines()))
    # some rows may be missing the 'Text' field, which DictReader
    # fills in as None, so coerce every value to a string
    tweettext = ' '.join([str(t['Text']) for t in tweets])

    # add a trailing space so one account's tweets don't run into the next's
    bigtweettext += tweettext + ' '

# build the model once, after all the tweet text has been collected
multimodel = markovify.Text(bigtweettext)

And here is that multi-voiced Markov tweet maker:

>>> multimodel.make_short_sentence(140)
'#KeystoneXL #KeystoneXL will go far in fighting terror.'
>>> multimodel.make_short_sentence(140)
'This week I passed legislation to ensure the process by which Secret Service hearing with @NBCNightlyNews.'
>>> multimodel.make_short_sentence(140)
"#EarnedIt Honored to have the power of our nation's uniform."

I’ll assume you already know how to call the Twitter API and do a search, if for some reason you need fresh tweets. But you should also recognize that sending tweets is a trivial step, now that we’ve generated the content:

from twython import Twython
from time import sleep
# do your own authenticating here
client = Twython('blah', 'blah')

while True:
    tweettext = multimodel.make_short_sentence(140)
    if tweettext is None:
        continue  # make_short_sentence can come up empty; try again
    print("About to tweet:", tweettext)
    sleep(3)  # brief pause between generating and posting

    client.update_status(status=tweettext)

By now, you might have noticed that there are huge chunks of code that do important things but don’t really need to talk to each other: consider the code for generating the Markov chains versus the code for sending out tweets.

When creating a bot, you’ll likely want (or be strongly encouraged by me) to write it as separate script files, as sketched below.
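Here’s a minimal sketch of that separation, using markovify’s documented to_json() and from_json() methods to pass the model between scripts; the file name model.json and the two-script split are my own choices:

# build_model.py -- fetch the corpus, build the model, save it to disk
import csv
import markovify
import requests

SRC_URL = 'http://stash.compciv.org/2017/realdonaldtrump-tweets.csv'
resp = requests.get(SRC_URL)
tweets = list(csv.DictReader(resp.text.splitlines()))
tweettext = ' '.join([str(t['Text']) for t in tweets])

model = markovify.Text(tweettext)
with open('model.json', 'w') as f:
    f.write(model.to_json())

And the tweeting half, which never has to know how the model was built:

# send_tweet.py -- load the saved model and post a single tweet
import markovify
from twython import Twython

with open('model.json') as f:
    model = markovify.Text.from_json(f.read())

client = Twython('blah', 'blah')  # do your own authenticating here
client.update_status(status=model.make_short_sentence(140))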

Conclusion

And there you have it, a simple Markov-powered Twitter bot. I’ll leave it to you to prevent it from sounding like a Nazi.
