On Cyber Monday Eve, Jeff Bezos revealed that Amazon may have intentions to one day deliver many of its goods by unmanned aerial vehicles through a service called Amazon Prime Air as part of an segment for the television show 60 Minutes. This notebook explores ~125k tweets from Twitter's firehose that were captured shortly after the announcement and teaches you how you can be equipped to capture interesting data within moments of announcements for your own analysis.
Let's seek to better understand the "Twitter reaction" to Amazon's announcement that drones may one day be delivering packages right to our doorsteps.
Twitter is an ideal source of data that can help you to understand the reaction to newsworthy events, because it has more than 200M active monthly users who tend to use it to frequently share short informal thoughts about anything and everything. Although Twitter offers a Search API that can be used to query for "historical data", tapping into the firehose with the Streaming API is a preferred option because it provides you the ability to acquire much larger volumes of data with keyword filters in real-time.
There are numerous options for storing the data that you acquire from the firehose. A document-oriented database such as MongoDB makes a fine choice and can provide useful APIs for filtering and analysis. However, we'll opt to simply store the tweets that we fetch from the firehose in a newline-delimited text file, because we'll use the pandas library to analyze it as opposed to relying on MongoDB or a comparable option.
Note: Should you have preferred to instead sink the data to MongoDB, the mongoexport commandline tool could have exported it to a newline delimited format that is exactly the same as what we will be writing to a file. Either way, you're covered.
There are only a few third-party packages that are required to use the code in this notebook:
You can easily install these packages in a terminal with pip install twitter pandas nltk, or you can install them from within IPython Notebook by using "Bash magic". Bash magic is just a way of running Bash commands from within a notebook as shown below where the first line of a cell prefixed with %%bash.
%%bash
pip install twitter pandas nltk
It's a lot easier to tap into Twitter's firehose than you might imagine if you're using the right library. The code below show you how to create a connection to Twitter's Streaming API and filter the firehose for tweets containing keywords. For simplicity, each tweet is saved in a newline-delimited file as a JSON document.
import io
import json
import twitter
# XXX: Go to http://twitter.com/apps/new to create an app and get values
# for these credentials that you'll need to provide in place of these
# empty string values that are defined as placeholders.
#
# See https://vimeo.com/79220146 for a short video that steps you
# through this process
#
# See https://dev.twitter.com/docs/auth/oauth for more information
# on Twitter's OAuth implementation.
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
OAUTH_TOKEN = ''
OAUTH_TOKEN_SECRET = ''
# The keyword query
QUERY = 'Amazon'
# The file to write output as newline-delimited JSON documents
OUT_FILE = QUERY + ".json"
# Authenticate to Twitter with OAuth
auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
CONSUMER_KEY, CONSUMER_SECRET)
# Create a connection to the Streaming API
twitter_stream = twitter.TwitterStream(auth=auth)
print 'Filtering the public timeline for "{0}"'.format(QUERY)
# See https://dev.twitter.com/docs/streaming-apis on keyword parameters
stream = twitter_stream.statuses.filter(track=QUERY)
# Write one tweet per line as a JSON document.
with io.open(OUT_FILE, 'w', encoding='utf-8', buffering=1) as f:
for tweet in stream:
f.write(unicode(u'{0}\n'.format(json.dumps(tweet, ensure_ascii=False))))
print tweet['text']
Assuming that you've amassed a collection of tweets from the firehose in a line-delimited format, one of the easiest ways to load the data into pandas for analysis is to build a valid JSON array of the tweets.
Note: With pandas, you will need to have an amount of working memory proportional to the amount of data that you're analyzing. For reference, it takes on the order of ~8GB of memory to analyze ~125k tweets as shown in this notebook. (Bear in mind that each tweet is roughly 5KB of text when serialized out to a file.)
import pandas as pd
# A text file with one tweet per line
DATA_FILE = "tmp/Amazon.json"
# Build a JSON array
data = "[{0}]".format(",".join([l for l in open(DATA_FILE).readlines()]))
# Create a pandas DataFrame (think: 2-dimensional table) to get a
# spreadsheet-like interface into the data
df = pd.read_json(data, orient='records')
print "Successfully imported", len(df), "tweets"
Whereas you may be used to thinking of data such as a list of dictionaries in a rows-oriented paradigm, pandas DataFrame exposes a convenient columnar view of the data that makes it easy to slice and dice by particular fields in each record. You can print the data frame to display the columnar structure and some stats about each column.
# Printing a DataFrame shows how pandas exposes a columnar view of the data
print df
Some of the items in a data frame may be null values, and these null values can wreak all kinds of havoc during analysis. Once you understand why they exist, it's wise to filter them out if possible. The null values in this collection of tweets are caused by "limit notices", which Twitter sends to tell you that you're being rate-limited. Notice in the columnar output above that the "limit" field (which is not typically part of a tweet) appears 16 times. This indicates that we received 16 limit notices and means that there are effectively 16 "rows" in our data frame that has null values for all of the fields we'd have expected to see.
Per the Streaming API guidelines, Twitter will only provide up to 1% of the total volume of the firehose, and anything beyond that is filtered out with each "limit notice" telling you how many tweets were filtered out. This means that tweets containing "Amazon" accounted for at least 1% of the total tweet volume at the time this data was being collected. The next cell shows how to "pop" off the column containing the sixteen limit notices and sum up the totals across these limit notices so that we can learn exactly how many tweets were filtered out across the aggregate.
# Observe the "limit" field that reflects "limit notices" where the streaming API
# couldn't return more than 1% of the firehose.
# See https://dev.twitter.com/docs/streaming-apis/messages#Limit_notices_limit
# Capture the limit notices by indexing into the data frame for non-null field
# containing "limit"
limit_notices = df[pd.notnull(df.limit)]
# Remove the limit notice column from the DataFrame entirely
df = df[pd.notnull(df['id'])]
print "Number of total tweets that were rate-limited", sum([ln['track'] for ln in limit_notices.limit])
print "Total number of limit notices", len(limit_notices)
From this output, we can observe that ~1k tweets were not provided out of ~125k, more than 99% of the tweets about "Amazon" were received for the time period that they were being captured. In order to learn more about the bounds of that time period, let's create a time-based index on the created_at field of each tweet so that we can perform a time-based analysis.
# Create a time-based index on the tweets for time series analysis
# on the created_at field of the existing DataFrame.
df.set_index('created_at', drop=False, inplace=True)
print "Created date/time index on tweets"
With a time-based index now in place, we can trivially do some useful things like calculate the boundaries, compute histograms, etc. Since tweets through to our filter in roughly the order in which they are created, no additional sorting should be necessary in order to compute the timeframe for this dataset; we can just slice the DataFrame like a list.
# Get a sense of the time range for the data
print "First tweet timestamp (UTC)", df['created_at'][0]
print "Last tweet timestamp (UTC) ", df['created_at'][-1]
Operations such as grouping by a time unit are also easy to accomplish and seem a logical next step. The following cell illustrates how to group by the "hour" of our data frame, which is exposed as a datetime.datetime timestamp since we now have a time-based index in place.
# Let's group the tweets by hour and look at the overall volumes with a simple
# text-based histogram
# First group by the hour
grouped = df.groupby(lambda x: x.hour)
print "Number of relevant tweets by the hour (UTC)"
print
# You can iterate over the groups and print
# out the volume of tweets for each hour
# along with a simple text-based histogram
for hour, group in grouped:
print hour, len(group), '*'*(len(group) / 1000)
Bearing in mind that we just previously learned that tweet acquisition began at 1:41 UTC and ended at 5:01 UTC, it could be helpful to further subdivide the time ranges into smaller intervals so as to increase the resolution of the extremes. Therefore, let's group into a custom interval by dividing the hour into 15-minute segments. The code is pretty much the same as before except that you call a custom function to perform the grouping; pandas takes care of the rest.
# Let's group the tweets by (hour, minute) and look at the overall volumes with a simple
# text-based histogram
def group_by_15_min_intervals(x):
if 0 <= x.minute <= 15: return (x.hour, "0-15")
elif 15 < x.minute <= 30: return (x.hour, "16-30")
elif 30 < x.minute <= 45: return (x.hour, "31-45")
else: return (x.hour, "46-00")
grouped = df.groupby(lambda x: group_by_15_min_intervals(x))
print "Number of relevant tweets by intervals (UTC)"
print
for interval, group in grouped:
print interval, len(group), "\t", '*'*(len(group) / 200)
# Since we didn't start or end precisely on an interval, let's
# slice off the extremes. This has the added benefit of also
# improving the resolution of the plot that shows the trend
plt.plot([len(group) for hour, group in grouped][1:-1])
plt.ylabel("Tweet Volume")
plt.xlabel("Time")
In addition to time-based analysis, we can do other types of analysis as well. Generally speaking, one of the first things you'll want to do when exploring new data is count things, so let's compute the Twitter accounts that authored the most tweets and compare it to the total number of unique accounts that appeared.
from collections import Counter
# The "user" field is a record (dictionary), and we can pop it off
# and then use the Series constructor to make it easy to use with pandas.
user_col = df.pop('user').apply(pd.Series)
# Get the screen name column
authors = user_col.screen_name
# And count things
authors_counter = Counter(authors.values)
# And tally the totals
print
print "Most frequent (top 25) authors of tweets"
print '\n'.join(["{0}\t{1}".format(a, f) for a, f in authors_counter.most_common(25)])
print
# Get only the unique authors
num_unique_authors = len(set(authors.values))
print "There are {0} unique authors out of {1} tweets".format(num_unique_authors, len(df))
At first glance, it would appear that there are quite a few bots accounting for a non-trivial portion of the tweet volume, and many of them appear to be Japanese! As usual, we can plot these values to get better intution about the underlying distrubution, so let's take a quick look at a frequency plot and histogram. We'll use logarithmic adjustments in both cases, so pay close attention to axis values.
# Plot by rank (sorted value) to gain intution about the shape of the distrubtion
author_freqs = sorted(authors_counter.values())
plt.loglog(author_freqs)
plt.ylabel("Num Tweets by Author")
plt.xlabel("Author Rank")
# Start a new figure
plt.figure()
# Plot a histogram to "zoom in" and increase resolution.
plt.hist(author_freqs, log=True)
plt.ylabel("Num Authors")
plt.xlabel("Num Tweets")
Although we could filter the DataFrame for coordinates (or locations in user profiles), an even simpler starting point to gain rudimentary insight about where users might be located is to inspect the language field of the tweets and compute the tallies for each language. With pandas, it's just a quick one-liner.
# What languages do authors of tweets speak? This might be a useful clue
# as to who is tweeting. (Also bear in mind the general timeframe for the
# data when interpreting these results.)
df.lang.value_counts()
A staggering number of Japanese speakers were talking about "Amazon" at the time the data was collected. Bearing in mind that it was already mid-day on Monday in Japan when it the news of the Amazon drones started to surface in the United States on Sunday evening, is this really all that surprising given Twitter's popularity in Japan?
Filtering on language also affords us to remove some noise from analysis since we can filter out only tweets in a specific language for inspection, which will be handy for some analysis on the content of the tweets themselves. Let's filter out only the 140 characters of text from tweets where the author speaks English and use some natural language processing techniques to learn more about the reaction.
# Let's just look at the content of the English tweets by extracting it
# out as a list of text
en_text = df[df['lang'] == 'en'].pop('text')
Although NLTK provides some advanced tokenization functions, let's just split the English text on white space, normalize it to lowercase, and remove some common trailing punctuation and count things to get an initial glance in to what's being talked about.
from collections import Counter
tokens = []
for txt in en_text.values:
tokens.extend([t.lower().strip(":,.") for t in txt.split()])
# Use a Counter to construct frequency tuples
tokens_counter = Counter(tokens)
# Display some of the most commonly occurring tokens
tokens_counter.most_common(50)
Not surprisingly, "amazon" is the most frequently occurring token, there are lots of retweets (actually, "quoted retweets") as evidenced by "rt", and lots of stopwords (commonly occurring words like "the", "and", etc.) at the top of the list. Let's further remove some of the noise by removing stopwords.
import nltk
# Download the stopwords list into NLTK
nltk.download('stopwords')
# Remove stopwords to decrease noise
for t in nltk.corpus.stopwords.words('english'):
tokens_counter.pop(t)
# Redisplay the data (and then some)
tokens_counter.most_common(200)
What a difference removing a little bit of noise can make! We now see much more meaningful data appear at the top of the list: drones, signs that a phrase "30 mins" (which turned out to be a possible timeframe for a Prime Air delivery by a drone according to Bezos) might appear based the appearance of "30" and "mins"/"minutes" near the top of the list), signs of another phrase "prime air" (as evidenced by "prime", "air" and the hashtag "#primeair"), references to Jeff Bezos, URLs to investigate and more!
Even though we've already learned a lot, one of the challenges with only employing crude tokenization techniques is that you aren't left with any phrases. One of the simplest ways of disocvering meaningful phrases in text is to treat the problem as one of discovering statistical collocations. NLTK provides some routines to find collocations and includes a "demo" function that's a quick one-liner.
nltk_text = nltk.Text(tokens)
nltk_text.collocations()
Even without any prior analysis on tokenization, it's pretty clear what the topis is about as evidenced by this list of collocations. But what about the context in which these phrases appear? As it turns out, NLTK supplies another handy data structure that provides some insight as to how words appear in context called a concordance. Trying out the "demo functionality" for the concordance is as simple as just calling it as shown below.
Toward the bottom of the list of commonly occurring tokens, the words "amazing" and "holy" appear. The word "amazing" is interesting, because it is usually the basis of an emotional reaction, and we're interested in examining the reaction. What about word "holy"? What might it mean? The concordance will help us to find out...
nltk_text.concordance("amazing")
print
nltk_text.concordance("holy")
It would appear that there is indeed a common thread of amazement in the data, and it's evident that @joshuatopolsky (who turns out to be Editor-in-chief of The Verge) is a commonly occurring tweet entity that warrants further investigation. Speaking of tweet entities, let's take an initial look at usernames, hashtags, and URLs by employing a simple heuristic to look for words prefixed with @, RT, #, and http to see what some of the most commonly occurring tweet entiteis are in the data.
# An crude look at tweet entities
entities = []
for txt in en_text.values:
for t in txt.split():
if t.startswith("http") or t.startswith("@") or t.startswith("#") or t.startswith("RT @"):
if not t.startswith("http"):
t = t.lower()
entities.append(t.strip(" :,"))
entities_counter = Counter(entities)
for entity, freq in entities_counter.most_common()[:100]:
print entity, freq
As you can see, there are lots of intersting tweet entities that give you helpful context for the announcement. One particularly notable observation is the appearance of "comedic accounts" such as @deathstarpr and @amazondrone near the top of the list, relaying a certain amount of humor. The tweet embedded below that references Star Wars was eventually retweeted over 1k times in response to the announcement! It wouldn't be difficult to determine how many retweets occurred just within the ~3 hour timeframe corresponding to the dataset we're using here.
First look at Amazon's new delivery drone. (Also helpful for finding Rebel bases on Hoth.) pic.twitter.com/JlFdNiHzks
— Death Star PR (@DeathStarPR) December 2, 2013
When you take a closer look at some of the developed news stories, you also see sarcasm, unbelief, and even a bit of frustration about this being a "publicity stunt" for Cyber Monday.
Note: There proper way of parsing out tweet entities from the entities field that you can see in the DataFrame. It's marginally more work but has the primary advantage that you can see the "expanded URL" which provides better insight into the nature of the URL since you'll know its domain name. See Example 9-10, Extracting Tweet Entities from Mining the Social Web for more on how to do that.
We aspired to learn more about the general reaction to Amazon's announcement about Prime Air by taking an initial look at the data from Amazon's firehose, and it's fair to say that we learned a few things about the data without too much effort. Lots more could be discovered, but a few of the themes that we were able to glean included:
Although these reactions aren't particularly surprising for such an outrageous announcement, you've hopefully learned enough that you could tap into Twitter's firehose to capture and analyze data that's of interest to you. There is no shortage of fun to be had, and as you've learned, it's easier than it might first appear.
Enjoy!
If you enjoy analyzing data from social websites like Twitter, then you might enjoy the book Mining the Social Web, 2nd Edition (O'Reilly). You can learn more about it at MiningTheSocialWeb.com. All source code is available in IPython Notebook format at GitHub and can be previewed in the IPython Notebook Viewer.
The book itself is a form of "premium support" for the source code and is available for purchase from Amazon or O'Reilly Media.