Understanding the Reaction to Amazon Prime Air (Or: Tapping Twitter's Firehose for Fun and Profit with pandas)

On Cyber Monday Eve, during a segment on the television show 60 Minutes, Jeff Bezos revealed that Amazon may one day deliver many of its goods by unmanned aerial vehicles through a service called Amazon Prime Air. This notebook explores ~125k tweets from Twitter's firehose that were captured shortly after the announcement and teaches you how to capture interesting data within moments of an announcement for your own analysis.

Aspire

Let's seek to better understand the "Twitter reaction" to Amazon's announcement that drones may one day be delivering packages right to our doorsteps.

Acquire

Twitter is an ideal source of data that can help you to understand the reaction to newsworthy events, because it has more than 200M active monthly users who tend to use it to frequently share short informal thoughts about anything and everything. Although Twitter offers a Search API that can be used to query for "historical data", tapping into the firehose with the Streaming API is a preferred option because it provides you the ability to acquire much larger volumes of data with keyword filters in real-time.

There are numerous options for storing the data that you acquire from the firehose. A document-oriented database such as MongoDB makes a fine choice and can provide useful APIs for filtering and analysis. However, we'll opt to simply store the tweets that we fetch from the firehose in a newline-delimited text file, because we'll use the pandas library to analyze it as opposed to relying on MongoDB or a comparable option.

Note: Should you have preferred to instead sink the data to MongoDB, the mongoexport command-line tool could have exported it to a newline-delimited format that is exactly the same as what we will be writing to a file. Either way, you're covered.

Python Dependencies

There are only a few third-party packages that are required to use the code in this notebook:

The twitter package trivializes the process of tapping into Twitter's Streaming API for easily capturing tweets from the firehose
The pandas package provides a highly-performant "spreadsheet-like interface" into large collections of records such as tweets
The nltk package provides some handy functions for processing natural language (the "140 characters" of content) in the tweets

You can easily install these packages in a terminal with pip install twitter pandas nltk, or you can install them from within IPython Notebook by using "Bash magic". Bash magic is just a way of running Bash commands from within a notebook, as shown below, where the first line of a cell is prefixed with %%bash.

%%bash

pip install twitter pandas nltk

Tapping Twitter's Firehose

It's a lot easier to tap into Twitter's firehose than you might imagine if you're using the right library. The code below shows you how to create a connection to Twitter's Streaming API and filter the firehose for tweets containing keywords. For simplicity, each tweet is saved in a newline-delimited file as a JSON document.

import io
import json
import twitter

# XXX: Go to http://twitter.com/apps/new to create an app and get values
# for these credentials that you'll need to provide in place of these
# empty string values that are defined as placeholders.
#
# See https://vimeo.com/79220146 for a short video that steps you
# through this process
#
# See https://dev.twitter.com/docs/auth/oauth for more information 
# on Twitter's OAuth implementation.

CONSUMER_KEY = ''
CONSUMER_SECRET = ''
OAUTH_TOKEN = ''
OAUTH_TOKEN_SECRET = ''

# The keyword query

QUERY = 'Amazon'

# The file to write output as newline-delimited JSON documents
OUT_FILE = QUERY + ".json"


# Authenticate to Twitter with OAuth

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)

# Create a connection to the Streaming API

twitter_stream = twitter.TwitterStream(auth=auth)


print 'Filtering the public timeline for "{0}"'.format(QUERY)

# See https://dev.twitter.com/docs/streaming-apis on keyword parameters

stream = twitter_stream.statuses.filter(track=QUERY)

# Write one tweet per line as a JSON document. 

with io.open(OUT_FILE, 'w', encoding='utf-8', buffering=1) as f:
    for tweet in stream:
        f.write(unicode(u'{0}\n'.format(json.dumps(tweet, ensure_ascii=False))))
        print tweet['text']
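
If you'd like to convince yourself that the one-tweet-per-line format round-trips cleanly before pointing the code at the live stream, a minimal sketch along the following lines may help. The sample tweets and the temporary file path are made up for illustration, and the snippet avoids print so that it isn't tied to the Python 2 syntax used in the cell above.

```python
import io
import json
import os
import tempfile

# Hypothetical stand-ins for tweets arriving from the stream
sample_tweets = [
    {"id": 1, "text": u"Amazon drones \u2708"},
    {"id": 2, "text": u"Prime Air in 30 mins?"},
]

# A throwaway location; the notebook itself writes to OUT_FILE
path = os.path.join(tempfile.mkdtemp(), "sample.json")

# Write one JSON document per line. json.dumps escapes any newline
# inside a string value, so each tweet occupies exactly one line.
with io.open(path, "w", encoding="utf-8") as f:
    for tweet in sample_tweets:
        f.write(json.dumps(tweet, ensure_ascii=False) + u"\n")

# Read the file back one line (one tweet) at a time
with io.open(path, encoding="utf-8") as f:
    restored = [json.loads(line) for line in f if line.strip()]
```

Because every line is a complete JSON document, a capture that gets interrupted costs you at most the final, partially written line.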

Analyze

Assuming that you've amassed a collection of tweets from the firehose in a line-delimited format, one of the easiest ways to load the data into pandas for analysis is to build a valid JSON array of the tweets.

Note: With pandas, you will need to have an amount of working memory proportional to the amount of data that you're analyzing. For reference, it takes on the order of ~8GB of memory to analyze ~125k tweets as shown in this notebook. (Bear in mind that each tweet is roughly 5KB of text when serialized out to a file.)

import pandas as pd

# A text file with one tweet per line

DATA_FILE = "tmp/Amazon.json"

# Build a JSON array

data = "[{0}]".format(",".join([l for l in open(DATA_FILE).readlines()]))

# Create a pandas DataFrame (think: 2-dimensional table) to get a 
# spreadsheet-like interface into the data

df = pd.read_json(data, orient='records')

print "Successfully imported", len(df), "tweets"

Successfully imported 125697 tweets
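
If the memory requirement mentioned in the note above is a problem, one possible workaround is to parse the file a line at a time and keep only the fields you plan to analyze. The sketch below is an assumption about how you might do that rather than part of the original notebook: the slim_records helper, the KEEP tuple, and the two sample lines are all hypothetical.

```python
import json

# Hypothetical choice of fields; adjust to whatever your analysis needs
KEEP = ("id", "created_at", "lang", "text")

def slim_records(lines, keep=KEEP):
    """Yield one small dict per line, keeping only the fields in `keep`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Missing fields (e.g. in limit notices) come back as None
        yield {k: record.get(k) for k in keep}

# Two sample lines standing in for the contents of the data file:
# one tweet-like record and one limit notice
sample_lines = [
    '{"id": 1, "lang": "en", "text": "drones!", "user": {"screen_name": "a"}}',
    '{"limit": {"track": 5}}',
]
records = list(slim_records(sample_lines))
```

A list of dictionaries like records can be handed straight to pd.DataFrame(records). The trade-off is that any field you leave out of KEEP (the user record, for example) is no longer available later on.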

Whereas you may be used to thinking of data such as a list of dictionaries in a row-oriented paradigm, a pandas DataFrame exposes a convenient columnar view of the data that makes it easy to slice and dice by particular fields in each record. You can print the data frame to display the columnar structure and some stats about each column.

# Printing a DataFrame shows how pandas exposes a columnar view of the data

print df

<class 'pandas.core.frame.DataFrame'>
Int64Index: 125697 entries, 0 to 125696
Data columns (total 27 columns):
_id                          125697  non-null values
contributors                 0  non-null values
coordinates                  1102  non-null values
created_at                   125681  non-null values
entities                     125681  non-null values
favorite_count               125681  non-null values
favorited                    125681  non-null values
filter_level                 125681  non-null values
geo                          1102  non-null values
id                           125681  non-null values
id_str                       125681  non-null values
in_reply_to_screen_name      10001  non-null values
in_reply_to_status_id        5927  non-null values
in_reply_to_status_id_str    5927  non-null values
in_reply_to_user_id          10001  non-null values
in_reply_to_user_id_str      10001  non-null values
lang                         125681  non-null values
limit                        16  non-null values
place                        1442  non-null values
possibly_sensitive           90143  non-null values
retweet_count                125681  non-null values
retweeted                    125681  non-null values
retweeted_status             40297  non-null values
source                       125681  non-null values
text                         125681  non-null values
truncated                    125681  non-null values
user                         125681  non-null values
dtypes: datetime64[ns](1), float64(13), object(13)

Some of the items in a data frame may be null values, and these null values can wreak all kinds of havoc during analysis. Once you understand why they exist, it's wise to filter them out if possible. The null values in this collection of tweets are caused by "limit notices", which Twitter sends to tell you that you're being rate-limited. Notice in the columnar output above that the "limit" field (which is not typically part of a tweet) appears 16 times. This indicates that we received 16 limit notices and means that there are effectively 16 "rows" in our data frame that have null values for all of the fields we'd have expected to see.

Per the Streaming API guidelines, Twitter will only provide up to 1% of the total volume of the firehose, and anything beyond that is filtered out with each "limit notice" telling you how many tweets were filtered out. This means that tweets containing "Amazon" accounted for at least 1% of the total tweet volume at the time this data was being collected. The next cell shows how to separate out the sixteen rows containing limit notices and sum up the totals across these limit notices so that we can learn how many tweets were filtered out across the aggregate.

# Observe the "limit" field that reflects "limit notices" where the streaming API
# couldn't return more than 1% of the firehose.
# See https://dev.twitter.com/docs/streaming-apis/messages#Limit_notices_limit

# Capture the limit notices by indexing into the data frame for non-null field
# containing "limit"

limit_notices = df[pd.notnull(df.limit)]

# Remove the limit notice rows from the DataFrame by keeping only
# the rows that have a tweet id

df = df[pd.notnull(df['id'])]

print "Number of total tweets that were rate-limited", sum([ln['track'] for ln in limit_notices.limit])
print "Total number of limit notices", len(limit_notices)

Number of total tweets that were rate-limited 1062
Total number of limit notices 16
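
The two totals printed above can be turned into a percentage with a little arithmetic. This sketch follows the notebook's assumption that each limit notice's track value can simply be summed; if the values are instead running totals for the connection, the true number of withheld tweets would be smaller and the share received even higher.

```python
# Figures taken from the notebook output
received = 125681   # rows with a non-null tweet id
withheld = 1062     # sum of the "track" values in the limit notices

# Everything that matched the filter, delivered or not
matched = received + withheld

# Share of matching tweets that actually made it into the file
pct_received = 100.0 * received / matched
```

Here matched works out to 126743, which puts pct_received a little above 99%.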

From this output, we can observe that ~1k tweets were not provided out of ~125k, which means that more than 99% of the tweets about "Amazon" were received for the time period in which they were being captured. In order to learn more about the bounds of that time period, let's create a time-based index on the created_at field of each tweet so that we can perform a time-based analysis.

# Create a time-based index on the tweets for time series analysis
# on the created_at field of the existing DataFrame.

df.set_index('created_at', drop=False, inplace=True)

print "Created date/time index on tweets"

Created date/time index on tweets

With a time-based index now in place, we can trivially do some useful things like calculate the boundaries, compute histograms, etc. Since tweets flow through to our filter in roughly the order in which they are created, no additional sorting should be necessary in order to compute the timeframe for this dataset; we can just slice the DataFrame like a list.

# Get a sense of the time range for the data

print "First tweet timestamp (UTC)", df['created_at'][0]
print "Last tweet timestamp (UTC) ", df['created_at'][-1]

First tweet timestamp (UTC) 2013-12-02 01:41:45
Last tweet timestamp (UTC)  2013-12-02 05:01:18
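
Since tweets only arrive in roughly chronological order, you may prefer not to rely on position at all. A minimal sketch, assuming a datetime index like the one created above, asks the index for its minimum and maximum instead. The three-row sample frame is made up and deliberately out of order.

```python
import pandas as pd

# Hypothetical miniature of the tweets DataFrame, deliberately out of order
sample = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2013-12-02 01:41:47",
        "2013-12-02 01:41:45",
        "2013-12-02 05:01:18",
    ]),
    "text": ["b", "a", "c"],
})
sample = sample.set_index("created_at", drop=False)

# min()/max() give the true bounds regardless of row order
first_ts = sample.index.min()
last_ts = sample.index.max()
```

The first row of sample is not the earliest tweet, yet first_ts still comes back as 01:41:45.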

Operations such as grouping by a time unit are also easy to accomplish and seem a logical next step. The following cell illustrates how to group by the "hour" of our data frame, which is exposed as a datetime.datetime timestamp since we now have a time-based index in place.

# Let's group the tweets by hour and look at the overall volumes with a simple
# text-based histogram

# First group by the hour

grouped = df.groupby(lambda x: x.hour)

print "Number of relevant tweets by the hour (UTC)"
print

# You can iterate over the groups and print 
# out the volume of tweets for each hour 
# along with a simple text-based histogram

for hour, group in grouped:
    print hour, len(group), '*'*(len(group) / 1000)

Number of relevant tweets by the hour (UTC)

1 14788 **************
2 43286 *******************************************
3 36582 ************************************
4 30008 ******************************
5 1017 *
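
If you only need the counts and not the groups themselves, the same tally can be sketched without a lambda by grouping on the hour attribute of the datetime index and asking for the size of each group. The four-row frame below is invented purely to keep the example self-contained.

```python
import pandas as pd

# Hypothetical sample with a datetime index, standing in for the tweets
sample = pd.DataFrame(
    {"text": ["a", "b", "c", "d"]},
    index=pd.to_datetime([
        "2013-12-02 01:45:00",
        "2013-12-02 02:05:00",
        "2013-12-02 02:50:00",
        "2013-12-02 03:10:00",
    ]),
)

# Group on the hour of each index value and count the rows per group
per_hour = sample.groupby(sample.index.hour).size()
```

The result is a Series keyed by hour, so per_hour[2] is the number of sample rows that fall in the 02:00 hour.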

Bearing in mind that we just previously learned that tweet acquisition began at 1:41 UTC and ended at 5:01 UTC, it could be helpful to further subdivide the time ranges into smaller intervals so as to increase the resolution of the extremes. Therefore, let's group into a custom interval by dividing the hour into 15-minute segments. The code is pretty much the same as before except that you call a custom function to perform the grouping; pandas takes care of the rest.

# matplotlib is needed for the trend plot at the end of this cell
import matplotlib.pyplot as plt

# Let's group the tweets by (hour, minute) and look at the overall volumes with a simple
# text-based histogram

def group_by_15_min_intervals(x):
    if   0 <= x.minute <= 15: return (x.hour, "0-15")
    elif 15 < x.minute <= 30: return (x.hour, "16-30")
    elif 30 < x.minute <= 45: return (x.hour, "31-45")
    else: return (x.hour, "46-00")


grouped = df.groupby(lambda x: group_by_15_min_intervals(x))

print "Number of relevant tweets by intervals (UTC)"
print

for interval, group in grouped:
    print interval, len(group), "\t", '*'*(len(group) / 200)

# Since we didn't start or end precisely on an interval, let's
# slice off the extremes. This has the added benefit of also
# improving the resolution of the plot that shows the trend
plt.plot([len(group) for hour, group in grouped][1:-1])
plt.ylabel("Tweet Volume")
plt.xlabel("Time")

Number of relevant tweets by intervals (UTC)

(1, '31-45') 2875 	**************
(1, '46-00') 11913 	***********************************************************
(2, '0-15') 13611 	********************************************************************
(2, '16-30') 11265 	********************************************************
(2, '31-45') 10452 	****************************************************
(2, '46-00') 7958 	***************************************
(3, '0-15') 10386 	***************************************************
(3, '16-30') 9542 	***********************************************
(3, '31-45') 8727 	*******************************************
(3, '46-00') 7927 	***************************************
(4, '0-15') 9042 	*********************************************
(4, '16-30') 7543 	*************************************
(4, '31-45') 7074 	***********************************
(4, '46-00') 6349 	*******************************
(5, '0-15') 1017 	*****

<matplotlib.text.Text at 0x1e9a9d50>
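
One detail worth knowing about the grouping function above is that its first bucket covers minutes 0 through 15, so it is one minute wider than the others. If you want four equal buckets per hour, integer division is one way to get them. The quarter_hour helper and the sample timestamps below are hypothetical; the helper labels each bucket by its starting minute rather than by a range string.

```python
from collections import Counter
from datetime import datetime

def quarter_hour(ts):
    """Map a timestamp to (hour, starting minute of its 15-minute bucket)."""
    return (ts.hour, (ts.minute // 15) * 15)

# Hypothetical timestamps standing in for the DataFrame's index values
stamps = [
    datetime(2013, 12, 2, 1, 41, 45),
    datetime(2013, 12, 2, 1, 45, 0),
    datetime(2013, 12, 2, 2, 15, 0),
    datetime(2013, 12, 2, 2, 29, 59),
]

# Tally the number of timestamps per bucket
volume = Counter(quarter_hour(ts) for ts in stamps)
```

Note that minute 45 lands in the bucket starting at 45 here, whereas the original function places it in "31-45". A function like this can be passed to df.groupby in the same way as the one above, since it only relies on the hour and minute attributes of each index value.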

In addition to time-based analysis, we can do other types of analysis as well. Generally speaking, one of the first things you'll want to do when exploring new data is count things, so let's find the Twitter accounts that authored the most tweets and compare their volume to the total number of unique accounts that appeared.

from collections import Counter

# The "user" field is a record (dictionary), and we can pop it off
# and then use the Series constructor to make it easy to use with pandas.

user_col = df.pop('user').apply(pd.Series)

# Get the screen name column
authors = user_col.screen_name

# And count things
authors_counter = Counter(authors.values)

# And tally the totals

print
print "Most frequent (top 25) authors of tweets"
print '\n'.join(["{0}\t{1}".format(a, f) for a, f in authors_counter.most_common(25)])
print

# Get only the unique authors

num_unique_authors = len(set(authors.values))
print "There are {0} unique authors out of {1} tweets".format(num_unique_authors, len(df))

Most frequent (top 25) authors of tweets
_net_shop_	165
PC_shop_japan	161
work_black	160
house_book_jp	160
bousui_jp	147
Popular_goods	147
pachisuro_777	147
sweets_shop	146
bestshop_goods	146
__electronics__	142
realtime_trend	141
gardening_jp	140
shikaku_book	139
supplement_	139
__travel__	138
disc_jockey_jp	138
water_summer_go	138
Jungle_jp	137
necessaries_jp	137
marry_for_love	137
trend_realtime	136
sparkler_jp	136
PandoraQQ	133
flypfox	133
Promo_Culturel	132

There are 71794 unique authors out of 125681 tweets
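
The Counter approach works well, and pandas can produce the same tallies directly from a Series if you'd rather stay inside the DataFrame world. Here is a small sketch on made-up screen names; the authors Series below is a stand-in for the screen_name column.

```python
import pandas as pd

# Hypothetical screen names standing in for user_col.screen_name
authors = pd.Series(["a", "b", "a", "c", "a", "b"])

# value_counts() sorts by frequency, most frequent first
top = authors.value_counts().head(2)

# nunique() counts the distinct values
n_unique = authors.nunique()
```

On the real data, head(25) would correspond to the top-25 listing above.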

At first glance, it would appear that there are quite a few bots accounting for a non-trivial portion of the tweet volume, and many of them appear to be Japanese! As usual, we can plot these values to get better intuition about the underlying distribution, so let's take a quick look at a frequency plot and histogram. We'll use logarithmic adjustments in both cases, so pay close attention to axis values.

# Plot by rank (sorted value) to gain intuition about the shape of the distribution

author_freqs = sorted(authors_counter.values())

plt.loglog(author_freqs)
plt.ylabel("Num Tweets by Author")
plt.xlabel("Author Rank")

# Start a new figure

plt.figure()

# Plot a histogram to "zoom in" and increase resolution.

plt.hist(author_freqs, log=True)
plt.ylabel("Num Authors")
plt.xlabel("Num Tweets")

<matplotlib.text.Text at 0x21c29fd0>

Although we could filter the DataFrame for coordinates (or locations in user profiles), an even simpler starting point to gain rudimentary insight about where users might be located is to inspect the language field of the tweets and compute the tallies for each language. With pandas, it's just a quick one-liner.

# What languages do authors of tweets speak? This might be a useful clue
# as to who is tweeting. (Also bear in mind the general timeframe for the 
# data when interpreting these results.)

df.lang.value_counts()

en     79151
ja     35940
und     3197
es      2713
de      1142
fr       717
id       442
pt       434
ko       283
vi       248
nl       212
th       209
zh       135
sk       114
ru        84
da        73
it        65
sl        65
pl        64
ht        63
et        56
tr        53
tl        43
ar        38
lt        30
no        17
lv        16
fi        15
hu        13
sv        12
bg         8
ne         7
el         5
he         5
fa         4
uk         3
my         2
is         2
ta         1
dtype: int64
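
Raw counts are easier to interpret as shares of the whole. As a quick worked example using only figures from the outputs above (125681 tweets with a language tag, 79151 of them English and 35940 Japanese):

```python
# Figures taken from the notebook output
total = 125681                      # tweets with a non-null "lang" value
counts = {"en": 79151, "ja": 35940}

# Percentage share of each language, rounded to one decimal place
shares = {lang: round(100.0 * n / total, 1) for lang, n in counts.items()}
```

English and Japanese together account for roughly nine out of every ten tweets in the collection. With pandas, df.lang.value_counts(normalize=True) should give you the same kind of proportions for every language at once.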

A staggering number of Japanese speakers were talking about "Amazon" at the time the data was collected. Bearing in mind that it was already mid-day on Monday in Japan when the news of the Amazon drones started to surface in the United States on Sunday evening, is this really all that surprising given Twitter's popularity in Japan?

Filtering on language also lets us remove some noise from the analysis, since we can select only the tweets in a specific language for inspection, which will be handy for some analysis on the content of the tweets themselves. Let's extract just the 140 characters of text from tweets where the author speaks English and use some natural language processing techniques to learn more about the reaction.

# Let's just look at the content of the English tweets by extracting it
# out as a list of text

en_text = df[df['lang'] == 'en'].pop('text')

Although NLTK provides some advanced tokenization functions, let's just split the English text on white space, normalize it to lowercase, remove some common trailing punctuation, and count things to get an initial glance into what's being talked about.

from collections import Counter

tokens = []
for txt in en_text.values:
    tokens.extend([t.lower().strip(":,.") for t in txt.split()])
    
# Use a Counter to construct frequency tuples
tokens_counter = Counter(tokens)

# Display some of the most commonly occurring tokens
tokens_counter.most_common(50)

[(u'amazon', 54778),
 (u'rt', 36409),
 (u'the', 25749),
 (u'drones', 24474),
 (u'to', 24040),
 (u'a', 21341),
 (u'delivery', 18557),
 (u'in', 17531),
 (u'of', 15741),
 (u'on', 14095),
 (u'drone', 13800),
 (u'by', 13422),
 (u'is', 12034),
 (u'for', 10988),
 (u'@amazon', 9318),
 (u'i', 9263),
 (u'and', 8793),
 (u'prime', 8783),
 (u'30', 8319),
 (u'air', 8026),
 (u'with', 7956),
 (u'future', 7911),
 (u'deliver', 7890),
 (u'get', 6790),
 (u'you', 6573),
 (u'your', 6543),
 (u'via', 6444),
 (u'deliveries', 6432),
 (u'this', 5899),
 (u'bezos', 5738),
 (u'will', 5703),
 (u'#primeair', 5680),
 (u'unmanned', 5442),
 (u'aerial', 5313),
 (u'under', 5308),
 (u'-', 5257),
 (u'mins', 5199),
 (u'that', 4890),
 (u'vehicles', 4835),
 (u'my', 4728),
 (u'from', 4720),
 (u'peek', 4699),
 (u'sneak', 4684),
 (u'unveils', 4555),
 (u'it', 4473),
 (u'minutes', 4459),
 (u'just', 4396),
 (u'at', 4394),
 (u'http://t.c\u2026', 4391),
 (u'packages', 4302)]
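
The crude tokenizer only strips colons, commas, and periods, which is why tokens such as "-" show up in the counts. A slightly more thorough sketch, still short of NLTK's tokenizers, strips all ASCII punctuation from the edges of each token while leaving "@" and "#" alone so that mentions and hashtags survive. The tokenize helper, EDGE_CHARS, and the sample strings are hypothetical.

```python
import string
from collections import Counter

# Characters to strip from token edges: ASCII punctuation except "@" and "#"
# (which mark mentions and hashtags), plus the ellipsis character that
# Twitter uses when truncating text
EDGE_CHARS = "".join(c for c in string.punctuation if c not in "@#") + u"\u2026"

def tokenize(text):
    """Split on whitespace, lowercase, trim edge punctuation, drop empties."""
    tokens = (t.lower().strip(EDGE_CHARS) for t in text.split())
    return [t for t in tokens if t]

# Made-up examples in the style of the tweets
sample = [u"RT @amazon: Drones, drones!", u"'60 Minutes' (video) - #PrimeAir."]
counts = Counter(t for txt in sample for t in tokenize(txt))
```

With this approach "drones!" and "Drones," both count toward "drones", and a lone "-" disappears entirely. Only the edges of a token are touched, so the interior of a URL is left alone, although lowercasing still alters case-sensitive short links just as it does in the cell above.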

Not surprisingly, "amazon" is the most frequently occurring token, there are lots of retweets (actually, "quoted retweets") as evidenced by "rt", and lots of stopwords (commonly occurring words like "the", "and", etc.) at the top of the list. Let's further remove some of the noise by removing stopwords.

import nltk

# Download the stopwords list into NLTK

nltk.download('stopwords')

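# Remove stopwords to decrease noise. Passing a default to pop() avoids
# a KeyError for any stopword that never appeared in the tweets.
for t in nltk.corpus.stopwords.words('english'):
    tokens_counter.pop(t, None)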
    
# Redisplay the data (and then some)
tokens_counter.most_common(200)

[nltk_data] Downloading package 'stopwords' to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

[(u'amazon', 54778),
 (u'rt', 36409),
 (u'drones', 24474),
 (u'delivery', 18557),
 (u'drone', 13800),
 (u'@amazon', 9318),
 (u'prime', 8783),
 (u'30', 8319),
 (u'air', 8026),
 (u'future', 7911),
 (u'deliver', 7890),
 (u'get', 6790),
 (u'via', 6444),
 (u'deliveries', 6432),
 (u'bezos', 5738),
 (u'#primeair', 5680),
 (u'unmanned', 5442),
 (u'aerial', 5313),
 (u'-', 5257),
 (u'mins', 5199),
 (u'vehicles', 4835),
 (u'peek', 4699),
 (u'sneak', 4684),
 (u'unveils', 4555),
 (u'minutes', 4459),
 (u'http://t.c\u2026', 4391),
 (u'packages', 4302),
 (u'jeff', 4040),
 (u'http://t.co/w6kugw4egt', 3922),
 (u"amazon's", 3669),
 (u'flying', 3599),
 (u'ceo', 3205),
 (u'#amazon', 3074),
 (u'new', 2870),
 (u'free', 2797),
 (u'testing', 2585),
 (u'could', 2568),
 (u'shipping', 2541),
 (u'', 2422),
 (u'says', 2343),
 (u"'60", 2324),
 (u'like', 2300),
 (u'stuff', 2263),
 (u'years', 2194),
 (u'60', 2157),
 (u'use', 2134),
 (u'using', 1939),
 (u'&amp;', 1901),
 (u"minutes'", 1868),
 (u'kindle', 1735),
 (u"it's", 1657),
 (u'plans', 1655),
 (u'cyber', 1622),
 (u'one', 1617),
 (u'gift', 1614),
 (u"i'm", 1604),
 (u'monday', 1568),
 (u'wants', 1538),
 (u'first', 1522),
 (u'order', 1519),
 (u'good', 1479),
 (u'going', 1459),
 (u'package', 1446),
 (u'fire', 1400),
 (u'look', 1386),
 (u'plan', 1378),
 (u'4', 1377),
 (u'delivering', 1376),
 (u'@60minutes', 1371),
 (u'make', 1369),
 (u'experimenting', 1341),
 (u'30-minute', 1336),
 (u'book', 1330),
 (u'primeair', 1310),
 (u'real', 1285),
 (u'online', 1274),
 (u'coming', 1261),
 (u'think', 1195),
 (u'see', 1152),
 (u'video', 1149),
 (u'next', 1149),
 (u'would', 1135),
 (u'system', 1131),
 (u'service', 1115),
 (u'thing', 1099),
 (u'something', 1069),
 (u'hour', 1052),
 (u'black', 1043),
 (u'card', 1040),
 (u'half', 1033),
 (u'want', 1018),
 (u'half-hour', 1016),
 (u'futuristic', 1016),
 (u"you're", 998),
 (u'know', 987),
 (u'love', 985),
 (u'people', 964),
 (u'aims', 964),
 (u'(video)', 958),
 (u'day', 954),
 (u'shot', 936),
 (u'deploy', 921),
 (u'delivered', 919),
 (u'amazon\u2019s', 906),
 (u'basically', 902),
 (u'within', 888),
 (u'shop', 886),
 (u'really', 882),
 (u'buy', 876),
 (u'check', 859),
 (u'\u2026', 855),
 (u'us', 844),
 (u'time', 829),
 (u'autonomous', 817),
 (u'wait', 815),
 (u'right', 801),
 (u'@mashable', 787),
 (u'finding', 786),
 (u'go', 780),
 (u'2015', 779),
 (u"can't", 774),
 (u'@buzzfeed', 774),
 (u'top', 774),
 (u'cool', 770),
 (u'rebel', 767),
 (u'@amazondrone', 766),
 (u'(also', 762),
 (u'helpful', 762),
 (u'#drones', 761),
 (u'rifle', 759),
 (u'reveals', 759),
 (u'door', 755),
 (u'bases', 752),
 (u'store', 751),
 (u'hoth.)', 748),
 (u'shit', 748),
 (u'@bradplumer', 743),
 (u'waiting', 743),
 (u'looks', 735),
 (u'@deathstarpr', 735),
 (u"don't", 732),
 (u'5', 731),
 (u'win', 731),
 (u'floats', 717),
 (u'friday', 717),
 (u'way', 717),
 (u'great', 713),
 (u'http://t.co/jlfdnihzks', 711),
 (u'company', 710),
 (u'need', 709),
 (u'read', 704),
 (u'home', 704),
 (u'watch', 697),
 (u'moment', 691),
 (u'http://t.co/bxsavvzxzf', 690),
 (u'best', 685),
 (u'notion', 680),
 (u'news', 669),
 (u'blog', 669),
 (u'announces', 667),
 (u'got', 658),
 (u'$25', 654),
 (u'products', 646),
 (u'big', 645),
 (u'still', 642),
 (u'2', 642),
 (u'gonna', 642),
 (u'tip', 636),
 (u'sales', 623),
 (u'awkward', 618),
 (u'"amazon', 617),
 (u'idea', 609),
 (u'take', 604),
 (u'working', 600),
 (u'books', 597),
 (u"won't", 597),
 (u'hovers', 593),
 (u'wow', 589),
 (u'live', 587),
 (u'promises', 579),
 (u'back', 576),
 (u'package-delivery', 573),
 (u'@badbanana', 570),
 (u'soon', 563),
 (u'deals', 560),
 (u'+', 558),
 (u'work', 555),
 (u'ever', 552),
 (u"'octocopter'", 551),
 (u'$50', 549),
 (u'hit', 549),
 (u'holy', 546),
 (u'night', 537),
 (u'hdx', 535),
 (u'today', 526),
 (u'bits', 521),
 (u'many', 520),
 (u'awesome', 519),
 (u'amazing', 508),
 (u'window', 506)]

What a difference removing a little bit of noise can make! We now see much more meaningful data appear at the top of the list: drones, signs that a phrase "30 mins" (which turned out to be a possible timeframe for a Prime Air delivery by a drone according to Bezos) might appear based on the appearance of "30" and "mins"/"minutes" near the top of the list, signs of another phrase "prime air" (as evidenced by "prime", "air" and the hashtag "#primeair"), references to Jeff Bezos, URLs to investigate, and more!

Even though we've already learned a lot, one of the challenges with only employing crude tokenization techniques is that you aren't left with any phrases. One of the simplest ways of discovering meaningful phrases in text is to treat the problem as one of discovering statistical collocations. NLTK provides some routines to find collocations and includes a "demo" function that's a quick one-liner.

nltk_text = nltk.Text(tokens)
nltk_text.collocations()

Building collocations list
prime air; sneak peek; unmanned aerial; aerial vehicles;
http://t.co/w6kugw4egt http://t.c…; vehicles http://t.co/w6kugw4egt;
#primeair future; future deliveries; delivery drones; jeff bezos;
@amazon get; amazon prime; '60 minutes'; amazon unveils; cyber monday;
deliver packages; flying delivery; unveils flying; kindle fire; (also
helpful
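
To get a feel for what sits underneath a collocation finder, it can help to count adjacent word pairs (bigrams) yourself. The sketch below ranks pairs by raw frequency only; NLTK's routines go further by scoring pairs statistically, so treat this as an illustration of the idea rather than a reimplementation. The top_bigrams helper and the sample tokens are made up.

```python
from collections import Counter

def top_bigrams(tokens, n=3):
    """Return the n most frequent adjacent token pairs."""
    # zip(tokens, tokens[1:]) pairs each token with its successor
    return Counter(zip(tokens, tokens[1:])).most_common(n)

# A made-up token stream in the style of the tweets
tokens = "amazon prime air sneak peek of amazon prime air".split()
bigrams = top_bigrams(tokens, n=2)
```

In this sample the pairs ("amazon", "prime") and ("prime", "air") each occur twice and come out on top. Raw counts tend to favor pairs of very common words, which is one reason a statistical score is usually preferred on real data.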

Even without any prior analysis on tokenization, it's pretty clear what the topic is, as evidenced by this list of collocations. But what about the context in which these phrases appear? As it turns out, NLTK supplies another handy data structure called a concordance that provides some insight as to how words appear in context. Trying out the "demo functionality" for the concordance is as simple as just calling it as shown below.

Toward the bottom of the list of commonly occurring tokens, the words "amazing" and "holy" appear. The word "amazing" is interesting, because it is usually the basis of an emotional reaction, and we're interested in examining the reaction. What about the word "holy"? What might it mean? The concordance will help us to find out...

nltk_text.concordance("amazing")
print
nltk_text.concordance("holy")

Building index...
Displaying 25 of 508 matches:
s - @variety http://t.c… this looks amazing how will it impact drone traffic? -
it? amazon prime air delivery looks amazing http://t.co/icizw5vfrw rt @jeffreyg
gift card? @budgetearth &amp; other amazing bloggers are giving one away! ends 
k? damn that amazon prime air looks amazing im sure it would freak out some peo
egt http://t.c… @munilass amazon is amazing for what i use it for i'm okay with
wwglyqrq just in bonnie sold again! amazing book http://t.co/jce902iros #best-s
ase of 1000) http://t.co/d6l8p3jgbz amazing prospects! “@brianstelter on heels 
riety http://t.c… rt @dangillmor by amazing coincidence amazon had a youtube dr
rd_ferguson amazon prime air sounds amazing *hot* kindle fire hdx wi-fi 7' tabl
t.co/hrgxtrlumx this is going to be amazing if the faa allows it welcome to the
lying grill #primeair is absolutely amazing i'm excited to see the progress/dev
.co/w6kugw4egt http://t.c… the most amazing thing to me about amazon - when bez
//t.co/cad9zload3 rt @dangillmor by amazing coincidence amazon had a youtube dr
that 60 minutes piece on amazon was amazing what an incredible company and deli
 jesus christ this is real and it’s amazing erohmygerd http://t.co/m4salqm0lo r
/t.co/0trwr9qsoc rt @semil the most amazing thing to me about amazon - when bez
yeah no this @amazon drone stuff is amazing me but i have the same questions as
hqfg… 30 minutes delivery by amazon amazing http://t.co/ofu39suiov i really don
eat show at #60minutes amazon is an amazing company! rt @zachpagano next year d
on's future drone shipping tech was amazing to see amazon unveils futuristic pl
the first review on this product is amazing http://t.co/20yn3jguxu rt @amazon g
ttp://t.co/s2shyp48go this would be amazing  jeff bezos promises half-hour ship
wugrxq2oju have you guys seen these amazing steals on amazon?? wow!! some of my
ttp://t.co/mhqfg… rt @dangillmor by amazing coincidence amazon had a youtube dr
bezo http://t.co/2jt6pgn8an this is amazing rt @rdizzle7 rt @marquisgod dog rt 

Displaying 25 of 546 matches:
 @brocanadian http://t.co/zxyct2renf holy shit rt @amazon get a sneak peek of 
our shipping with amazon prime air - holy cow this is the future - http://t.co
eo) http://t.co/hi3gviwto7 #technews holy shit wtf http://t.co/p3h2wn5pba awes
es'! (other 1 suggested was usa 1!!) holy shit jeff bezos promises half-hour s
es http://t.co/k… rt @joshuatopolsky holy shit jeff bezos promises half-hour s
//t.co/tjdtdpkaaf rt @joshuatopolsky holy shit jeff bezos promises half-hour s
//t.co/0gpvgsyatm rt @joshuatopolsky holy shit jeff bezos promises half-hour s
 when amazon prime air is available? holy shit very funny! @amazon rt tim sied
w4egt http://t.c… rt @joshuatopolsky holy shit jeff bezos promises half-hour s
drones rt @alexpenn amazon prime air holy shit http://t.co/g2b7dumgbl amazon i
ijk0 via @oliryan rt @joshuatopolsky holy shit jeff bezos promises half-hour s
s http://t.co/w6kugw4egt http://t.c… holy shit amazon what? https://t.co/qrhkx
w4egt http://t.c… rt @joshuatopolsky holy shit jeff bezos promises half-hour s
//t.co/zggekdoyhv rt @joshuatopolsky holy shit jeff bezos promises half-hour s
 me for free? http://t.co/euutqyuoox holy crap @60minutes by using drones amaz
//t.c… rt @alexpenn amazon prime air holy shit http://t.co/g2b7dumgbl amazon i
one #primeair http://t.co/jgflwmcth5 holy fucking shit this is badass watch th
d in a back yard? rt @joshuatopolsky holy shit jeff bezos promises half-hour s
 @brocanadian http://t.co/zxyct2renf holy shit of course! “@verge delivery dro
how many lawyers… rt @joshuatopolsky holy shit jeff bezos promises half-hour s
#business #market rt @joshuatopolsky holy shit jeff bezos promises half-hour s
ke from amazon ;d rt @joshuatopolsky holy shit jeff bezos promises half-hour s
 fan of commercials rt @maryforbes14 holy crap @60minutes by using drones amaz
//t.co/lefbeec5ht rt @joshuatopolsky holy shit jeff bezos promises half-hour s
g each other down rt @joshuatopolsky holy shit jeff bezos promises half-hour s
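
A concordance is simple enough to sketch by hand, which can be useful when you want the matches as data rather than as printed output. The version below works on a window of whole tokens, whereas the NLTK output above is aligned by character width. The function name mirrors NLTK's for readability, but this is a hypothetical stand-alone helper.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context per side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [tok] + right))
    return hits

# A made-up token stream in the style of the tweets
tokens = "holy cow this is the future and it looks amazing to me".split()
lines = concordance(tokens, "amazing", width=2)
```

Because the result is a plain list of strings, you could feed it to a Counter to see which contexts repeat, which is one way to spot a heavily retweeted status like the one quoted in the "holy" matches.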

It would appear that there is indeed a common thread of amazement in the data, and it's evident that @joshuatopolsky (who turns out to be Editor-in-chief of The Verge) is a commonly occurring tweet entity that warrants further investigation. Speaking of tweet entities, let's take an initial look at usernames, hashtags, and URLs by employing a simple heuristic to look for words prefixed with @, RT, #, and http to see what some of the most commonly occurring tweet entities are in the data.

# A crude look at tweet entities

entities = []
for txt in en_text.values:
    for t in txt.split():
        # Tokens produced by split() never contain spaces, so a check for
        # "RT @" could never match; retweets surface via their @mentions
        if t.startswith(("http", "@", "#")):
            if not t.startswith("http"):
                t = t.lower()
            entities.append(t.strip(" :,"))

entities_counter = Counter(entities)
for entity, freq in entities_counter.most_common()[:100]:
    print entity, freq

@amazon 8994
#primeair. 4767
http://t.c… 4391
http://t.co/w6kugw4EGt 3922
#amazon 3032
@60minutes 1325
#primeair 911
@mashable 787
@buzzfeed 774
@amazondrone 743
@bradplumer 743
@deathstarpr 735
#drones 729
http://t.co/JlFdNiHzks 711
http://t.co/BxSAVVzXZf 690
@badbanana 570
#kindle 467
@thenextweb 458
#amexamazon 441
http://t.co/MHqFG… 434
#giveaway 421
http:/… 417
#win 409
http:… 406
@techcrunch 391
#drone 383
#60minutes 380
http://t… 357
#tech 342
@levie 340
@variety 337
@breakingnews 331
@youtube 326
#cybermonday 325
@huffposttech 322
http://… 320
@jonrussell 304
@realjohngreen 300
#news 298
http://t.co/FNndPuMouA 294
@washingtonpost 284
@kotaku 283
@usatoday 283
http://t.… 280
#amazondrones 278
@nycjim 277
http://t.co/NG8… 270
http://t.co/rUu14XpvGo 270
@brianstelter 268
@majornelson 260
@benbadler 258
http://t.co/M7kqd26jVR 255
http… 254
@businessinsider 249
@huffingtonpost 245
http://t.co/DOEjXCC1vL 241
@sai 241
http://t.co/… 240
@verge 237
http://t.co/tAVunIXxo3 230
http://t.co/OqAZXaZdh9 223
http://t.co/sMBvNrasSN 221
#amazonprimeair 221
@buzzfeednews 214
@mattsinger 211
@ericstangel 211
#1 205
@byjasonng 200
#free 198
http://t.co/GQCvN0xZ7n 193
@americanexpress 190
#csrracing 183
@nickkristof 178
@orgasmicgomez 170
http://t.co/REJl0wLImt 168
http://t.co/zUWaeRjFC8 167
#ebook 165
http://t.co/4zk01TUugX 165
@joshuatopolsky 161
@percival 161
@lanceulanoff 160
@time 158
http://t.co/xQCjG6VQyb 157
#romance 154
#technology 154
#rt 148
@engadget 145
@arstechnica 142
@sapnam 142
http://t.co/JxBSx8rLBZ 141
http://t.co/IyV1qBhtJg 141
@youranonnews 139
@gizmodo 138
@abc 135
@mckaycoppins 133
http://t.co/zGgEkdOyhv 133
http://t.co/9ZBYtKpHce 132
@newsbreaker 132
http://t.co/z5JQkD4svO 132
@ 131
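
Notice that "#primeair." and "#primeair" are tallied separately above because the trailing period isn't stripped, and that a bare "@" sneaks in as well. A regular expression is one way to tighten the heuristic. The pattern and helper below are a sketch under the assumption that mentions and hashtags consist of word characters only; they aren't taken from the original notebook.

```python
import re
from collections import Counter

# Mentions and hashtags: "@" or "#" followed by one or more word characters.
# URLs: "http" or "https" up to the next whitespace.
ENTITY_RE = re.compile(r"[@#]\w+|https?://\S+")

def extract_entities(text):
    """Return mentions and hashtags (lowercased) and URLs (case preserved)."""
    found = []
    for match in ENTITY_RE.findall(text):
        if match.startswith("http"):
            # Trim punctuation that commonly trails a URL in running text
            found.append(match.rstrip(".,:;!?)"))
        else:
            found.append(match.lower())
    return found

# Made-up examples in the style of the tweets
sample = [
    u"RT @Amazon: the future of delivery #PrimeAir. http://t.co/w6kugw4EGt",
    u"#primeair looks amazing, @amazon!",
]
counts = Counter(e for txt in sample for e in extract_entities(txt))
```

With this pattern both spellings of the hashtag collapse into "#primeair", and a lone "@" no longer counts as an entity.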

As you can see, there are lots of interesting tweet entities that give you helpful context for the announcement. One particularly notable observation is the appearance of "comedic accounts" such as @deathstarpr and @amazondrone near the top of the list, relaying a certain amount of humor. The tweet embedded below that references Star Wars was eventually retweeted over 1k times in response to the announcement! It wouldn't be difficult to determine how many retweets occurred just within the ~3 hour timeframe corresponding to the dataset we're using here.

First look at Amazon's new delivery drone. (Also helpful for finding Rebel bases on Hoth.) pic.twitter.com/JlFdNiHzks
— Death Star PR (@DeathStarPR) December 2, 2013

When you take a closer look at some of the developed news stories, you also see sarcasm, disbelief, and even a bit of frustration about this being a "publicity stunt" for Cyber Monday.

Note: There is a proper way of parsing out tweet entities from the entities field that you can see in the DataFrame. It's marginally more work but has the primary advantage that you can see the "expanded URL", which provides better insight into the nature of the URL since you'll know its domain name. See Example 9-10, Extracting Tweet Entities from Mining the Social Web for more on how to do that.
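
As a rough idea of what reading the entities field looks like, here is a minimal sketch (not the book's example) that pulls hashtags, mentions, and expanded URLs out of a single tweet. The tweet dictionary is a trimmed, made-up record and the tweet_entities helper is hypothetical.

```python
# A trimmed, hypothetical tweet showing only the parts of the
# "entities" field used here
tweet = {
    "text": "Drones! http://t.co/xyz #PrimeAir @amazon",
    "entities": {
        "hashtags": [{"text": "PrimeAir"}],
        "user_mentions": [{"screen_name": "amazon"}],
        "urls": [{"url": "http://t.co/xyz",
                  "expanded_url": "http://example.com/drones"}],
    },
}

def tweet_entities(tweet):
    """Collect hashtags, mentions, and expanded URLs from one tweet."""
    # Limit notices and other non-tweet records have no "entities" field
    entities = tweet.get("entities") or {}
    return {
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
    }

parsed = tweet_entities(tweet)
```

Because the expanded URL carries the real domain name, counting domains instead of shortened t.co links becomes straightforward.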

Summarize

We aspired to learn more about the general reaction to Amazon's announcement about Prime Air by taking an initial look at the data from Twitter's firehose, and it's fair to say that we learned a few things about the data without too much effort. Lots more could be discovered, but a few of the themes that we were able to glean included:

Amazement
Humor
Disbelief

Although these reactions aren't particularly surprising for such an outrageous announcement, you've hopefully learned enough that you could tap into Twitter's firehose to capture and analyze data that's of interest to you. There is no shortage of fun to be had, and as you've learned, it's easier than it might first appear.

Enjoy!

Recommended Resources

If you enjoy analyzing data from social websites like Twitter, then you might enjoy the book Mining the Social Web, 2nd Edition (O'Reilly). You can learn more about it at MiningTheSocialWeb.com. All source code is available in IPython Notebook format at GitHub and can be previewed in the IPython Notebook Viewer.

The book itself is a form of "premium support" for the source code and is available for purchase from Amazon or O'Reilly Media.