Twitter is what’s happening and what people are talking about right now, with hundreds of millions of Tweets sent each day. We’re a group of data scientists on the Twitter Data team who are helping people do more with this vast amount of data in less time. In this spirit, we are starting a series of tutorials that aim to help people work with Twitter data effectively. Each of the posts in this series centers around a real-life example project and provides MIT-licensed code that you can use to bootstrap your projects with our enterprise and premium API products. We hope this series is fruitful for you and we are excited to see what you’ll build.
In this inaugural post, we’ll introduce data collection, filtering, parsing, and summarizing. We want to give readers an easy way to spin up on the Twitter Search APIs, tips for quickly examining data, and guidance on one of the most difficult and most overlooked Twitter data challenges: making sure you get the right Tweets for the job.
This post is not meant to be a tutorial in Python or the PyData ecosystem and assumes that readers have a reasonable amount of technical sophistication, such that they could find the right Tweets for their use case in any language or framework. This tutorial uses Python because our group makes heavy use of the PyData stack (python, pandas, numpy, etc.), but a technical user could follow along in a programming language of their choice.
Typically you will start with a very high-level question that requires refinement to understand what data might be useful to you in answering that question. For this post, let’s start with a broad question:
“I want to understand airline customers while they fly.”
We’ll begin our refinement process from here.
Understanding the business need behind a question is paramount. Use it to understand where it is important to spend extra time being exact, and where an estimate will suffice. Also use “why” to understand when the analysis needs to be complete: the best analysis is not useful if it is finished and presented a month after the decision is made.
Use this question to help define where to begin with the analysis and which data to pull. To understand airline customers’ behavior while they are flying and getting ready to fly, we are interested in people actually in an airport or on a plane, but we are not interested in people simply talking about airlines (sharing news stories about an airline, for instance).
Deciding on a relevant timeframe is all about the business use case: if you’re tracking a trend over a long period of time, you may need years of data; if you’re looking at the audience of a single TV premiere, a few days of data may be enough.
In the case of our air travel question, we’re only interested in any given Twitter user in the few hours around their plane flight, so we don’t necessarily need to track Tweets over a long period of time. We do, however, want to get enough of a sample to make generalizations about how people Tweet before and after flights, so we should make sure to collect enough data over a long enough period of time to examine multiple flights.
Using simple rule filters like country and language can go a long way towards helping us analyze Tweets from the right people. In this case, we’re interested in people from all demographics, as long as they are also passengers on an airline. We can likely identify those people through their language (“About to take off, headed home to Boulder!”), their granular geo-tagged location, or Twitter place (“Denver International Airport”).
We can use Twitter’s geo data and enrichments to make sure that our users are in relevant locations, if that is important to the question. Another way to approximate a user’s location might be the language they speak, or even the time that they are Tweeting (if you’re collecting Tweets at 2AM PST, don’t expect to see a lot of content from California).
Remember, selecting only users for whom we have a very granular location (like a geo-tag near an airport) means that we only get a sample of users. For studies where we want to know generally what people are talking about, that might not be a problem, but it isn’t as effective for studies where we want an exact count. Keep those trade-offs in mind when designing a data query.
Once we have a good understanding of our question, the next steps are to figure out how to get to the answer using Twitter data. We’ll have to get the right data, understand it, and filter out any irrelevant data.
Steps to get to an answer
In order to pull Tweets from the Search APIs, you will first have to understand how to write a search query in the query language of Twitter’s premium / enterprise search APIs. The most important basics are these:

- A single token, like “cat”, matches when that word occurs within the text of the Tweet. Token matches always disregard case.
- Tokens separated by a space are treated as AND queries: they are and-ed together. For instance, the query “cat dog” would search for Tweets with both “cat” and “dog” somewhere in the text of the Tweet.
- An “OR” (capitalization is important here) between two tokens means that your rule will match on either term (without needing both to be present). The rule “cat OR dog” will match on a Tweet with either “cat” or “dog” in the Tweet text.
- Use “()” to group operators together. In the query-language order of operations, “and” (the space) is applied before “or” (OR), so use parentheses to make your groups explicit. “cat dog OR bunny” is different from “cat (dog OR bunny)” (can you see why?).
- Use “-” to negate operators (such as terms that you definitely don’t want in your dataset). You might search for Tweets about cats and not about dogs with this type of query: “cat -dog”.

Detailed information on operators can be found here. We’ll introduce and use some more advanced operators later.

Note that “cat” (the token match operator) will only match text that the user typed out for that Tweet (it won’t, for instance, match the word “cat” in the user’s name or bio).

In order to search for Tweets, we’re going to use Twitter’s enterprise-grade Twitter search tool. This tool allows a user to make a request for Tweets by specifying a rule (more on that in a minute) that matches some Tweets, and to retrieve the results. This tutorial is also compatible with the 30-day search API, though you will have to change some of the dates for searching due to the 30-day historical limitation.
"Boarding the plane to fly to Tokyo in a few hours! So excited!"
"boarding", "the", "plane", "to", "fly", "to", "tokyo", "in", "a", "few", "hours", "so, "excited"
A rule that collected this Tweet might have been "plane" (because the exact token "plane" is included in the token list). You should note that these matches rely on complete tokens: this Tweet would not match if our rule was "airplane".

Maybe you want to match either the word "airplane" or the word "plane" or the word "flying" to get all of the relevant Tweets. In a single rule, use the operator OR (capitalization is important here) to search for any of a set of keywords.

> rule: airplane OR plane OR flying

Maybe you want to match both the word "flying" and the word "plane", but only when they appear together. Here, you would want to use Boolean AND logic to combine tokens into a single rule. In the syntax of Twitter’s Search API, AND is simply represented by a space.

> rule: flying plane

This would match "I'm flying in a plane!" but not "I'm flying!" or "I'm in a plane!"

Use quotes (") to combine tokens and look for them as a specific phrase.

> rule: "air travel"

This would match "Air travel is great" but not "Travel by air is great" or "Traveling in an airplane".

Remember that a token can match anywhere in the Tweet text: if your rule is simply "flying", then "I don't give a flying f**k what you do." would match your rule (true story: I’ve had to exclude the phrase "flying f**k" from this analysis 😳).

Use “-” to exclude, and use parentheses to group together logical clauses:

> rule: "(airplane OR plane OR flying) -fuck"

This isn’t a comprehensive guide, and more clarification can be found in our documentation.
This package can be found at https://github.com/twitterdev/search-tweets-python and implements a Python wrapper to:

- Create appropriately formatted rule payloads (date and rule query parameters)
- Make requests of the API
- Gracefully handle pagination of Search API results (be aware: the tool can make multiple API calls, so be sure to set the max_results parameter, which we’ll point out)
- Load the resultant Tweet text data as Python objects
If you want to run this notebook, it is hosted here. Clone this repo and you’ll see this notebook in the examples/finding_the_right_data directory. Please see the accompanying README.md file for full instructions. We’ve provided both a pip-ready finding_the_right_data_requirements.txt file and a conda environment file, finding_the_right_data_conda_env.yml, that allow you to set up a virtual environment for this example easily. This example assumes Python 3.6.
Credentials
Please go ahead and make a YAML file named .twitter_keys.yaml
in
your home directory.
For premium customers, the simplest credential file should look like this:
search_tweets_api:
account_type: premium
endpoint: <FULL_URL_OF_ENDPOINT>
consumer_key: <CONSUMER_KEY>
consumer_secret: <CONSUMER_SECRET>
For enterprise customers, the simplest credential file should look like this:
search_tweets_api:
account_type: enterprise
endpoint: <FULL_URL_OF_ENDPOINT>
username: <USERNAME>
password: <PW>
The rest of the example will assume ~/.twitter_keys.yaml exists, though you can specify your connection information directly in the notebook or via an environment variable if you want. For more information, please see the searchtweets section on credential handling.
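As a sketch of the environment-variable route: the SEARCHTWEETS_-prefixed variable names and the env_overwrite argument below come from the searchtweets README, but treat the exact names as assumptions and check that documentation before relying on them.

import os
from searchtweets import load_credentials

# hypothetical values shown inline; in practice, export these in your shell
os.environ["SEARCHTWEETS_ENDPOINT"] = "<FULL_URL_OF_ENDPOINT>"
os.environ["SEARCHTWEETS_USERNAME"] = "<USERNAME>"
os.environ["SEARCHTWEETS_PASSWORD"] = "<PW>"

# env_overwrite (default True) lets environment variables take precedence
# over anything read from the YAML file
search_args = load_credentials(account_type="enterprise", env_overwrite=True)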
The load_credentials
function parses this file and we’ll save the
search_args
variable for use throughout the session.
from tweet_parser.tweet import Tweet
from searchtweets import (ResultStream,
collect_results,
gen_rule_payload,
load_credentials)
search_args = load_credentials(filename="~/.twitter_keys.yaml",
account_type="enterprise")
In the following cells, we’ll define our rule and rule payloads, then
use the collect_results
method to get our tweets.
First - it is often convenient to assign a rule to a variable. Rules are
strings and can be simple strings, e.g., "(flying OR plane)"
or be
literal blocks, which can be easier to read for long rules:
"""
(flying
OR plane
OR landed)
"""
The remaining functions will parse newlines correctly.
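To convince yourself, you can pass a literal block straight through. This is just a quick sketch of the newline handling described above; gen_rule_payload is introduced properly in the next section.

_rule_block = """
(flying
OR plane
OR landed)
"""
# the embedded newlines are handled for you when the payload is built
print(gen_rule_payload(_rule_block, results_per_call=100))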
Below, we’ll start with a simple rule.
_rule_a = "(flying OR plane OR landed OR airport OR takeoff)"
The function gen_rule_payload will generate the JSON used to make requests to the search endpoint. Its full API doc lives here, but it has several main parameters:

- pt_rule (or the first positional arg): the string version of a PowerTrack rule
- results_per_call: the number of Tweets or counts returned per API call. Defaults to 500 to reduce API call usage, but you can change it as needed. It can range from 100-500.
- from_date: the starting time of your search
- to_date: the end time of your search
- count_bucket: if you are using the counts API endpoint, this defines the bucket in which Tweets are aggregated. This must be set if you want to use the Counts API.

The search times can be specified as dates or datetimes, down to the minute, e.g.:

- 2017-12-31
- 2017-12-31 23:50
- 2017-12-31T23:50
- 201712312350

Specifying only a date starts the search at 00:00 of that day.
The from_date and to_date parameters are optional. The search API defaults to returning up to the last 30 days’ worth of Tweets.
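For example (a quick sketch; outputs omitted):

# omitting the dates searches (up to) the most recent 30 days
print(gen_rule_payload(_rule_a))

# timestamps can also be specified down to the minute
print(gen_rule_payload(_rule_a,
                       from_date="2017-07-01 12:30",
                       to_date="2017-07-02"))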
rule_a = gen_rule_payload(_rule_a,
from_date="2017-07-01",
to_date="2017-08-01",
results_per_call=500)
rule_a
'{"query":"(flying OR plane OR landed OR airport OR takeoff)","maxResults":500,"toDate":"201708010000","fromDate":"201707010000"}'
The collect_results function is the fastest entry point to getting Tweets. It has several parameters:

- rule: a valid payload made from gen_rule_payload
- max_results: the maximum number of Tweets or counts you want to receive
- result_stream_args: the search arguments and authentication info for this request

It returns a list of results, Tweets or counts.
results_list = collect_results(rule_a,
max_results=500,
result_stream_args=search_args)
# hark, a Tweet!
results_list[0]
{'contributors': None,
'coordinates': None,
'created_at': 'Mon Jul 31 23:59:59 +0000 2017',
'entities': {'hashtags': [{'indices': [43, 50], 'text': 'defcon'}],
'symbols': [],
'urls': [],
'user_mentions': []},
'favorite_count': 0,
'favorited': False,
'filter_level': 'low',
'geo': None,
'id': 892173026116202498,
'id_str': '892173026116202498',
'in_reply_to_screen_name': None,
'in_reply_to_status_id': None,
'in_reply_to_status_id_str': None,
'in_reply_to_user_id': None,
'in_reply_to_user_id_str': None,
'is_quote_status': False,
'lang': 'en',
'matching_rules': [{'tag': None}],
'place': None,
'quote_count': 0,
'reply_count': 0,
'retweet_count': 0,
'retweeted': False,
'source': '<a href="http://tapbots.com/tweetbot" rel="nofollow">Tweetbot for iΟS</a>',
'text': 'Left for airport at 4:45AM, pleased to see #defcon hackers still partying at Caesars bars. Not pleased to be heading to airport at 4:45AM.',
'truncated': False,
'user': {'contributors_enabled': False,
'created_at': 'Tue Jul 26 12:59:48 +0000 2016',
'default_profile': True,
'default_profile_image': False,
'derived': {'klout': {'influence_topics': [{'id': '14711',
'name': 'Computers',
'score': 0.66,
'url': 'http://klout.com/topic/id/14711'},
{'id': '10000000000000000019',
'name': 'Aerospace',
'score': 0.51,
'url': 'http://klout.com/topic/id/10000000000000000019'},
{'id': '5715144008489027056',
'name': 'Hacking',
'score': 0.51,
'url': 'http://klout.com/topic/id/5715144008489027056'},
{'id': '6333513374221291742',
'name': 'Information Security',
'score': 0.48,
'url': 'http://klout.com/topic/id/6333513374221291742'},
{'id': '10000000000000000013',
'name': 'Armed Forces',
'score': 0.46,
'url': 'http://klout.com/topic/id/10000000000000000013'}],
'interest_topics': [{'id': '6333513374221291742',
'name': 'Information Security',
'score': 0.7000000000000001,
'url': 'http://klout.com/topic/id/6333513374221291742'},
{'id': '5715144008489027056',
'name': 'Hacking',
'score': 0.66,
'url': 'http://klout.com/topic/id/5715144008489027056'},
{'id': '10000000000000018702',
'name': 'Cybersecurity',
'score': 0.64,
'url': 'http://klout.com/topic/id/10000000000000018702'},
{'id': '10000000000000000013',
'name': 'Armed Forces',
'score': 0.63,
'url': 'http://klout.com/topic/id/10000000000000000013'},
{'id': '14711',
'name': 'Computers',
'score': 0.62,
'url': 'http://klout.com/topic/id/14711'}],
'profile_url': 'http://klout.com/user/id/120189854649800254',
'score': 13,
'user_id': '120189854649800254'}},
'description': 'davrik',
'favourites_count': 1,
'follow_request_sent': None,
'followers_count': 15,
'following': None,
'friends_count': 27,
'geo_enabled': False,
'id': 757923373951422464,
'id_str': '757923373951422464',
'is_translator': False,
'lang': 'en',
'listed_count': 0,
'location': 'cyber cyber cyber',
'name': 'davrik',
'notifications': None,
'profile_background_color': 'F5F8FA',
'profile_background_image_url': '',
'profile_background_image_url_https': '',
'profile_background_tile': False,
'profile_banner_url': 'https://pbs.twimg.com/profile_banners/757923373951422464/1469539141',
'profile_image_url': 'http://pbs.twimg.com/profile_images/757928014541905920/raYPrm1a_normal.jpg',
'profile_image_url_https': 'https://pbs.twimg.com/profile_images/757928014541905920/raYPrm1a_normal.jpg',
'profile_link_color': '1DA1F2',
'profile_sidebar_border_color': 'C0DEED',
'profile_sidebar_fill_color': 'DDEEF6',
'profile_text_color': '333333',
'profile_use_background_image': True,
'protected': False,
'screen_name': '0davrik',
'statuses_count': 19,
'time_zone': None,
'translator_type': 'none',
'url': None,
'utc_offset': None,
'verified': False}}
Let’s take the first Tweet from our results list and discuss its various elements.
Tweet data is returned from the API as a single JSON payload with a results array, containing many JSON Tweet payloads. The searchtweets package parses that data for you (including abstracting away concerns about the specific format of the Tweet payloads) using our tweet_parser package (https://github.com/tw-ddis/tweet_parser) and handles any errors caused by non-Tweet messages (like logging messages), returning Tweet objects (I’ll explain about that in a second).
To better understand Tweets, you should check out our developer website for information on Tweet payload elements. For now, I’m going to explain a few key elements of a Tweet and how to access them.
We’ll grab a single example Tweet: just the first one from our list.
This is a Tweet object, which has all of the properties of a Python
dict
as well as some special, Tweet-specific attributes.
example_tweet = results_list[0]
example_tweet["id"]
892173026116202498
A note on Tweet formatting and payload element naming
It just so happens that the format of this Tweet is “original format,” so we can look up the keys (names of payload elements) on our support website. It’s possible (but unlikely) that your Search API stream is configured slightly differently, and you’re looking at “activity streams” formatted Tweets (if so, this is completely fine). We’ll add the equivalent “activity streams” keys in the comments.
You can look up the specifics of Tweet payload elements here.
Note that the tweet_parser
package will abstract away the format of
the Tweet, and either activity streams or original format will work fine
in this notebook.
Now, the text of a Tweet:
The text of the Tweet (the thing you type, the characters that display in the Tweet) is stored as a top-level key called “text.”
print(results_list[0]["text"])
# uncomment the following if you appear to have an activity-stream formatted Tweet
# (the names of the payload elements are different, the data is the same):
# results_list[0]["body"]
Left for airport at 4:45AM, pleased to see #defcon hackers still partying at Caesars bars. Not pleased to be heading to airport at 4:45AM.
You can access, for instance, a Tweet’s #hashtags like this:
results_list[0]["entities"]["hashtags"]
# uncomment the following if you appear to have an activity-stream formatted Tweet
# results_list[0]["twitter_entities"]["hashtags"]
[{'indices': [43, 50], 'text': 'defcon'}]
# in case that first Tweet didn't actually have any hashtags:
[x["entities"]["hashtags"] for x in results_list[0:10]]
# uncomment the following if you appear to have an activity-stream formatted Tweet
# [x["twitter_entities"]["hashtags"] for x in results_list[0:10]]
[[{'indices': [43, 50], 'text': 'defcon'}],
[],
[{'indices': [22, 30], 'text': 'Algerie'}],
[],
[],
[{'indices': [1, 5], 'text': 'New'}],
[],
[],
[],
[]]
You will always need to understand Tweet payloads to work with Tweets,
but a Python package from this team (pip install tweet_parser
)
attempts to remove some of the hassle of accessing elements of a Tweet.
This package works seamlessly with both possible Twitter data formats,
and supplies the Tweet
object we referred to earlier.
Before doing a lot of work with this package (but not right this
minute!), we would encourage you to read the documentation for the
tweet_parser
package at https://twitterdev.github.io/tweet_parser/.
The package is open-source and available on GitHub at
https://github.com/twitterdev/tweet_parser. Feel free to submit issues
or make pull requests.
The Tweet
object has properties that allow you to access useful
elements of a Tweet without dealing with its format. For instance, we
can access some of the text fields of a Tweet with:
print("Literally the content of the 'text' field: \n{}\n" .format(results_list[0].text))
print("Other fields, like the content of a quoted tweet or the options of a poll: {}, {}".format(
results_list[0].quoted_tweet.text if results_list[0].quoted_tweet else None,
results_list[0].poll_options))
Literally the content of the 'text' field:
Left for airport at 4:45AM, pleased to see #defcon hackers still partying at Caesars bars. Not pleased to be heading to airport at 4:45AM.
Other fields, like the content of a quoted tweet or the options of a poll: None, []
As the Twitter platform changes, so do the fields in the data payload. For instance, “extended” and “truncated” Tweets have been introduced to stop 280-character Tweets from causing breaking changes. The Tweet object has a convenience property .all_text that gets “all” the text that a Tweet displays.
results_list[0].all_text
'Left for airport at 4:45AM, pleased to see #defcon hackers still partying at Caesars bars. Not pleased to be heading to airport at 4:45AM.'
We cannot dive into detail on every property of Tweet
that we’ll use
in this notebook, but hopefully you get the idea. When we call
.something
on a Tweet
, that .something
is provided in the
tweet_parser
package and documented there.
For now, we will describe the elements of the payload that we’ll use in this analysis.

- Timestamps are stored in the created_at element of a Tweet, but since we’re using the tweet_parser package, we can access a Python datetime object with tweet.created_at_datetime, or an integer representing seconds since Jan 1, 1970 with tweet.created_at_seconds.
- User information includes the user’s id (tweet["user"]["id"], or tweet.user_id with the tweet_parser package), the user’s screen name (tweet.screen_name; keep in mind that the user name may change), the user’s timezone, potentially their derived location, their bio (a user-entered description, at tweet.bio), and more.
- The tweet_parser-provided attribute tweet.all_text aggregates all of the possible text fields of a Tweet (the quoted text, the poll text, the hidden @-mentions, etc) into one string.
- The tweet_parser Tweet attribute tweet.hashtags provides a list of hashtags.
- tweet.user_mentions provides a list of users (names and ids) mentioned in a Tweet.
- tweet.geo_coordinates gets locations.

Now we can take all of those aforementioned elements, parse them out of the Tweets, and put them in a pandas DataFrame.
import pandas as pd
# make a pandas dataframe
data_df = pd.DataFrame([{"date": x.created_at_datetime,
"text": x.all_text,
"user": x.screen_name,
"bio": x.bio,
"at_mentions": [u["screen_name"] for u in x.user_mentions],
"hashtags": x.hashtags,
"urls": x.most_unrolled_urls,
"geo": x.geo_coordinates,
"type": x.tweet_type,
"id": x.id} for x in results_list]).set_index("date")
You’re going to look at this Tweet data and you (should) feel a little sad. We did all that work talking about rules, you’ve heard so much about Twitter data, and these Tweets mostly don’t seem relevant to our use case at all. They’re not from our target audience (people flying on planes) and they’re not even necessarily about plane trips. What gives?
There’s a reason we started out with only one API call: we didn’t want to waste calls pulling Tweets using an unrefined ruleset. This notebook is going to be about making better decisions about rules, iterating, and refining them until you get the Tweets that you want. And those Tweets are out there - there are, after all, lots of Tweets.
# data, parsed out and ready to analyze
data_df.head()
at_mentions | bio | geo | hashtags | id | text | type | urls | user | |
---|---|---|---|---|---|---|---|---|---|
date | |||||||||
2017-07-31 23:59:59 | [] | davrik | None | [defcon] | 892173026116202498 | Left for airport at 4:45AM, pleased to see #de... | tweet | [] | 0davrik |
2017-07-31 23:59:58 | [God1stAmerica2d] | Proudly Politically Incorrect. Widow of US Mar... | None | [] | 892173019040202752 | Why is this even a question. ID needed for lic... | retweet | [https://twitter.com/i/web/status/892144209418... | AliasHere |
2017-07-31 23:59:57 | [LPLdirect] | 🇬🇳🇺🇸 | None | [Algerie] | 892173017186435073 | 🔴INFOS #Algerie\n\nDeux pilotes d' Air Algérie... | retweet | [https://twitter.com/i/web/status/892152194920... | Wyzzyddl |
2017-07-31 23:59:57 | [Simbaki_] | 🏳️🌈 sc : kailabrielleeee / match ? | None | [] | 892173016372838400 | We are unappreciative... we don't deserve Take... | retweet | [] | KailaBrielleee_ |
2017-07-31 23:59:56 | [realDonaldTrump] | Consultant • 💻 Network Engineer • brunch advoc... | None | [] | 892173013625573376 | @realDonaldTrump Watch Trump get lost going fr... | quote | [https://twitter.com/i/web/status/892173013625... | LoveRunandPray |
Descriptive statistics can tell you a lot about your dataset with very little special effort and customization. Having a good set of descriptive statistics coded up that you always run on a new dataset can be helpful. Our team maintains Gnip-Tweet-Evaluation, a repository on GitHub that contains some tools for quick evaluation of a Tweet corpus (that package is useful as a command-line tool, and as a template for Tweet evaluation).
I’m going to walk through a few different statistics that might help us understand the data better:

- What is generating the Tweets?: One common and useful statistic for understanding a group of Tweets is how many of them are Retweets or Quote Tweets. Another interesting data point can be the application that created the Tweet itself.
- Common words: What are the most common words in the corpus? Are they what we expected? Are they relevant to the topic of interest?
- Who is Tweeting: Who are the users who Tweet the most? Are there spammy users that should be removed? Are there important users that we need to pay attention to?
- Where are they Tweeting from: What’s the distribution of Tweet locations? Is that surprising?
- Time series: Are there spikes in the Tweet volume? When do they occur? What drives the spikes? For the sake of filtering out noise (or news stories that we don’t care about) it may be important to identify spikes and filter out the Tweets that drive them.
# plotting library
import matplotlib.pyplot as plt
# pretty plots
plt.style.use("bmh")
# better sizing for the notebook
plt.rcParams['figure.figsize'] = (10, 5)
%matplotlib inline
Briefly, let’s note that there are several different Twitter platform actions that can create a Tweet:

- original Tweet: the user created a Tweet by typing it into the Tweet create box or otherwise hitting the statuses/update endpoint
- Retweet: the user reposted the Tweet to their own timeline, without adding any comment
- Quote Tweet: the user added commentary to a reposted Tweet; essentially a Tweet with another Tweet embedded in it
This is sometimes an important distinction: how much do Retweets or Quote Tweets represent the experience of the user reposting? Do we actually want Retweets in our particular dataset? Are they likely to tell us anything about users who are actually traveling? (Put it this way: if a user Retweets a story about being at an airport, how likely does it seem that they are at an airport?)
data_df[["type","id"]].groupby("type").count()
id | |
---|---|
type | |
quote | 16 |
retweet | 298 |
tweet | 186 |
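Those counts are often easier to read as proportions; here is a one-line pandas sketch (in this sample, roughly 60% of matches are Retweets):

# share of each Tweet type in the sample
data_df["type"].value_counts(normalize=True)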
Definitions:

- corpus: a collection of all of the documents that we are analyzing.
- document: a single piece of writing; in our case, a single Tweet.
- token: some characters from a document. Ideally, tokens are common enough to occur in many documents and carry some semantic meaning; they provide us with a way to infer semantic correlation across documents.
- stop word: a very common word with little semantic meaning (something like “and”, “but”, “the”); we often exclude these words from word counts or NLP models.
I’m going to tokenize the Tweets (split each Tweet into a list of tokens) using the nltk TweetTokenizer, which splits on spaces and throws away some punctuation. We will also throw away any tokens that contain no letters, and any tokens that are less than 3 characters long. These choices (how to define a token, and which tokens to count) are somewhat arbitrary and depend greatly on how language is used in the dataset. For instance, many tokenizers would remove the symbols “@” and “#” as punctuation, but on Twitter those symbols mean something (“@Jack” is potentially very different from “jack”) and should perhaps be preserved.
from nltk.tokenize import TweetTokenizer
def tweet_tokenizer(verbatim):
try:
tokenizer = TweetTokenizer()
all_tokens = tokenizer.tokenize(verbatim.lower())
# this line filters out all tokens that are entirely non-alphabetic characters
filtered_tokens = [t for t in all_tokens if t.islower()]
# filter out all tokens that are <=2 chars
filtered_tokens = [x for x in filtered_tokens if len(x)>2]
except IndexError:
filtered_tokens = []
return(filtered_tokens)
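A quick spot check on our example Tweet; per the rules above, short tokens like “at”, “to”, and “be” and pure-punctuation tokens should drop out:

# tokenize the first Tweet in our results
print(tweet_tokenizer(results_list[0].all_text))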
Counting
Of course, we could simply create a dictionary of tokens in our corpus, iterate through the entire corpus of Tweets, and add “1” to a count every time we encounter a certain token, but it’s more convenient to start using the scikit-learn API now and put our token counts into a format that’s easy to use later. If you’re unfamiliar with scikit-learn, it is an excellent Python machine learning framework that implements many common ML algorithms in a common building-block-able API.
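For contrast, that dictionary-of-counts approach is only a few lines with the standard library (a sketch using the tweet_tokenizer we defined above):

from collections import Counter

# tally every token across the corpus in a single counter
naive_counts = Counter()
for text in data_df["text"]:
    naive_counts.update(tweet_tokenizer(text))

naive_counts.most_common(10)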
I am going to use scikit-learn’s CountVectorizer to count tokens and turn the corpus into a document-term matrix. This makes it easy to count token frequencies. Note that the CountVectorizer allows us to count not only terms (“tokens”) but n-grams (sequences of tokens). For instance, if “plane” and “trip” are both tokens split on spaces, then the explicit phrase “plane trip” might be a 2-gram in our corpus.
# get the most common words in Tweets
from sklearn.feature_extraction.text import CountVectorizer
def get_frequent_terms(text_series, stop_words = None, ngram_range = (1,2)):
'''
Input:
text_series: a list or series of documents
stop_words: a list of stop_words to ignore, or the string 'english',
which uses a built-in stop word list for the english language.
By default, there are no stop words used
ngram_range: a single int, or a 2 tuple representing the range of ngrams to count.
the default (1,2) counts 1- and 2- grams.
Returns:
a dataframe of counts, indexed by n-gram
'''
count_vectorizer = CountVectorizer(analyzer = "word",
tokenizer = tweet_tokenizer,
stop_words = stop_words, # try changing the stopword sets that we use.
# notice that many top terms are words like
# "and" and "the"
ngram_range = ngram_range # you can change this to count frequencies of
# ngrams as well as single tokens
# a range of (1,2) counts 1-grams (single tokens) and
# 2-grams (2-token phrases)
)
term_freq_matrix = count_vectorizer.fit_transform(text_series)
terms = count_vectorizer.get_feature_names()
term_frequencies = term_freq_matrix.sum(axis = 0).tolist()[0]
term_freq_df = (pd.DataFrame(list(zip(terms, term_frequencies)), columns = ["token","count"])
.set_index("token")
.sort_values("count",ascending = False))
return term_freq_df
term_freq_df = get_frequent_terms(data_df["text"],
stop_words = "english") # stop_words = "english" removes words like 'and'
term_freq_df.head(20)
count | |
---|---|
token | |
airport | 196 |
flying | 123 |
plane | 94 |
landed | 48 |
home | 31 |
trump | 30 |
today | 29 |
harry | 27 |
just | 27 |
flight | 26 |
flying home | 25 |
got | 24 |
meeting | 24 |
initial | 23 |
dictated | 23 |
statement | 23 |
personally | 22 |
personally dictated | 22 |
trump personally | 22 |
g20 | 22 |
term_freq_df.tail(10)
count | |
---|---|
token | |
flying broom | 1 |
flying bullets | 1 |
flying california | 1 |
flying car | 1 |
flying cases | 1 |
flying centre | 1 |
flying chalice | 1 |
flying check | 1 |
flying close | 1 |
蕾阿蕾丶elahe https://t.co/y74vcswiuq | 1 |
Hand labeling
Never underestimate the value of reading through your data. Actually look at the terms in the Tweets. Identify the Tweets that look like ones you are actually interested in.
Consider doing something like searching for specific, relevant phrases that you hope will show up in your data:
data_df[data_df['text'].str.contains('flying home')]
at_mentions | bio | geo | hashtags | id | text | type | urls | user | |
---|---|---|---|---|---|---|---|---|---|
date | |||||||||
2017-07-31 23:58:33 | [WilkinsHarley] | None | None | [] | 892172664948871168 | @WilkinsHarley We would love too! Unfortunatel... | tweet | [] | ChrisLeFrenchy |
Or simply reading a few random Tweets in your data.
pd.set_option('display.max_colwidth', -1)
data_df[['text']].sample(5)
text | |
---|---|
date | |
2017-07-31 23:59:29 | @IFLYGLACIER I never complain about things like this, but Kalispell Airport security's reckless negligence was unacceptable and harmful behavior today. |
2017-07-31 23:58:08 | Gostei de um vídeo @YouTube https://t.co/4VXfrqw4yg Plane |
2017-07-31 23:57:31 | a whole 9 second of jeongyeon taking care of tzuyu's skirt from flying 😭💛 https://t.co/lSDVZYWfrR |
2017-07-31 23:59:34 | 🔴INFOS #Algerie\n\nDeux pilotes d' Air Algérie sont suspendu pour avoir laisser un enfant manipuler un avion en vol https://t.co/wh8Fmk71MX https://t.co/KHSz2nkvSs |
2017-07-31 23:57:04 | RT: This is why I'm jumping out of a plane on 18/08 to raise funds to help @LIONAID #SaveLions. Please sponsor me at https://t.co/6GY8WSmEkz https://t.co/U27B8YEofX |
A subsequent post will cover unsupervised clustering, which can help you group together Tweets to more easily make sense of them (you might group Tweets into 10 groups, and read a few Tweets from each group).
I’m going to use some Pandas tricks (.groupby
and .agg
) to find
the most commonly Tweeting users in the dataset.
(data_df[["user","bio","geo","id"]]
.groupby("user")
.agg({"id":"count","bio":"first","geo":"first"})
.sort_values("id",ascending = False)
.rename(columns={'id':'tweet_count'})
).head(15)
tweet_count | bio | geo | |
---|---|---|---|
user | |||
babyieturtle | 6 | ELF | YG/Rapper🌸 | NaN |
P3air | 3 | Worldwide insurance approved King Air Flight School and Turbo-Prop flight training (inflight and simulator) for initial, recurrent and transition pilots. | NaN |
CollectedN | 2 | Various News, Videos, Opinions. | NaN |
BloggerMe3 | 2 | @Steveirons edits a blog site interested in the future shape of Australia. Get involved. Make a blog. Visit your hometown. http://www.pinterest.com/bloggerme | NaN |
Moya_Monia | 2 | "If you can do what you do best and be happy, you're further along in life than most people." | NaN |
RPMSports18 | 2 | 24. University of South Alabama Alum. Just living life in a city below sea level. If you're reading this, then I probably popped up on your timeline. | NaN |
heskiwi94x | 2 | An adult fangirl &attended too many concerts. #OneDirection #Belieber #Mixer #Arianator #Lovatic #The1975 #Sheerio etc. #WWATourPhilly #OTRAPhilly. I met 1/4 | NaN |
___13Dec | 2 | ⠀⠀⠀⠀ ⠀⠀⠀⠀ ⠀⠀⠀⠀⠀⠀⠀⠀ ♡ ⠀⠀⠀⠀ ⠀⠀⠀⠀ ⠀⠀⠀ ⠀⠀⠀⠀ ⠀⠀⠀⠀TS. H. The1975 Kodaline Coldplay LANY OW⠀ ⠀⠀⠀ Linkinpark Troye Sivan Singto&Krist. PERAYA.⠀⠀⠀ ⠀⠀⠀⠀ ⠀⠀⠀⠀ | NaN |
billsplacehere | 2 | NaN | NaN |
Spotiflynet | 2 | NaN | NaN |
pdougmc | 2 | Retired, Interests: Gardening, photography, classical music, Astronomy, bicycling, Politics | NaN |
DaveDuFourNBA | 2 | Basketball Coach, Host "On the NBA with Dave DuFour", Contributor at @RealGM & @bballbreakdown\n\nCheck out my @Twitch channel https://www.twitch.tv/davedufournba | NaN |
remypost | 2 | im remy | he/him | 22 | i love videogames | im the ana main ur parents warned you about | i also make games/art: @remyripple | WIPs/sketches: @scissorskid | NaN |
EdyAntal | 2 | S A D B O Y S | NaN |
haneulvinkqq | 1 | 안녕하세요 ~ 태국아미😁 🇹🇭แอคบ่น เพ้อ รีทุกอย่างที่ขวางหน้า ชอบหลายวง อยู่ด้อมอาร์มี่ รักบังทัน 🐯ชิปวีมิน🐥 *💸YOU NEVER เปย์ ALONE💸* Beyond The Scene - BTS | NaN |
Doing time-series analysis with Twitter data will be covered in depth in a subsequent post, but it’s worth introducing here. Another great way to get a broader picture of our data is to understand when people are Tweeting about these keywords. Now, recall that earlier we only grabbed a tiny sample of data, so those Tweets only cover a few minutes of activity–not a good way to build a time series. If you have access to the enterprise level API, you will be able to make an API call to the counts endpoint to retrieve a time series without needing to use many calls to retrieve all Tweets over a time period. Let’s try it out.
Counts endpoint
The counts endpoint has exactly the same API as the search endpoint, but instead of returning Tweets, it returns counts of Tweets. This is especially useful when you want to quickly assess a large dataset without retrieving a lot of data.
# use the same "_rule" string
print("Recall, our rule is: {}".format(_rule_a))
count_rule_a = gen_rule_payload(_rule_a,
from_date="2017-07-01",
to_date="2017-08-01",
results_per_call=500,
count_bucket="hour")
counts_list = collect_results(count_rule_a, max_results=24*31, result_stream_args=search_args)
Recall, our rule is: (flying OR plane OR landed OR airport OR takeoff)
The resulting counts payload is a list of counts and UTC timestamps.
counts_list[0:5]
[{'count': 9018, 'timePeriod': '201707312300'},
{'count': 9900, 'timePeriod': '201707312200'},
{'count': 9251, 'timePeriod': '201707312100'},
{'count': 9829, 'timePeriod': '201707312000'},
{'count': 9315, 'timePeriod': '201707311900'}]
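Because each entry is a plain dict, it’s easy to estimate how many Tweets the full month-long pull would return (and therefore roughly how many API calls it would cost) before committing to it:

# total matching Tweets across all hourly buckets
total_tweets = sum(bucket["count"] for bucket in counts_list)
print("{:,} Tweets match this rule over the month".format(total_tweets))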
# plot a timeseries of the Tweet counts
tweet_counts_original = (pd.DataFrame(counts_list)
.assign(time_bucket = lambda x: pd.to_datetime(x["timePeriod"]))
.drop("timePeriod",axis = 1)
.set_index("time_bucket")
.sort_index()
)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize = (15,5))
(tweet_counts_original
.plot(ax=axes[0],
title = "Count of Tweets per hour",
legend = False));
(tweet_counts_original
.resample("D")
.sum()
.plot(ax=axes[1],
title="Count of Tweets per day",
legend=False));
How can we use this information to understand our search data?
Consider the type of Tweets that we are looking for: we’re looking for Tweets from people who are talking about air travel, hopefully people talking about their own experiences of air travel. It’s pretty likely that air travel has a relatively regular pattern, probably with some big seasonal fluctuations (Thanksgiving?) and some daily fluctuations (people fly, for the most part, during the day); it’s unlikely that 2 or 3 times as many people fly on any given Monday vs the Mondays around it.
We can use intuition about what spikes mean in our data either to filter out high-volume noise or, in the case where spikes are what we are seeking out (say we wanted the audience for a movie release that happens on a specific day), to zoom in on important time periods.
Let’s choose a few large spikes in this data and investigate further, then exclude that topic from our final Twitter dataset.
# you can look at the plots to get a sense of what the highest-volume time periods are, or just sort your dataframe
(tweet_counts_original
.resample("D")
.sum()
.sort_values(by = "count", ascending = False)
.head())
count | |
---|---|
time_bucket | |
2017-07-17 | 334922 |
2017-07-26 | 301029 |
2017-07-14 | 286049 |
2017-07-18 | 276175 |
2017-07-29 | 259820 |
# let's look at the highest volume day
spike_rule_717 = gen_rule_payload(_rule_a,
from_date="2017-07-17",
to_date="2017-07-18",
results_per_call=500)
spike_results_list_717 = collect_results(spike_rule_717, max_results=500, result_stream_args=search_args)
# what are these Tweets about?
get_frequent_terms([x.all_text for x in spike_results_list_717],
stop_words = "english").head(20)
count | |
---|---|
token | |
got | 459 |
flying | 273 |
security | 228 |
instead | 227 |
flying cars | 225 |
cars | 225 |
office | 224 |
robot drowned | 223 |
got suicidal | 223 |
drowned promised | 223 |
building got | 223 |
building | 223 |
office building | 223 |
got security | 223 |
robots | 223 |
instead got | 223 |
suicidal | 223 |
drowned | 223 |
suicidal robots | 223 |
security robot | 223 |
# It's sometimes important to get the context of a full Tweet
[x.all_text for x in spike_results_list_717 if "drowned" in x.all_text.lower()][0:10]
['Our D.C. office building got a security robot. It drowned itself.\n\nWe were promised flying cars, instead we got suicidal robots. https://t.co/rGLTAWZMjn',
'Our D.C. office building got a security robot. It drowned itself.\n\nWe were promised flying cars, instead we got suicidal robots. https://t.co/rGLTAWZMjn',
'Our D.C. office building got a security robot. It drowned itself.\n\nWe were promised flying cars, instead we got suicidal robots. https://t.co/rGLTAWZMjn',
'Our D.C. office building got a security robot. It drowned itself.\n\nWe were promised flying cars, instead we got suicidal robots. https://t.co/rGLTAWZMjn',
'Our D.C. office building got a security robot. It drowned itself.\n\nWe were promised flying cars, instead we got suicidal robots. https://t.co/rGLTAWZMjn',
'Our D.C. office building got a security robot. It drowned itself.\n\nWe were promised flying cars, instead we got suicidal robots. https://t.co/rGLTAWZMjn',
'Our D.C. office building got a security robot. It drowned itself.\n\nWe were promised flying cars, instead we got suicidal robots. https://t.co/rGLTAWZMjn',
'Our D.C. office building got a security robot. It drowned itself.\n\nWe were promised flying cars, instead we got suicidal robots. https://t.co/rGLTAWZMjn',
'Our D.C. office building got a security robot. It drowned itself.\n\nWe were promised flying cars, instead we got suicidal robots. https://t.co/rGLTAWZMjn',
'Our D.C. office building got a security robot. It drowned itself.\n\nWe were promised flying cars, instead we got suicidal robots. https://t.co/rGLTAWZMjn']
# now let's look at the second-highest volume day
spike_rule_726 = gen_rule_payload(_rule_a,
from_date="2017-07-26",
to_date="2017-07-27",
results_per_call=500)
spike_results_list_726 = collect_results(spike_rule_726, max_results=500, result_stream_args=search_args)
get_frequent_terms([x.all_text for x in spike_results_list_726], stop_words = "english").head(20)
count | |
---|---|
token | |
airport | 264 |
plane | 124 |
gimpo | 111 |
gimpo airport | 106 |
tokyo | 69 |
flying | 68 |
airport tokyo | 59 |
preview | 58 |
crash | 56 |
sooyoung | 56 |
seohyun | 55 |
small | 49 |
utah | 49 |
small plane | 48 |
crash utah | 46 |
highway | 46 |
#news | 46 |
die | 45 |
highway https://t.co/sizpfmclq7 | 45 |
#news #almalki | 45 |
[(x.all_text, x.tweet_type) for x in spike_results_list_726 if "gimpo" in x.all_text.lower()][0:10]
[('[HD PIC] 170726 Gimpo Airport - Our boys Eunhyuk, Leeteuk and Donghae looking so good in their airport fashion! [3P] (Cr:As Tagged) https://t.co/tKCT7QApsh',
'retweet'),
('170726 Gimpo Airport #Donghae [卡卡_kaka_KAKA] https://t.co/UIAgC2SroA',
'retweet'),
('[Preview] 170727 Seohyun - Gimpo airport by seohyuntwunion https://t.co/klNNEAf5Ko',
'retweet'),
('[Preview] 170727 Sooyoung - Gimpo airport by jtt_muk https://t.co/uKrcF8lmNe',
'retweet'),
('[Preview] 170727 Sooyoung - Gimpo airport by jtt_muk https://t.co/uKrcF8lmNe',
'retweet'),
('[Preview] 170727 Seohyun - Gimpo airport by mr_zhang https://t.co/5gAgwJZudT',
'retweet'),
('[PRESS] 170726 Mark at Gimpo airport departures to Tokyo, Japan (2) https://t.co/urWfCgwWJZ',
'tweet'),
('170726 Gimpo airport to Tokyo - Taeyong press pics #NCT127 #태용 https://t.co/qV3jfRdFuy',
'retweet'),
('170727 Heechul, Shindong at Gimpo Airport going to Japan for #SMTOWNLIVEinTokyo https://t.co/YWHSgX1XZW',
'retweet'),
('170727 Gimpo Airport\n\n#KimHeechul #Heechul #김희철 #희철\n\n[特纸SAMA和希大人一起niconiconi] https://t.co/OJgH9EKSSC',
'retweet')]
[(x.all_text, x.tweet_type) for x in spike_results_list_726 if "utah" in x.all_text.lower()][0:10]
[('#News by #almalki : Four aboard small plane die in crash on Utah highway https://t.co/sizPfmclq7',
'retweet'),
('#News by #almalki : Four aboard small plane die in crash on Utah highway https://t.co/sizPfmclq7',
'retweet'),
('#News by #almalki : Four aboard small plane die in crash on Utah highway https://t.co/sizPfmclq7',
'retweet'),
('#News by #almalki : Four aboard small plane die in crash on Utah highway https://t.co/sizPfmclq7',
'retweet'),
('#News by #almalki : Four aboard small plane die in crash on Utah highway https://t.co/sizPfmclq7',
'retweet'),
('#News by #almalki : Four aboard small plane die in crash on Utah highway https://t.co/sizPfmclq7',
'retweet'),
('#News by #almalki : Four aboard small plane die in crash on Utah highway https://t.co/sizPfmclq7',
'retweet'),
('#News by #almalki : Four aboard small plane die in crash on Utah highway https://t.co/sizPfmclq7',
'retweet'),
('#News by #almalki : Four aboard small plane die in crash on Utah highway https://t.co/sizPfmclq7',
'retweet'),
('#News by #almalki : Four aboard small plane die in crash on Utah highway https://t.co/sizPfmclq7',
'retweet')]
Recall the negation operator (the “-” operator), which you should use for exclusions. Use “-” to exclude terms that show up in irrelevant spikes, to exclude certain kinds of Tweets (say, exclude Retweets), or to exclude anything else that might mark a Tweet as irrelevant (spammy hashtags, news articles that aren’t interesting, etc).

# expand based on what you see (these are top terms that didn't seem relevant; look at the frequent terms we saw)
_rule_b = """
(plane OR landed OR airport OR takeoff OR #wheelsup)
-is:retweet
-trump
-harry
-drowned
-g20
-lawyer
-gimpo
-fuck
"""
rule_b = gen_rule_payload(_rule_b,
from_date="2017-07-01",
to_date="2017-08-01",
results_per_call=500)
results_list = collect_results(rule_b, max_results=500, result_stream_args=search_args)
# look at frequent terms again
get_frequent_terms([x.all_text for x in results_list], stop_words = "english", ngram_range = (2,3)).head(20)
count | |
---|---|
token | |
international airport | 26 |
just landed | 11 |
airport worker | 10 |
easyjet passenger | 10 |
airport security | 9 |
plane crashes | 8 |
plane crash | 8 |
punches easyjet passenger | 7 |
punches easyjet | 7 |
passenger holding | 7 |
crashes mississippi killing | 6 |
mississippi killing | 6 |
crashes mississippi | 6 |
military plane crashes | 6 |
plane crashes mississippi | 6 |
mississippi killing people | 6 |
military plane | 6 |
killing people | 6 |
plane ride | 5 |
new york | 5 |
# use the same "_rule" string
print("Recall, our rule is: {}".format(_rule_b))
count_rule_b = gen_rule_payload(_rule_b,
from_date="2017-07-01",
to_date="2017-08-01",
count_bucket="hour")
counts_list = collect_results(count_rule_b, max_results=24*31, result_stream_args=search_args)
# plot a timeseries of the Tweet counts
tweet_counts = (pd.DataFrame(counts_list)
.assign(time_bucket = lambda x: pd.to_datetime(x["timePeriod"]))
.drop("timePeriod",axis = 1)
.set_index("time_bucket")
.sort_index())
fig, axes = plt.subplots(nrows=1, ncols=2, figsize = (15,5))
(tweet_counts
.plot(ax=axes[0],
title = "Count of Tweets per hour", legend = False));
(tweet_counts
.resample("D")
.sum()
.plot(ax=axes[1],
title="Count of Tweets per day",
legend=False));
Recall, our rule is:
(plane OR landed OR airport OR takeoff OR #wheelsup)
-is:retweet
-trump
-harry
-drowned
-g20
-lawyer
-gimpo
-fuck
Remember when we noted that the Tweet time series around flying is unlikely to have big, irregular spikes? It seems like we got rid of most of them, save one or two. Let’s look into those last spikes and exclude those topics too.
# let's look at the highest volume day
spike_rule_711 = gen_rule_payload(_rule_b,
from_date="2017-07-11",
to_date="2017-07-12",
)
# force results to evaluate (this step actually makes the API calls)
spike_results_list_711 = collect_results(spike_rule_711, max_results=500, result_stream_args=search_args)
get_frequent_terms([x.all_text for x in spike_results_list_711],
stop_words="english",
ngram_range = (1,1)).head(15)
count | |
---|---|
token | |
plane | 244 |
airport | 195 |
crash | 62 |
landed | 48 |
mississippi | 42 |
marine | 39 |
international | 38 |
i'm | 37 |
killed | 35 |
vermont | 29 |
taxiway | 25 |
flight | 23 |
just | 22 |
san | 21 |
lands | 19 |
We’ve talked about using frequent terms to identify what a corpus of Tweets is about, but I want to mention frequent “n-grams” as a slightly more sophisticated alternative.
n-gram: a sequence of n tokens from a document
Using frequent n-grams, you might be able to identify when words show up together, in the same sequence, surprisingly often: a potential indicator of spam, promotional material, memes, lyrics, or reposted content. Pay attention to words that appear together, and remember that you can exclude an n-gram by excluding an “exact phrase” from your Search query.
# let's look at 2- and 3- grams
get_frequent_terms([x.all_text for x in spike_results_list_711],
stop_words = "english",
ngram_range = (2,3)).head(20)
count | |
---|---|
token | |
plane crash | 58 |
international airport | 37 |
mississippi plane | 30 |
mississippi plane crash | 30 |
killed mississippi | 29 |
marine vermont killed | 29 |
marine vermont | 29 |
vermont killed | 29 |
vermont killed mississippi | 29 |
killed mississippi plane | 29 |
san francisco | 17 |
nearly lands | 14 |
military plane | 12 |
lands crowded | 11 |
crowded taxiway | 11 |
plane ticket | 11 |
air canada | 11 |
lands crowded taxiway | 11 |
greatest aviation | 10 |
greatest aviation disaster | 10 |
In this data, we can see that “mississippi plane crash”, “killed mississippi plane”, and “vermont killed mississippi” all show up frequently, and they all show up the same number of times. Let me find one of those Tweets to look at:
[x.all_text for x in spike_results_list_711 if "vermont" in x.all_text.lower()][0:5]
['Marine from Vermont killed in Mississippi plane\xa0crash https://t.co/RGnRfCI0lg',
'KMBC @kmbc\n Marine from Vermont killed in Mississippi plane crash https://t.co/DirY01X2Pd https://t.co/QTaRr1JkNQ',
'Pittsburgh News Marine from Vermont killed in Mississippi plane crash https://t.co/dk71SLzkm5 https://t.co/JfgXcdO1yR',
'Marine from Vermont killed in Mississippi plane crash https://t.co/HvjlG7vp6W https://t.co/Ij4MCBBFQn',
'Marine from Vermont killed in Mississippi plane crash https://t.co/7hSTiZxMmX https://t.co/irB0fdy6Vs']
This doesn’t have much to do with the experience of flying in a plane, so let’s exclude it. In this case, we might exclude these Tweets pretty precisely by excluding “Marine” (it’s good to pick specific, unambiguous terms for exclusions, so that you don’t exclude anything too broad), but for this example we’ll exclude “plane crash”: it’s precise enough, it will exclude other similar news-type content, and we can demonstrate exact phrase matching.
Add this to my rule: -"plane crash"
# expand based on what you see
_rule_c = """
(plane OR landed OR airport OR takeoff OR #wheelsup)
-is:retweet
-trump
-harry
-drowned
-g20
-lawyer
-gimpo
-fuck
-"plane crash"
"""
spike_rule_704 = gen_rule_payload(_rule_c,
from_date="2017-07-04",
to_date="2017-07-05",
)
spike_results_list_704 = collect_results(spike_rule_704, max_results=500, result_stream_args=search_args)
get_frequent_terms([x.all_text for x in spike_results_list_704], stop_words = "english", ngram_range = (2,3)).head(20)
count | |
---|---|
token | |
hampton airport | 39 |
citizens east hampton | 39 |
east hampton | 39 |
east hampton airport | 39 |
citizens east | 39 |
international airport | 30 |
airport james | 29 |
hampton airport james | 29 |
james barron | 29 |
airport james barron | 29 |
barron nyt | 26 |
james barron nyt | 26 |
new york | 17 |
#android #gameinsight | 16 |
airport city | 15 |
new york times | 14 |
york times | 14 |
nyt new york | 13 |
nyt new | 13 |
barron nyt new | 13 |
Now, it’s not the top term, but when a specific hashtag shows up with some frequency, it’s time to take a look. Hashtags are often used to group together topics, and by filtering on a hashtag we can catch irrelevant topics. Let’s look at “#gameinsight.”
# try searching on the hashtag:
[x.all_text for x in spike_results_list_704 if "gameinsight" in x.hashtags][0:5]
['¡He completado la misión Día Desafortunado en el juego Airport-City! https://t.co/Ya2oVyXXHo #android #gameinsight',
'¡He alcanzado el nivel 8 en el juego Airport-City! https://t.co/Ya2oVyXXHo #android #gameinsight',
'New plane in my Airport City: Turboprop! https://t.co/Ky4J5dMyBT #iOS #gameinsight',
"I've completed Party Like You Mean It quest in Airport City! https://t.co/QaJ1PIehi7 #android #gameinsight",
"I've completed A Chinese Torture quest in Airport City! https://t.co/QaJ1PIehi7 #android #gameinsight"]
Well, that looks like spam. Let’s exclude a hashtag this time: it’s specific, it appears throughout this content, and it’s easy to exclude. Use the hashtag and negation operator in Search like this:

Add -#gameinsight

Also, it’s not clear that “gameinsight” accounts for all of this spike. Try searching for “hampton” too (as it is one of the most common terms):
[x.all_text for x in spike_results_list_704 if "hampton" in x.all_text.lower()][0:5]
['The Citizens of East Hampton v. Its Airport https://t.co/xDRiyor0Ej',
'"The Citizens of East Hampton v. Its Airport" https://t.co/Fm6NOXqRen',
'"The Citizens of East Hampton v. Its Airport" by JAMES BARRON via NYT https://t.co/LAMg6kHkZN https://t.co/pHhMKTw8Kx',
'#3Novices : The Citizens of East Hampton v. Its Airport The Supreme Court’s refusal last week to review local restrictions on flights is th…',
'"The Citizens of East Hampton v. Its Airport" by JAMES BARRON via NYT https://t.co/eaW8rdnj4M']
Now, we could add -"citizens of east hampton" and exclude this one news story, but we might have identified a larger pattern.
URL matching
News stories about planes, crashes, etc. are showing up a lot, and we do not want to see any of them in the final dataset (again, this is absolutely a choice, and it will eliminate a lot of data, for better or for worse). We’re going to choose to exclude all Tweets that include certain big news URLs. We won’t get this perfect here, but we will see how to use the URL matching rule, and it should give you more ideas about how to filter Tweets.
The url:<token> operator performs a tokenized match on words in the unrolled URL that was posted in the Tweet. We’ll eliminate Tweets with “nytimes”, “bbc”, “washingtonpost”, “cbsnews”, “reuters”, “apnews”, and “news” in the URL.
Add:
-url:nytimes -url:bbc -url:washingtonpost -url:cbsnews -url:reuters -url:apnews -url:news
# expand based on what you see
_rule_d = """
(plane OR landed OR airport OR takeoff OR #wheelsup)
-is:retweet
-trump
-harry
-drowned
-g20
-lawyer
-gimpo
-fuck
-"plane crash"
-"citizens of east hampton"
-url:nytimes -url:bbc -url:washingtonpost -url:cbsnews -url:reuters -url:apnews -url:news
"""
# finally, let's look at that high volume hour on July 3rd
tweet_counts.sort_values(by = "count", ascending = False).head(2)
count | |
---|---|
time_bucket | |
2017-07-03 18:00:00 | 5058 |
2017-07-03 19:00:00 | 4458 |
spike_rule_703 = gen_rule_payload(_rule_d, #<-remember, same rule
from_date="2017-07-03T18:00",
to_date="2017-07-03T19:01",
)
# force results to evaluate (this step actually makes the API calls)
spike_results_list_703 = collect_results(spike_rule_703, max_results=500, result_stream_args=search_args)
get_frequent_terms([x.all_text for x in spike_results_list_703], stop_words = "english", ngram_range = (2,3)).head(20)
count | |
---|---|
token | |
boston airport | 133 |
near boston | 84 |
near boston airport | 77 |
logan airport | 69 |
pedestrians near | 57 |
airport injured | 52 |
vehicle near boston | 52 |
vehicle near | 52 |
pedestrians struck | 50 |
pedestrians struck vehicle | 50 |
struck vehicle | 50 |
struck vehicle near | 50 |
boston airport injured | 44 |
international airport | 35 |
boston's logan | 34 |
boston's logan airport | 33 |
people injured | 32 |
strikes crowd | 31 |
car strikes | 30 |
near logan | 30 |
# read some Tweets
[x.all_text for x in spike_results_list_703 if "Boston" in x.all_text][0:10]
['#gossip Car Strikes Crowd at East Boston Airport, Multiple People Injured https://t.co/lK1EIbm9OJ',
'Tru Town Films Car Strikes Crowd at East Boston Airport, Multiple People Injured https://t.co/eUy8dWqXuD',
'Car Strikes Crowd at East Boston Airport, Multiple People Injured https://t.co/4YcPZ3QkZA',
'Car Strikes Crowd at East Boston Airport, Multiple People\xa0Injured https://t.co/yyijNbPa4p',
'Pedestrians struck by vehicle near Boston airport, several injuries reported https://t.co/yOnQ9bGWBk #p2 #ctl https://t.co/ugVOtnwzLs\n\n— …',
"The Times Picayune - Vehicle hits pedestrians near Boston's Logan Airport: report https://t.co/BpDCYg2i5R",
'Police: Boston airport crash injures 10 pedestrians https://t.co/YlGlcHFd8u',
'Update: Taxi incident near Boston airport being treated as accident, not terrorism, law enforcement sources say https://t.co/kgvXLkd4Zk',
'Car Strikes Crowd at East Boston Airport, Multiple People Injured https://t.co/PfJ7xhgPLT',
'Taxi strikes pedestrians near Boston airport, police\xa0say https://t.co/gKwp3D5sA9']
We don’t want to eliminate all mentions of “Boston Airport” (because many of those mentions are likely relevant to my study). Instead, we’ll eliminate a few common phrases like “pedestrians struck”, “pedestrians injured”, “car strikes”, “taxi strikes”.
Add:
-"pedestrians struck" -"pedestrians injured" -"car strikes" -"taxi strikes" -"car hits"
# expand based on what you see
_rule_e = """
(plane OR landed OR airport OR takeoff OR #wheelsup)
-is:retweet
-trump
-harry
-drowned
-g20
-lawyer
-gimpo
-fuck
-"plane crash"
-#gameinsight
-"citizens of east hampton"
-url:nytimes -url:bbc -url:washingtonpost -url:cbsnews -url:reuters -url:apnews -url:news
-"pedestrians struck" -"pedestrians injured" -"car strikes" -"taxi strikes" -"car hits"
"""
rule_e = gen_rule_payload(_rule_e,
from_date="2017-07-01",
to_date="2017-08-01",
)
results_list = collect_results(rule_e, max_results=500, result_stream_args=search_args)
print("Recall, our new rule is: {}".format(_rule_e))
count_rule_e = gen_rule_payload(_rule_e,
from_date="2017-07-01",
to_date="2017-08-01",
count_bucket="hour")
counts_list = collect_results(count_rule_e, max_results=24*31, result_stream_args=search_args)
# plot a timeseries of the Tweet counts
tweet_counts = (pd.DataFrame(counts_list)
.assign(time_bucket = lambda x: pd.to_datetime(x["timePeriod"]))
.drop("timePeriod",axis = 1)
.set_index("time_bucket")
.sort_index()
)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize = (15,5))
_ = tweet_counts.plot(ax=axes[0], title = "Count of Tweets per hour", legend = False)
_ = tweet_counts.resample("D").sum().plot(ax=axes[1], title = "Count of Tweets per day", legend = False)
Recall, our new rule is:
(plane OR landed OR airport OR takeoff OR #wheelsup)
-is:retweet
-trump
-harry
-drowned
-g20
-lawyer
-gimpo
-fuck
-"plane crash"
-#gameinsight
-"citizens of east hampton"
-url:nytimes -url:bbc -url:washingtonpost -url:cbsnews -url:reuters -url:apnews -url:news
-"pedestrians struck" -"pedestrians injured" -"car strikes" -"taxi strikes" -"car hits"
Let’s compare the volume of our first (unrefined) rule to our most recent one.
Spoiler: We’ve eliminated a huge amount of unhelpful data!
Now you can begin to think about using this data in an analysis.
fig, axes = plt.subplots(nrows=1, ncols=1, figsize = (15,5))
tweet_counts_original.resample("D").sum().plot(ax=axes, title = "Count of Tweets per day", legend = False)
tweet_counts.resample("D").sum().plot(ax=axes, title = "Count of Tweets per day", legend = False)
plt.legend(["original", "refined"]);
get_frequent_terms([x.all_text for x in results_list], stop_words = "english", ngram_range = (1,2)).head(20)
token | count
---|---
airport | 247
plane | 173
landed | 64
i'm | 36
just | 35
international | 32
international airport | 30
people | 21
like | 21
time | 18
don't | 17
security | 17
flight | 16
takeoff | 15
got | 14
passenger | 14
today | 13
know | 13
hours | 12
just landed | 12
It’s always possible to continue refining a dataset, and we should definitely continue that work after pulling the Tweets. The key in early data-cleaning steps is to eliminate a large bulk of irrelevant Tweets that you don’t want to store or pay for (we did that).
Now let’s pull data with our final rule (warning: this will use quite a few API calls, so you might want to think twice before actually running it; in the notebook, the cell type is markdown so you can’t run it by accident).
Before you decide to pull this data, we’ll check how many API calls we’ve used in this notebook (the ResultStream object provides a convenience variable for this).
# count API calls
print("You have used {} API calls in this session".format(ResultStream.session_request_counter))
You have used 12 API calls in this session
Store the data you pull!
You don’t want to pull this much data without saving it for later. We’re going to use the ResultStream object to make API calls, stream data in an iterator, hold the relevant fields in memory in our Python session, and (importantly!) stream raw Tweets to a file for later use. Even if you don’t want to run these exact API calls, pay attention to how this is done for your own work.
Finalize your rule
# this is our final rule
print(_rule_e)
(plane OR landed OR airport OR takeoff OR #wheelsup)
-is:retweet
-trump
-harry
-drowned
-g20
-lawyer
-gimpo
-fuck
-"plane crash"
-#gameinsight
-"citizens of east hampton"
-url:nytimes -url:bbc -url:washingtonpost -url:cbsnews -url:reuters -url:apnews -url:news
-"pedestrians struck" -"pedestrians injured" -"car strikes" -"taxi strikes" -"car hits"
# I can sum up the counts endpoint results to guess how many Tweets I'll get
# (or, if you don't have access to the counts endpoints, try extrapolating from a single day)
tweet_counts.sum()
count 1572126
dtype: int64
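If you don’t have access to the counts endpoint, a rough alternative is to extrapolate from a shorter window. A hedged sketch (the specific day below is an arbitrary choice of ours, and the estimate assumes volume is roughly stable across the month):
# extrapolate a month's volume from one day of hourly counts
# (assumes `tweet_counts` is the hourly counts DataFrame built above)
one_day_total = tweet_counts.loc["2017-07-11", "count"].sum()  # arbitrary mid-month day
print("Estimated Tweets for July: ~{:,}".format(int(one_day_total * 31)))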
Narrowing your dataset further
You might refine your rule and find that there are still millions of Tweets about a topic on Twitter (this happens; there are a lot of Tweets).
If your rule is still going to pull an unrealistic number of Tweets, consider narrowing the scope of your investigation. You can:
- Narrow the time period: do you really need a month of data, or could you sample just a few days?
- Select more specific Tweets: it may help your analysis to keep only Tweets that are geo-tagged (this reduces data volume significantly), or only Tweets from users with profile locations. Stringent requirements on the data can speed up your analysis and shrink your data volume.
- Sample: the Search API doesn’t support random sampling the way that the Historical APIs do, so you’ll have to come up with balanced ways of selecting less data. You can sample based on geography, time, or Tweet language; anything that narrows your scope (see the sketch after this list).
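As one illustration of time-based sampling, you could run the same rule over a handful of spread-out days rather than the full month. A minimal sketch (the choice of windows is ours and purely illustrative):
# sample a few day-long windows instead of pulling the full month
sample_windows = [("2017-07-05", "2017-07-06"),
                  ("2017-07-12", "2017-07-13"),
                  ("2017-07-19", "2017-07-20")]
sampled_rules = [gen_rule_payload(_rule_e, from_date=start, to_date=end)
                 for start, end in sample_windows]
# each rule can then be passed to collect_results as before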
We’ll show one example of this by only pulling Tweets from users with a profile location (you can read more about what a “profile location” means in our documentation) in the state of Colorado (this is for illustrative purposes; depending on your use case, we’d probably recommend narrowing the time scope first).
Add: profile_country:US profile_region:"Colorado" to our Twitter Search rule.
final_rule = """
(plane OR landed OR airport OR takeoff OR #wheelsup)
-is:retweet
-trump
-harry
-drowned
-g20
-lawyer
-gimpo
-fuck
-"plane crash"
-#gameinsight
-"citizens of east hampton"
-url:nytimes -url:bbc -url:washingtonpost -url:cbsnews -url:reuters -url:apnews -url:news
-"pedestrians struck" -"pedstrians injured" -"car strikes" -"taxi strikes" -"car hits"
profile_country:US profile_region:"Colorado"
"""
count_rule = gen_rule_payload(final_rule,
from_date="2017-07-01",
to_date="2017-08-01",
count_bucket="hour")
counts_list = collect_results(count_rule, max_results=24*31, result_stream_args=search_args)
# plot a timeseries of the Tweet counts
tweet_counts = (pd.DataFrame(counts_list)
.assign(time_bucket = lambda x: pd.to_datetime(x["timePeriod"]))
.drop("timePeriod",axis = 1)
.set_index("time_bucket")
.sort_index()
)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize = (15,5))
_ = tweet_counts.plot(ax=axes[0], title = "Count of Tweets per hour", legend = False)
_ = tweet_counts.resample("D").sum().plot(ax=axes[1], title = "Count of Tweets per day", legend = False)
tweet_counts.sum()
count 7225
dtype: int64
Seems like a reasonable number of Tweets. Let’s go get them.
This time, we’re going to use the ResultStream object and stream the full Tweet payloads to a file while simultaneously creating a Pandas DataFrame in memory of the limited Tweet fields that we care about.
If you actually want to run the cell below, you’ll have to set the cell type to “code.”
final_rule_payload = gen_rule_payload(final_rule,
from_date="2017-07-01",
to_date="2017-08-01")
stream = ResultStream(**search_args,
rule_payload=final_rule_payload,
max_results=None) # should collect all of the results
# write_ndjson is a utility function that writes the results to a file and passes them through the iterator
from searchtweets.utils import write_ndjson
limited_fields = []
for x in write_ndjson("air_travel_data.json", stream.stream()):
limited_fields.append({"date": x.created_at_datetime,
"text": x.all_text,
"user": x.screen_name,
"bio": x.bio,
"at_mentions": [u["screen_name"] for u in x.user_mentions],
"hashtags": x.hashtags,
"urls": x.most_unrolled_urls,
"geo": x.geo_coordinates,
"type": x.tweet_type,
"id": x.id})
# create a dataframe
final_dataset_df = pd.DataFrame(limited_fields)
final_dataset_df.head()
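Later, you can reload the saved file and rehydrate Tweet objects from it. A minimal sketch (this reload snippet is ours; it assumes the ndjson file written above contains one Tweet payload per line):
# reload the saved ndjson file into tweet_parser Tweet objects
import json
from tweet_parser.tweet import Tweet

with open("air_travel_data.json") as f:
    reloaded_tweets = [Tweet(json.loads(line)) for line in f]
print("reloaded {} Tweets".format(len(reloaded_tweets)))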
All Tweet data collection should be focused on a question, and ours focused on finding out what people were talking about while they flew on airplanes. You can use the same basic steps that we used here to answer your own business questions using Twitter data.
In this post, we covered the searchtweets package and how to write rules to get the Tweets that you are looking for, as well as Tweet objects (provided by the tweet_parser package) and Tweet payload elements.