searchtweets package

Submodules

searchtweets.api_utils module

Module containing the various functions used for API calls, rule generation, and related tasks.

searchtweets.api_utils.gen_rule_payload(pt_rule, results_per_call=None, from_date=None, to_date=None, count_bucket=None, tag=None, stringify=True)[source]

Generates the dict or json payload for a PowerTrack rule.

Parameters:
  • pt_rule (str) – The string version of a powertrack rule, e.g., “beyonce has:geo”. Accepts multi-line strings for ease of entry.
  • results_per_call (int) – number of tweets or counts returned per API call. This maps to the maxResults search API parameter. Defaults to 500 to reduce API call usage.
  • from_date (str or None) – date format as specified by convert_utc_time for the starting time of your search.
  • to_date (str or None) – date format as specified by convert_utc_time for the end time of your search.
  • count_bucket (str or None) – if using the counts API endpoint, this defines the count bucket for which tweets are aggregated.
  • stringify (bool) – specifies the return type, dict or json-formatted str.

Example

>>> from searchtweets.api_utils import gen_rule_payload
>>> gen_rule_payload("beyonce has:geo",
    ...              from_date="2017-08-21",
    ...              to_date="2017-08-22")
'{"query":"beyonce has:geo","maxResults":100,"toDate":"201708220000","fromDate":"201708210000"}'
searchtweets.api_utils.gen_params_from_config(config_dict)[source]

Generates parameters for a ResultStream from a dictionary.
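
For illustration, a hedged sketch of passing in a flattened configuration dictionary; the key names below mirror the config examples later in this document and are otherwise assumptions, and the resulting dict is intended for splatting into a ResultStream:

>>> from searchtweets.api_utils import gen_params_from_config
>>> config = {"pt_rule": "beyonce has:geo",
...           "from_date": "2017-08-21",
...           "to_date": "2017-08-22",
...           "results_per_call": 500}
>>> params = gen_params_from_config(config)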

searchtweets.api_utils.infer_endpoint(rule_payload)[source]

Infer which endpoint should be used for a given rule payload.
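
A rough sketch of the idea, not the library's exact code: payloads produced by gen_rule_payload with a count_bucket carry a bucket key, so the inference can reduce to a key check (the "counts"/"search" labels below are illustrative):

import json

def infer_endpoint_sketch(rule_payload):
    # Assumption: count payloads mark themselves with a "bucket" key,
    # so its presence signals the counts endpoint; plain search otherwise.
    if isinstance(rule_payload, str):
        rule_payload = json.loads(rule_payload)
    return "counts" if rule_payload.get("bucket") else "search"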

searchtweets.api_utils.convert_utc_time(datetime_str)[source]

Handles datetime argument conversion to the GNIP API format, which is YYYYMMDDHHMM. Dates may be passed flexibly in any of the following formats:

- YYYYmmDDHHMM
- YYYY-mm-DD
- YYYY-mm-DD HH:MM
- YYYY-mm-DDTHH:MM

Parameters:datetime_str (str) – valid formats are listed above.
Returns:string of GNIP API formatted date.

Example

>>> from searchtweets.api_utils import convert_utc_time
>>> convert_utc_time("201708020000")
'201708020000'
>>> convert_utc_time("2017-08-02")
'201708020000'
>>> convert_utc_time("2017-08-02 00:00")
'201708020000'
>>> convert_utc_time("2017-08-02T00:00")
'201708020000'
searchtweets.api_utils.validate_count_api(rule_payload, endpoint)[source]

Ensures that the counts API is set correctly in a payload.

searchtweets.api_utils.change_to_count_endpoint(endpoint)[source]

Utility function to change a normal endpoint to a count API endpoint. Returns the same endpoint if it’s already a valid count endpoint.

Parameters:endpoint (str) – your API endpoint

Returns:the modified count endpoint.
Return type:str
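
A hedged usage sketch; the account placeholder is illustrative, and the exact output shape (inserting counts before the .json extension, per the Gnip counts endpoint pattern) is an assumption here:

>>> from searchtweets.api_utils import change_to_count_endpoint
>>> endpoint = ("https://gnip-api.twitter.com/search/powertrack/"
...             "accounts/<ACCOUNT>/prod.json")
>>> count_endpoint = change_to_count_endpoint(endpoint)
>>> # expected to resemble .../accounts/<ACCOUNT>/prod/counts.json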

searchtweets.result_stream module

This module contains the request handling and actual API wrapping functionality.

Its core component is the ResultStream object, which takes the API call arguments and returns a stream of results to the user.

class searchtweets.result_stream.ResultStream(endpoint, rule_payload, username=None, password=None, bearer_token=None, extra_headers_dict=None, max_results=500, tweetify=True, max_requests=None, **kwargs)[source]

Bases: object

Class to represent an API query that handles two major pieces of functionality: wrapping metadata around a specific API call and automatically paginating results.

Parameters:
  • username (str) – username for enterprise customers
  • password (str) – password for enterprise customers
  • bearer_token (str) – bearer token for premium users
  • endpoint (str) – API endpoint; see your console at developer.twitter.com
  • rule_payload (json or dict) – payload for the post request
  • max_results (int) – max number of results that will be returned from this instance. Note that this can be slightly lower than the total returned from the API call - e.g., setting max_results = 10 would return ten results, but an API call will return at minimum 100 results.
  • tweetify (bool) – If you are grabbing tweets and not counts, use the tweet parser library to convert each raw tweet package to a Tweet with lazy properties.
  • max_requests (int) – a hard cutoff for the number of API calls this instance will make. Good for testing in sandbox premium environments.
  • extra_headers_dict (dict) – custom headers to add

Example

>>> rs = ResultStream(**search_args, rule_payload=rule, max_requests=1)
>>> results = list(rs.stream())
check_counts()[source]

Disables tweet parsing if the count API is used.

execute_request()[source]

Sends the request to the API and parses the json response. Makes some assumptions about the session length and sets the presence of a “next” token.

init_session()[source]

Defines a session object for passing requests.

session_request_counter = 0
stream()[source]

Main entry point for the data from the API. Will automatically paginate through the results via the next token, stopping once max_results tweets have been returned or max_requests API calls have been made, whichever comes first.

Usage:
>>> result_stream = ResultStream(**kwargs)
>>> stream = result_stream.stream()
>>> results = list(stream)
>>> # or for faster usage...
>>> results = list(ResultStream(**kwargs).stream())
searchtweets.result_stream.collect_results(rule, max_results=500, result_stream_args=None)[source]

Utility function to quickly get a list of tweets from a ResultStream without keeping the object around. Requires your args to be configured prior to use.

Parameters:
  • rule (str) – valid powertrack rule for your account, preferably generated by the gen_rule_payload function.
  • max_results (int) – maximum number of tweets or counts to return from the API / underlying ResultStream object.
  • result_stream_args (dict) – configuration dict that has connection information for a ResultStream object.
Returns:list of results

Example

>>> from searchtweets import collect_results
>>> tweets = collect_results(rule,
...                          max_results=500,
...                          result_stream_args=search_args)
searchtweets.result_stream.make_session(username=None, password=None, bearer_token=None, extra_headers_dict=None)[source]

Creates a Requests Session for use. Accepts a bearer token for premium users and will override username and password information if present.

Parameters:
  • username (str) – username for the session
  • password (str) – password for the user
  • bearer_token (str) – token for a premium API user.
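
Example

A brief usage sketch; the token value is a placeholder:

>>> from searchtweets.result_stream import make_session
>>> session = make_session(bearer_token="<MY_BEARER_TOKEN>")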
searchtweets.result_stream.request(*args, **kwargs)[source]
searchtweets.result_stream.retry(func)[source]

Decorator to handle API retries and exceptions. Defaults to three retries.

Parameters:func (function) – function for decoration
Returns:decorated function
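
A hedged sketch of applying the decorator; my_request here is a hypothetical function, not part of the library:

>>> from searchtweets.result_stream import retry
>>> @retry
... def my_request(session, url, rule_payload):
...     # hypothetical: any callable whose transient API failures
...     # should be retried up to three times
...     return session.post(url, data=rule_payload)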

searchtweets.utils module

Utility functions that are used in various parts of the program.

searchtweets.utils.take(n, iterable)[source]

Return first n items of the iterable as a list. Originally found in the Python itertools documentation.

Parameters:
  • n (int) – number of items to return
  • iterable (iterable) – the object to select
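
Example

An illustrative example of the documented behavior:

>>> from searchtweets.utils import take
>>> take(3, range(10))
[0, 1, 2]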
searchtweets.utils.partition(iterable, chunk_size, pad_none=False)[source]

Adapted from Toolz. Breaks an iterable into chunks of up to chunk_size items, padding the final chunk with Nones if pad_none is set.

Example

>>> from searchtweets.utils import partition
>>> iter_ = range(10)
>>> list(partition(iter_, 3))
[(0, 1, 2), (3, 4, 5), (6, 7, 8)]
>>> list(partition(iter_, 3, pad_none=True))
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, None, None)]
searchtweets.utils.merge_dicts(*dicts)[source]

Helpful function to merge / combine dictionaries and return a new dictionary.

Parameters:dicts (list or Iterable) – iterable set of dictionaries for merging.
Returns:dict with all keys from the passed list. Later dictionaries in the sequence will override duplicate keys from previous dictionaries.
Return type:dict

Example

>>> from searchtweets.utils import merge_dicts
>>> d1 = {"rule": "something has:geo"}
>>> d2 = {"maxResults": 1000}
>>> merge_dicts(*[d1, d2])
{"maxResults": 1000, "rule": "something has:geo"}
searchtweets.utils.write_result_stream(result_stream, filename_prefix=None, results_per_file=None, **kwargs)[source]

Wraps a ResultStream object to save it to a file. This function will still return all data from the result stream as a generator that wraps the write_ndjson method.

Parameters:
  • result_stream (ResultStream) – the unstarted ResultStream object
  • filename_prefix (str or None) – the base name for file writing
  • results_per_file (int or None) – the maximum number of tweets to write per file. Defaults to having no max, which means one file. Multiple files will be named by datetime, according to <prefix>_YYYY-mm-ddTHH_MM_SS.json.
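
Example

A hedged usage sketch; rule and search_args are assumed to be configured as in the earlier examples, and the generator must be consumed for data to be written:

>>> from searchtweets.utils import write_result_stream
>>> rs = ResultStream(**search_args, rule_payload=rule)
>>> for tweet in write_result_stream(rs, filename_prefix="beyonce_geo"):
...     pass  # tweets are yielded back as they are written to file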
searchtweets.utils.read_config(filename)[source]

Reads and flattens a configuration file into a single dictionary for ease of use. Works with both .config and .yaml files. Files should look like this:

search_rules:
    from-date: 2017-06-01
    to-date: 2017-09-01 01:01
    pt-rule: kanye

search_params:
    results-per-call: 500
    max-results: 500

output_params:
    save_file: True
    filename_prefix: kanye
    results_per_file: 10000000

or:

[search_rules]
from_date = 2017-06-01
to_date = 2017-09-01
pt_rule = beyonce has:geo

[search_params]
results_per_call = 500
max_results = 500

[output_params]
save_file = True
filename_prefix = beyonce
results_per_file = 10000000

Parameters:filename (str) – location of file with extension (‘.config’ or ‘.yaml’)
Returns:parsed configuration dictionary.
Return type:dict
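
Example

For illustration, a hedged sketch against a file shaped like the YAML example above; the filename is hypothetical, and dash-to-underscore normalization of the flattened keys is an assumption here:

>>> from searchtweets.utils import read_config
>>> config = read_config("search_config.yaml")
>>> # sections are flattened into one dict, e.g. config["pt_rule"] == "kanye"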

Module contents