Live TV Events Data Science Guides¶
Part 1: Data Engineering¶
PredictHQ’s Live TV Events data includes viewership predictions for the seven top US sports leagues: NFL, NBA, NHL, MLB, D1 NCAA Basketball, D1 NCAA Football, and MLS. Our TV viewership data is designed for data scientists to improve forecasting at the county and store level. This How to Series allows you to quickly extract the data (Part 1), explore the data (Part 2), and experiment with different aggregations (Part 3).
A How To Guide to extracting data from PredictHQ's Live TV Events.
Setup¶
If using Google Colab, uncomment the following code block.
# %%capture
# !git clone https://github.com/predicthq/phq-data-science-docs.git
# %cd phq-data-science-docs/live-tv-events
# !pip install predicthq timezonefinder
If running locally, set up a Python environment using requirements.txt
shared alongside the notebook to install the required dependencies.
import pandas as pd
from datetime import datetime
from datetime import timedelta
from timezonefinder import TimezoneFinder
import numpy as np
import pytz
from predicthq import Client
import requests
Access Token¶
To query the API, you will need an access token with the Broadcasts scope. If you have previously used the PredictHQ API to search and use events, you may still need to create a new access token to query broadcasts.
The following link will guide you through creating an account and access token.
# Replace Access Token with own access token.
ACCESS_TOKEN = 'REPLACE_WITH_ACCESS_TOKEN'
phq = Client(access_token=ACCESS_TOKEN)
Live TV Events data is available through the Broadcasts API.
Events Coverage¶
The Broadcasts API returns each live sports broadcast for the following seven sports leagues in the US:
- NFL
- NBA
- NHL
- MLB
- D1 NCAA Basketball
- D1 NCAA Football
- MLS
(Only live games are included. There are no replays.)
Spatial Granularity¶
Data is available for the United States at county-level granularity.
Features¶
Each broadcast is provided with predicted viewership at the US county level. Additional data is available about the event, such as physical location and duration.
Date Availability¶
January 1, 2018 to 2 weeks into the future.
Support Functions¶
Each broadcast relates to a physical sports event from the PredictHQ events knowledge graph. Additional data about the actual event is also returned. For example: the league and sport of the broadcast are included within the event's labels field. The following functions make it easier to extract the sport and league for each broadcast.
def extract_matching_label(event_labels, labels_to_match):
"""
For each broadcast the league and sport type need to be
extracted. These labels are extracted from the labels.
As the order of the labels varies this look up is
required to compare to the frozenset of options.
"""
for label in labels_to_match:
if label in event_labels:
return label
return None
SPORTS = frozenset([
'american-football',
'baseball',
'basketball',
'ice-hockey',
'soccer',
])
LEAGUES = frozenset([
'mlb',
'mls',
'nba',
'ncaa',
'nfl',
'nhl',
])
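As a quick sanity check, the helper can be exercised standalone with a hypothetical labels list (the function and frozensets are repeated here so the snippet runs on its own; real events carry additional labels):

```python
SPORTS = frozenset(['american-football', 'baseball', 'basketball', 'ice-hockey', 'soccer'])
LEAGUES = frozenset(['mlb', 'mls', 'nba', 'ncaa', 'nfl', 'nhl'])

def extract_matching_label(event_labels, labels_to_match):
    # Return the first candidate label present in the event's labels.
    for label in labels_to_match:
        if label in event_labels:
            return label
    return None

# Hypothetical labels list for an NBA broadcast.
labels = ['sport', 'basketball', 'nba']
print(extract_matching_label(labels, SPORTS))   # basketball
print(extract_matching_label(labels, LEAGUES))  # nba
```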
def convert_timezone(row):
"""Convert event predicted end time to broadcast location timezone from the event timezone."""
event_end_naive = row['dates_event']['predicted_end_local']
event_timezone = pytz.timezone(row['dates_event']['timezone'])
event_end_localtime = event_timezone.localize(event_end_naive, is_dst=None)
event_end_utc = event_end_localtime.astimezone(pytz.utc)
broadcast_timezone = row['dates_broadcast']['timezone']
broadcast_end_localtime = event_end_utc.astimezone(pytz.timezone(broadcast_timezone))
row['predicted_end_time_broadcast_local'] = broadcast_end_localtime.replace(tzinfo=None)
return row
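The conversion above follows the standard pytz pattern: localize the naive event-local end time, move through UTC, then re-localize to the broadcast timezone. A minimal standalone sketch with hypothetical timezones and times:

```python
import pytz
from datetime import datetime

# Hypothetical example: an event ends at 22:00 New York time
# and is broadcast in Los Angeles.
event_tz = pytz.timezone('America/New_York')
broadcast_tz = pytz.timezone('America/Los_Angeles')

event_end_naive = datetime(2021, 1, 15, 22, 0)
event_end_local = event_tz.localize(event_end_naive, is_dst=None)  # attach event tz
event_end_utc = event_end_local.astimezone(pytz.utc)               # normalize to UTC
broadcast_end_local = event_end_utc.astimezone(broadcast_tz)       # re-localize

print(broadcast_end_local.replace(tzinfo=None))  # 2021-01-15 19:00:00
```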
SDK Parameters¶
We will create a dictionary of notable parameters and walk through each of the settings to use in the SDK call.
parameters_dict = dict()
Viewership Limits phq_viewership__gte=100¶
- We recommend filtering for broadcasts with a viewership greater than or equal to 100. This removes the smallest, noisiest broadcasts, and the threshold is customizable to your use case.
parameters_dict.update(phq_viewership__gte=100)
Limits limit=500¶
- When pulling historical data for a large time period, many results are returned. To speed up execution, set limit to the highest available setting (500). Each call to the API then returns 500 results, which generally reduces the time needed to retrieve large datasets.
parameters_dict.update(limit=500)
Location¶
There are two options available to specify a county of interest. The first is to use the place_id of the county. The second is to use a latitude and longitude somewhere within the county. If you provide both, they must both relate to the same county. It is recommended that you only provide one.
location__place_id='place_id'
or location__origin='lat,long'
If you do not know the county's place_id or its latitude and longitude, the Appendix provides a mapping between FIPS codes and place_ids.
For the SDK call, you can specify your own counties of interest. Here are four default counties to query as an example:
'Clark County, Nevada': 5501879
'Los Angeles County, California': 5368381
'Cook County, Chicago, Illinois': 4888671
'Harris County, Houston, Texas': 4696376
parameters_dict.update(location__origin='34.05223,-118.24368')
Time Limits start__gte='2019-01-01', start__lte='2021-01-15'¶
- To define the period of time for which broadcasts will be returned, set the greater-than-or-equal (gte) and less-than-or-equal (lte) parameters for start. This selects all broadcasts that start within this period.
You can use any of these comparison parameters, depending on your time period of interest:
- gte - Greater than or equal.
- gt - Greater than.
- lte - Less than or equal.
- lt - Less than.
start__tz allows you to set the timezone to align with the location of interest. If no start__tz is provided, UTC is used as the default. This can lead to missing broadcasts at the edges of your time period: a broadcast may fall outside the date range in UTC while falling inside it in the local timezone. The datetime of each returned broadcast is provided in the local timezone.
parameters_dict.update(start__tz='America/Chicago')
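To see why the timezone matters, consider a broadcast starting at 20:00 local time in Los Angeles on the last day of a requested range; in UTC it falls on the next day and would be excluded by a UTC-based filter (illustrative dates):

```python
import pytz
from datetime import datetime

la = pytz.timezone('America/Los_Angeles')

# A broadcast starting late on the last day of the requested range.
local_start = la.localize(datetime(2021, 2, 14, 20, 0))
utc_start = local_start.astimezone(pytz.utc)

print(utc_start)  # 2021-02-15 04:00:00+00:00
# With start__lte='2021-02-14' interpreted in UTC, this broadcast is missed;
# with start__tz='America/Los_Angeles', it is included.
```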
The tz database is a useful source for finding timezone names.
# timezonefinder will help to easily find a timezone from lat long.
timezone = TimezoneFinder().timezone_at(lat=34.05223, lng=-118.24368)
print(timezone)
America/Los_Angeles
# Set your chosen start and end date.
START_DATE = '2019-01-01'
END_DATE = '2021-02-14'
parameters_dict.update(start__gte=START_DATE, start__lte=END_DATE)
parameters_dict.update(start__tz='America/Los_Angeles')
# For example:
parameters_dict
{'phq_viewership__gte': 100, 'limit': 500, 'location__origin': '34.05223,-118.24368', 'start': {'gte': '2019-01-01', 'lte': '2021-02-14'}, 'start__tz': 'America/Los_Angeles'}
SDK Call¶
Loop through the call to the Broadcasts API for each county of interest.
Not all broadcasts will be returned for each county. For example, if a county has low broadcast coverage (less than 45% of the county population have access to the broadcast), the broadcast is not available. A broadcast may also be excluded by the phq_viewership parameter if it is forecast to have low viewership in that county.
The data for each county is saved to CSV as an example output. This can be adjusted to work with your own data pipeline.
# To run for your own counties of interest.
# Either replace list of county ids.
# Or replace list of lat and long.
LIST_OF_COUNTIES = [5501879, 5368381, 4888671, 4696376]
LIST_OF_LAT_LONG = ['42.0909,-71.2643', '33.9534,-118.3387', '39.0489,-94.4839', '37.4034,-121.9694']
START_DATE = '2019-01-01'
END_DATE = '2021-02-14'
In the following example:
- The start__tz parameter is not specified. This means the start__gte and start__lte values provided are treated as UTC by the API when searching.
- County latitudes and longitudes are used. Uncomment or comment the relevant lines if using county place_ids instead.
# Define API parameters.
parameters_dict = dict()
parameters_dict.update(phq_viewership__gte=100)
parameters_dict.update(limit=500)
parameters_dict.update(start__gte=START_DATE, start__lte=END_DATE)
# Loop through each location of interest.
# Example code is provided to loop through either LIST_OF_COUNTIES or LIST_OF_LAT_LONG.
#for place_id in LIST_OF_COUNTIES: # uncomment/comment as required.
for lat_long in LIST_OF_LAT_LONG: # uncomment/comment as required.
#parameters_dict.update(location__place_id=place_id) # uncomment/comment as required.
parameters_dict.update(location__origin=lat_long) # uncomment/comment as required.
search_results = phq.broadcasts.search(parameters_dict).iter_all()
search_results = [result.to_dict() for result in search_results]
df = pd.DataFrame(search_results)
# Extract additional information: 'event' stores the additional
# data about the physical event.
df = df.merge(df['event'].apply(pd.Series),
left_index=True,
right_index=True,
suffixes=('_broadcast', '_event'))
# Extract sport and league from the labels in the nested event data.
df['sport'] = df.labels.apply(extract_matching_label, args=(SPORTS,))
df['league'] = df.labels.apply(extract_matching_label, args=(LEAGUES,))
df['local_start_date'] = (df.dates_broadcast
.apply(
lambda start_dt:
(start_dt['start_local']).date()
)
)
df['county_place_id'] = (df.location_broadcast
.apply(
lambda location:
location['places'][0]['place_id']
)
)
df['local_start_datetime'] = (df.dates_broadcast
.apply(
lambda start_dt:
(start_dt['start_local'])
)
)
# Check for any events without a predicted end time.
# All broadcasts are expected to have a predicted end time.
broadcast_id_no_endtime = [
    row['broadcast_id'] for _, row in df.iterrows()
    if not row.get('dates_event', {}).get('predicted_end_local')
]
# Remove any broadcasts without a predicted end time.
df = df[~df['broadcast_id'].isin(broadcast_id_no_endtime)]
# Convert the predicted end time of the event to broadcast timezone.
df = df.apply(convert_timezone, axis=1)
df['sport_league'] = df['sport'] + '_' + df['league']
# Calculate the duration of the broadcast.
df['duration'] = df['predicted_end_time_broadcast_local'] - df['local_start_datetime']
df['duration_hours'] = df['duration'].dt.total_seconds() / (60 * 60)  # total_seconds() also handles durations over 24 hours
df['total_viewing'] = df['duration_hours'] * df['phq_viewership']
# Save dataframe to csv
county = df['county_place_id'].unique()[0]
df.to_csv('data/tv_events_data/{}_county_raw.csv'.format(county),
index=False)
The returned data is at the broadcast level. The broadcasts returned for the selected counties are those that met the parameters of the SDK call. In Part 2 of this How to Series we will explore this data to understand the key trends. In Part 3 we'll prepare features to be used in a forecasting model.
Output Dataframe¶
It is important to understand the output data.
A key aspect to be familiar with is which data fields relate to the broadcast and which relate to the physical sports event that the broadcast is showing. The fields extracted from the nested event object all relate to the physical event.
For absolute clarity in the returned dataframe, the following columns relate to the broadcast:
- broadcast_id
- updated
- dates_broadcast
- location_broadcast
- phq_viewership
- record_status
- broadcast_status
- local_start_date
- local_start_datetime
- county_place_id
- predicted_end_time_broadcast_local
- total_viewing
And the following columns relate to the actual physical event (many of which are relevant additional data about the broadcast):
- event
- event_id
- title
- category
- labels
- dates_event
- location_event
- entities
- phq_attendance
- phq_rank
- local_rank
- aviation_rank
- sport
- league
- sport_league
- duration
- duration_hours
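If it helps downstream processing, a broadcast-only view can be selected programmatically using the column list above. A minimal sketch on a toy dataframe (in practice, pass the df produced by the SDK call, which carries the full set of columns):

```python
import pandas as pd

# Broadcast-level columns, per the list above.
BROADCAST_COLS = [
    'broadcast_id', 'updated', 'dates_broadcast', 'location_broadcast',
    'phq_viewership', 'record_status', 'broadcast_status', 'local_start_date',
    'local_start_datetime', 'county_place_id',
    'predicted_end_time_broadcast_local', 'total_viewing',
]

# Toy frame mixing broadcast and event columns for illustration.
df = pd.DataFrame({'broadcast_id': ['abc'], 'phq_viewership': [100], 'event_id': ['xyz']})
broadcast_view = df[[c for c in BROADCAST_COLS if c in df.columns]]
print(list(broadcast_view.columns))  # ['broadcast_id', 'phq_viewership']
```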
df.head(2)
broadcast_id | updated | dates_broadcast | location_broadcast | phq_viewership | record_status | broadcast_status | event | event_id | title | ... | sport | league | local_start_date | county_place_id | local_start_datetime | predicted_end_time_broadcast_local | sport_league | duration | duration_hours | total_viewing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8Cy3b48jHZAPdxbDRbkaQk | 2020-12-03 10:08:04+00:00 | {'start': 2019-01-01 00:00:00+00:00, 'start_lo... | {'geopoint': {'lat': 36.2152, 'lon': -115.0135... | 68084 | active | scheduled | {'event_id': 'S8F7FMsKiiU4q8UF67', 'title': 'N... | S8F7FMsKiiU4q8UF67 | Northwestern Wildcats vs Utah Utes | ... | american-football | ncaa | 2018-12-31 | 5501879 | 2018-12-31 16:00:00 | 2018-12-31 19:20:00 | american-football_ncaa | 0 days 03:20:00 | 3.333333 | 226946.666667 |
1 | 93LTY9p25MRrj39sGYxepc | 2020-12-05 05:41:16+00:00 | {'start': 2019-01-01 00:00:00+00:00, 'start_lo... | {'geopoint': {'lat': 36.2152, 'lon': -115.0135... | 18805 | active | scheduled | {'event_id': 'usqZVdBrXwBLVQfsRG', 'title': 'B... | usqZVdBrXwBLVQfsRG | Boston Celtics vs San Antonio Spurs | ... | basketball | nba | 2018-12-31 | 5501879 | 2018-12-31 16:00:00 | 2018-12-31 18:20:00 | basketball_nba | 0 days 02:20:00 | 2.333333 | 43878.333333 |
2 rows × 29 columns
Appendix: FIPS¶
Either place_id or latitude and longitude can be used as a location parameter in the call to the SDK. If neither of these is available, we also provide a mapping between FIPS codes and place_ids.
PredictHQ follows the GeoNames places convention: https://www.geonames.org/
Location FIPS Code¶
# We provide a lookup between FIPS code and place_id. A geoname_id is equivalent to a place_id.
mapping = pd.read_csv('data/geo_data/geoname_to_fips_mapping.csv')
mapping.head()
geoname_id | county_name | county_fips | |
---|---|---|---|
0 | 4047434 | Russell County | 1113 |
1 | 4048080 | Long County | 13183 |
2 | 4048522 | Boone County | 21015 |
3 | 4048572 | Rowan County | 21205 |
4 | 4049189 | Bibb County | 1007 |
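With the mapping loaded, a FIPS-to-place_id lookup is a simple filter. A sketch using a toy frame with the same columns (in practice, pass the mapping dataframe loaded above from geoname_to_fips_mapping.csv):

```python
import pandas as pd

# Toy frame with the same columns as geoname_to_fips_mapping.csv.
mapping = pd.DataFrame({
    'geoname_id': [4047434, 4048080],
    'county_name': ['Russell County', 'Long County'],
    'county_fips': [1113, 13183],
})

def fips_to_place_id(mapping, fips):
    """Return the place_id (geoname_id) for a county FIPS code, or None if absent."""
    match = mapping.loc[mapping['county_fips'] == fips, 'geoname_id']
    return int(match.iloc[0]) if not match.empty else None

print(fips_to_place_id(mapping, 13183))  # 4048080
```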