Attended Events Data Science Guides¶
Part 1: Data Engineering¶
PredictHQ's Attended Events are scheduled to occur at a specific venue, and their impact usually depends on attendance. This How to Series lets you quickly extract the data (Part 1), explore the data (Part 2) and experiment with different aggregations (Part 3).
A How To Guide to extracting PredictHQ's Attended Events data (conferences, expos, concerts, festivals, performing-arts, sports, community).
This notebook will guide you through how to extract Attended Events for a location and time of your choice.
Setup¶
If using Google Colab, uncomment the following code block.
# %%capture
# !git clone https://github.com/predicthq/phq-data-science-docs.git
# %cd phq-data-science-docs/attended-events
# !pip install predicthq timezonefinder
If running locally, set up a Python environment using requirements.txt
shared alongside the notebook to install the required dependencies.
import pandas as pd
from predicthq import Client
from timezonefinder import TimezoneFinder
import requests
# To display more columns and with a larger width in the DataFrame
pd.set_option("display.max_columns", 50)
pd.options.display.max_colwidth = 100
Access Token¶
An Access Token is required to query the API.
The following link will guide you through creating an account and an access token.
# Replace with your own access token.
ACCESS_TOKEN = 'REPLACE_WITH_ACCESS_TOKEN'
phq = Client(access_token=ACCESS_TOKEN)
SDK Parameters¶
To search for Attended Events, start by building a parameter dictionary and adding the required filters.
parameters = dict()
Location¶
Attended events vary by location. Specifying the location ensures you will see the events relevant to that location.
The notebook provides a default location in Chicago in Northeastern Illinois.
This can be adjusted to suit a location that is of interest to you.
We can do this in two ways:
1) Using the within parameter, which combines the latitude and longitude of the location of interest with a radius and a unit for the radius.
2) Using a list of place_id values.
# Using latitude, longitude and a radius
latitude, longitude = (41.881832, -87.623177)
radius = 5
radius_unit = "km"
within = f"{radius}{radius_unit}@{latitude},{longitude}"
Alternatively, we could have used a list of place_id values for our search (see the Appendix on place_id for a detailed explanation).
# Using a list of place_id
place_ids = [4887398]
You can use either within or place_id as a filter, but you cannot use both.
parameters.update(within=within) # Comment if you want to use place_ids
# parameters.update(place__scope=place_ids) # Keep commented if you want to use lat and long
Date "YYYY-MM-DD"¶
To define the period of time for which Attended Events will be returned, set the greater-than-or-equal (active__gte) and less-than-or-equal (active__lte) parameters. This will select all Attended Events that are active within this period.
You could also use any of the following suffixes, depending on your time period of interest:
- gte - Greater than or equal.
- gt - Greater than.
- lte - Less than or equal.
- lt - Less than.
The default example in this notebook is to search for events from 2019 to 2021.
start_time = "2019-01-01"
end_time = "2021-12-31"
parameters.update(active__gte=start_time)
parameters.update(active__lte=end_time)
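If you prefer strict rather than inclusive bounds, the gt/lt variants can be swapped in. A minimal sketch, kept commented so it does not override the inclusive bounds set above:
# Alternative: strict bounds instead of inclusive ones.
# parameters.update(active__gt="2019-01-01", active__lt="2021-12-31")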
Timezone¶
By using the timezone of the location of interest, the dates provided in the active parameters are first converted to the local timezone. This ensures the appropriate events will be returned. Timezones are in TZ Database format.
For our Chicago example, the timezone is America/Chicago.
You can use TimezoneFinder() to find the timezone of your location of interest.
timezone = TimezoneFinder().timezone_at(lat=latitude, lng=longitude)
print(timezone)
America/Chicago
parameters.update(active__tz=timezone)
Rank range¶
Similar to the date period, a rank range can be set to filter events. The rank_type can be set to either rank, local_rank or aviation_rank. The rank reflects the estimated impact of an event. The local_rank reflects the estimated impact of an event in its local area by taking population density into account, which makes it useful when comparing events from multiple locations. The aviation_rank reflects the number of passengers who travel to the event by air. As a rule of thumb, here are the estimated attendance/passenger numbers for typical rank_type and rank_threshold settings:
| rank_type | rank_threshold | Approx. attendance/passengers |
|---|---|---|
| rank | 20 | ~30 |
| rank | 30 | ~100 |
| rank | 40 | ~300 |
| rank | 50 | ~1000 |
| rank | 60 | ~3000 |
| aviation_rank | 30 | ~20 |
| aviation_rank | 40 | ~40 |
| aviation_rank | 50 | ~100 |
| aviation_rank | 60 | ~200 |
# Select events according to rank_type, rank threshold.
rank_type = "rank" # Set to be either "rank", "local_rank" or "aviation_rank".
rank_threshold = 40
filter_parameter = "gte"
parameters.update({f"{rank_type}__{filter_parameter}": rank_threshold})
Categories¶
Specify a list of Attended Events categories to return.
categories = [
"conferences",
"expos",
"concerts",
"festivals",
"performing-arts",
"sports",
"community",
]
parameters.update(category=categories)
Limits¶
When pulling historical data for a large time period, many results are returned. To speed up the execution, set limit to the highest available setting (500). Each call to the API then returns up to 500 results, which generally reduces the time needed to retrieve large datasets.
parameters.update(limit=500)
Checking the parameters¶
Finally, let's take a look at the parameters we have set for our search.
parameters
{'within': '5km@41.881832,-87.623177', 'active__gte': '2019-01-01', 'active__lte': '2021-12-31', 'active__tz': 'America/Chicago', 'rank__gte': 40, 'category': ['conferences', 'expos', 'concerts', 'festivals', 'performing-arts', 'sports', 'community'], 'limit': 500}
You can check out the full list of available parameters to query our Events API for Attended Events on our Events API Resource page.
Calling the PredictHQ API and Fetching Events¶
In this step, we use PHQ's Python SDK Client to query and fetch events using the parameters we defined above.
results = []
# Iterating through all the events that match our criteria and adding them to our results
for event in phq.events.search(parameters).iter_all():
results.append(event.to_dict())
# Converting the results to a DataFrame
event_df = pd.DataFrame(results)
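As a quick sanity check, you can confirm how many events were fetched before going further:
# Report how many events matched the search parameters
print(f"Fetched {len(event_df)} events")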
Exploring the Result DataFrame and Storing it¶
We take a look at the result data and select the most important fields for our use case.
event_df.head()
[Output preview: the first five rows of event_df, showing performing-arts events in Chicago such as "SIX (Chicago)" and "Disney's Frozen (Chicago)". The columns returned are: cancelled, category, country, deleted_reason, description, duplicate_of_id, duration, end, first_seen, id, labels, location, place_hierarchies, postponed, relevance, scope, start, state, timezone, title, updated, aviation_rank, brand_safe, entities, local_rank, phq_attendance, predicted_end, private, rank.]
It is important to understand the output data. The most useful fields are the following:
- id: The unique id of the event.
- title: The title of the event.
- description: The description of the event.
- duration: The duration of the event in seconds.
- start: The start time of the event.
- end: The end time of the event.
- predicted_end: The estimated end time of the event.
- first_seen: The time when we first received the event.
- category: The category the event belongs to, e.g. concerts, sports, conferences.
- labels: Labels of the event.
- location: The longitude and latitude of the event.
- place_hierarchies: The hierarchies of place_ids for the places the event is located in.
- timezone: The timezone of the event's location.
- entities: The entities associated with the event.
- phq_attendance: The number of people expected to attend the event.
- rank: The PHQ rank of the event.
- local_rank: The local rank of the event.
- aviation_rank: The aviation rank of the event.
# Selecting the target fields
event_df = event_df[
[
"id",
"title",
"description",
"duration",
"start",
"end",
"predicted_end",
"first_seen",
"category",
"labels",
"location",
"place_hierarchies",
"timezone",
"entities",
"phq_attendance",
"rank",
"local_rank",
"aviation_rank",
]
]
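As an optional illustration of working with the nested columns, here is a minimal sketch that pulls the venue name out of the entities field. It assumes, as in the preview above, that each entity is a dict with 'type' and 'name' keys; extract_venue_name is a hypothetical helper, not part of the SDK.
# Hypothetical helper: return the name of the first venue entity, if any.
def extract_venue_name(entities):
    if not isinstance(entities, list):
        return None
    for entity in entities:
        if entity.get("type") == "venue":
            return entity.get("name")
    return None
# Add a convenience column with the venue name where one is available.
event_df["venue_name"] = event_df["entities"].apply(extract_venue_name)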
# Creating a filename for our DataFrame and saving our final DataFrame as a CSV file
if "within" in parameters:
file_name = (
f"radius{radius}{radius_unit}_{latitude}_{longitude}_{start_time}_{end_time}_"
+ f"{rank_type}_{filter_parameter}_{rank_threshold}"
)
else:
file_name = (
f"place_ids_{'_'.join(place_ids)}_{start_time}_{end_time}_"
+ f"{rank_type}_{filter_parameter}_{rank_threshold}"
)
event_df.to_csv(f"data/event_data/{file_name}.csv", index=False)
print(f"DataFrame saved to data/event_data/{file_name}.csv")
DataFrame saved to data/event_data/radius5km_41.881832_-87.623177_2019-01-01_2021-12-31_rank_gte_40.csv
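Note that to_csv assumes the data/event_data directory already exists. If it might not, a small sketch to create it first (using the same folder layout as above):
from pathlib import Path
# Make sure the output directory exists before writing the CSV.
output_dir = Path("data/event_data")
output_dir.mkdir(parents=True, exist_ok=True)
event_df.to_csv(output_dir / f"{file_name}.csv", index=False)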
Appendix: Finding place_id¶
Here is a guide on how to link store locations to place_id values. We present three options:
- Query place_id based on a location name. Here the location name could be a city, a state, a country or a continent.
- Query place_hierarchies based on latitude and longitude.
- Query the location name based on place_id.
The full list of parameters that you could use in your query is documented at our Places API page.
PredictHQ uses the GeoNames places convention.
1) Query place_id based on location name¶
By using PredictHQ's Places API, you can find the place_id for a specific location name. By calling the API and setting q to the location name, the API will return the most relevant places. Taking the top result provides the most relevant place_id for that location name.
# Example location names.
locations = ["Chicago", "Cook County", "United States", "North America"]
place_id_results = []
for location in locations:
    response = requests.get(
        url="https://api.predicthq.com/v1/places/",
        headers={
            "Authorization": f"Bearer {ACCESS_TOKEN}",
            "Accept": "application/json",
        },
        params={"q": location},
    )
    data = response.json()
    df = pd.json_normalize(data["results"])
    # Keep the top (most relevant) result for each location name.
    place_id_results.append(df.iloc[[0]])
# DataFrame.append was removed in pandas 2.x, so build the lookup with pd.concat.
place_id_lookup = pd.concat(place_id_results, ignore_index=True)
place_id_lookup[["id", "name", "type"]]
|   | id | name | type |
|---|---|---|---|
| 0 | 4887398 | Chicago | locality |
| 1 | 4888671 | Cook County | county |
| 2 | 6252001 | United States | country |
| 3 | 6255149 | North America | continent |
2) Query place_hierarchies based on latitude and longitude¶
By using the Places Hierarchies endpoint of PredictHQ's Places API, you can find the place_hierarchies for a specific latitude and longitude. By calling the API and setting location.origin to latitude,longitude, the API will return the most relevant place_hierarchies.
# Example locations.
latitude_longitudes = [[41.881832, -87.623177]]
place_hierarchies_results = []
for latitude_longitude in latitude_longitudes:
    latitude, longitude = latitude_longitude
    response = requests.get(
        url="https://api.predicthq.com/v1/places/hierarchies",
        headers={
            "Authorization": f"Bearer {ACCESS_TOKEN}",
            "Accept": "application/json",
        },
        params={"location.origin": f"{latitude},{longitude}"},
    )
    data = response.json()
    df = pd.DataFrame(data)
    df["latitude"] = latitude
    df["longitude"] = longitude
    place_hierarchies_results.append(df)
# DataFrame.append was removed in pandas 2.x, so build the lookup with pd.concat.
place_hierarchies_lookup = pd.concat(place_hierarchies_results, ignore_index=True)
place_hierarchies_lookup
|   | place_hierarchies | latitude | longitude |
|---|---|---|---|
| 0 | [6295630, 6255149, 6252001, 4896861, 4888671, 4887398] | 41.881832 | -87.623177 |
For each latitude,longitude pair, the response might include more than one hierarchy. The reason is that we try to match the closest place's hierarchy, but we also include the closest major city's hierarchy within a radius of 50km. This only applies if the level is below region and, if it exists, the major city's hierarchy will always be the second row of the DataFrame.
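As a sketch of how the returned hierarchy might be used, the most specific place (the locality) is the last id in the first hierarchy, and could then be used with the place__scope filter shown earlier (locality_place_id is just an illustrative variable name):
# Take the closest hierarchy and read off its most specific place_id.
first_hierarchy = place_hierarchies_lookup["place_hierarchies"].iloc[0]
locality_place_id = first_hierarchy[-1]  # 4887398 corresponds to Chicago
print(locality_place_id)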
3) Query location name based on place_id¶
By using PredictHQ's Places API, you can find the location name for a specific place_id. By calling the API and setting id to a place_id, the API will return the matching place. Taking the top result provides the most relevant location name for that place_id.
# Example place_ids.
place_ids = ["6295630", "6255149", "6252001", "4896861", "4888671", "4887398"]
location_results = []
for place_id in place_ids:
    response = requests.get(
        url="https://api.predicthq.com/v1/places/",
        headers={
            "Authorization": f"Bearer {ACCESS_TOKEN}",
            "Accept": "application/json",
        },
        # The id parameter can also take a comma-separated list of place_ids.
        # In this example, we query each place_id separately.
        params={"id": place_id},
    )
    data = response.json()
    df = pd.json_normalize(data["results"])
    # Keep the top (most relevant) result for each place_id.
    location_results.append(df.iloc[[0]])
# DataFrame.append was removed in pandas 2.x, so build the lookup with pd.concat.
location_lookup = pd.concat(location_results, ignore_index=True)
location_lookup[["id", "name", "type"]]
|   | id | name | type |
|---|---|---|---|
| 0 | 6295630 | Earth | planet |
| 1 | 6255149 | North America | continent |
| 2 | 6252001 | United States | country |
| 3 | 4896861 | Illinois | region |
| 4 | 4888671 | Cook County | county |
| 5 | 4887398 | Chicago | locality |