NON-ATTENDED EVENTS DATA SCIENCE GUIDES¶
Non-Attended Events have a start and end date but are more fluid in impact, such as observances, public holidays and school holidays. This How To series allows you to quickly extract the data (Part 1), explore the data (Part 2) and experiment with different aggregations (Part 3).
Part 1 Data Engineering¶
A How To Guide to extracting data from PredictHQ's Non-Attended Events data (public-holidays, observances and school-holidays).
This notebook will guide you through how to extract Non-Attended Events for a location and time of your choice.
Setup¶
If using Google Colab uncomment the following code block.
# %%capture
# !git clone https://github.com/predicthq/phq-data-science-docs.git
# %cd phq-data-science-docs/unattended-events
# !pip install predicthq timezonefinder pandas==1.0.5
If running locally, set up a Python environment using requirements.txt
shared alongside the notebook to install the required dependencies.
import pandas as pd
from predicthq import Client
from timezonefinder import TimezoneFinder
import requests
# To display more columns and with a larger width in the DataFrame
pd.set_option("display.max_columns", 50)
pd.options.display.max_colwidth = 100
Access Token¶
An Access Token is required to query the API.
The following link will guide you through creating an account and an access token.
# Replace Access Token with own access token.
ACCESS_TOKEN = 'REPLACE_WITH_ACCESS_TOKEN'
phq = Client(access_token=ACCESS_TOKEN)
SDK Parameters¶
To search for Non-Attended Events, start by building a parameter dictionary and adding the required filters.
parameters = dict()
Location¶
Observances, public holidays and school holidays change by location. Specifying the location ensures you will see the relevant events for the location.
The notebook provides a default location in Los Angeles, California.
This can be adjusted to suit a location that is of interest to you.
We can do this in two ways:
1) Using the within parameter, which contains the latitude and longitude of the location of interest, with a radius and a unit for the radius.
2) Using a list of place_ids.
The result is relatively insensitive to the setting of radius as Non-Attended Events have large location scopes. We recommend a default radius of 10.
Note: Radius is a key parameter when retrieving events from other categories, for example Attended Events such as concerts and sports games.
# Using latitude, longitude and a radius
latitude, longitude = (34.07, -118.25)
radius = 10
radius_unit = "km"
within = f"{radius}{radius_unit}@{latitude},{longitude}"
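As a minimal sketch, the within string can be assembled by a small helper that also checks its inputs. The helper name build_within and the set of accepted units are assumptions made for illustration and are not part of the PredictHQ SDK; confirm the supported units in the API documentation.

```python
def build_within(latitude, longitude, radius, unit="km"):
    """Return a filter string in the '{radius}{unit}@{lat},{lon}' format used above."""
    # Assumed unit list for illustration; confirm against the API documentation.
    allowed_units = {"m", "km", "ft", "mi"}
    if unit not in allowed_units:
        raise ValueError(f"unit must be one of {sorted(allowed_units)}")
    if radius <= 0:
        raise ValueError("radius must be positive")
    if not -90 <= latitude <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude must be between -180 and 180")
    return f"{radius}{unit}@{latitude},{longitude}"


within = build_within(34.07, -118.25, radius=10, unit="km")
print(within)  # 10km@34.07,-118.25
```

Validating the coordinates up front catches swapped latitude and longitude values before a request is sent.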
Alternatively, we could have used a list of place_id for our search (See our Appendix on Place IDs for detailed explanation).
# Using a list of place_id
place_ids = [5368361]
You can use either within or place_id as a filter, but you cannot use both.
parameters.update(within=within) # Comment if you want to use place_ids
# parameters.update(place__scope=place_ids) # Comment if you want to use lat and long
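Because the two location filters are mutually exclusive, it can help to set them through one function that enforces the rule. The sketch below is illustrative only: set_location_filter is a hypothetical helper, and it works on a plain dictionary like the parameters dictionary used in this notebook.

```python
def set_location_filter(parameters, within=None, place_ids=None):
    """Set exactly one location filter on the parameter dictionary."""
    # Exactly one of the two arguments must be given.
    if (within is None) == (place_ids is None):
        raise ValueError("Provide exactly one of `within` or `place_ids`.")
    # Remove any previously set location filter so the two never coexist.
    parameters.pop("within", None)
    parameters.pop("place__scope", None)
    if within is not None:
        parameters["within"] = within
    else:
        parameters["place__scope"] = list(place_ids)
    return parameters


example_parameters = {}
set_location_filter(example_parameters, within="10km@34.07,-118.25")
# Switching to place ids replaces the earlier `within` filter.
set_location_filter(example_parameters, place_ids=[5368361])
print(example_parameters)  # {'place__scope': [5368361]}
```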
Date "YYYY-MM-DD"¶
To define the period of time for which Non-Attended Events will be returned, set the greater than or equal (active__gte) and less than or equal (active__lte) parameters. This will select all Non-Attended Events that are active within this period.
You could also use any of these suffixes, depending on your time period of interest:
gte - Greater than or equal.
gt - Greater than.
lte - Less than or equal.
lt - Less than.
The default example in this notebook is to search for the whole of 2020.
start_time = "2020-01-01"
end_time = "2020-12-31"
parameters.update(active__gte=start_time)
parameters.update(active__lte=end_time)
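The date strings must follow the "YYYY-MM-DD" format, so a quick check before querying can catch typos. The following is a minimal sketch using only the standard library; active_range is a hypothetical helper name, not an SDK function.

```python
from datetime import date


def active_range(start, end):
    """Validate two 'YYYY-MM-DD' strings and return the active filters."""
    # date.fromisoformat raises ValueError for malformed or impossible dates.
    start_date = date.fromisoformat(start)
    end_date = date.fromisoformat(end)
    if end_date < start_date:
        raise ValueError("end date must not be earlier than start date")
    return {
        "active__gte": start_date.isoformat(),
        "active__lte": end_date.isoformat(),
    }


filters = active_range("2020-01-01", "2020-12-31")
# Number of calendar days covered, counting both endpoints.
days_covered = (date(2020, 12, 31) - date(2020, 1, 1)).days + 1
print(filters)
print(days_covered)  # 366, because 2020 is a leap year
```

The returned dictionary can be passed to parameters.update(...) in the same way as the individual calls above.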
Timezone¶
Setting the timezone for the location of interest ensures that the appropriate events are returned. Timezone names follow the tz database.
For our Los Angeles example it is America/Los_Angeles.
Use TimezoneFinder() to find it for our location of interest.
timezone = TimezoneFinder().timezone_at(lat=latitude, lng=longitude)
print(timezone)
America/Los_Angeles
parameters.update(active__tz=timezone)
Categories¶
Specify a list of Non-Attended Events categories to return.
['school-holidays', 'public-holidays', 'observances']
categories = ["school-holidays", "public-holidays", "observances"]
parameters.update(category=categories)
Checking the parameters¶
Finally, let's take a look at the parameters we have set for our search.
parameters
{'within': '10km@34.07,-118.25', 'active__gte': '2020-01-01', 'active__lte': '2020-12-31', 'active__tz': 'America/Los_Angeles', 'category': ['school-holidays', 'public-holidays', 'observances']}
You can check out the full list of available parameters that you could use in querying Non-Attended Events at our Events Resource page.
Calling the PredictHQ API and Fetching Events¶
In this step, we use PHQ Python SDK Client to query and fetch events based on the parameters we defined above.
results = []
# Iterating through all the events that match our criteria and adding them to our results
for event in phq.events.search(parameters).iter_all():
    results.append(event.to_dict())
# Converting the results to a DataFrame
event_df = pd.DataFrame(results)
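If you prefer to call the REST endpoint directly with requests, as the appendix does for the Places API, the SDK-style keys need to be translated. The sketch below assumes that double underscores map to dotted names (active__gte becomes active.gte) and that list values are sent as comma-separated strings; both are assumptions to verify against the Events API documentation, and to_rest_params is a hypothetical helper.

```python
def to_rest_params(sdk_params):
    """Convert SDK-style keys and list values into query-string style."""
    rest_params = {}
    for key, value in sdk_params.items():
        # Assumption: 'active__gte' in the SDK corresponds to 'active.gte'.
        name = key.replace("__", ".")
        # Assumption: multiple values are sent as one comma-separated string.
        if isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)
        rest_params[name] = value
    return rest_params


sdk_parameters = {
    "within": "10km@34.07,-118.25",
    "active__gte": "2020-01-01",
    "active__lte": "2020-12-31",
    "active__tz": "America/Los_Angeles",
    "category": ["school-holidays", "public-holidays", "observances"],
}
rest_parameters = to_rest_params(sdk_parameters)
for name, value in rest_parameters.items():
    print(f"{name} = {value}")
```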
Exploring the Result DataFrame and Storing it¶
We take a look at the result data and select the most important fields for our use case.
event_df.head()
cancelled | category | country | deleted_reason | description | duplicate_of_id | duration | end | first_seen | id | labels | location | place_hierarchies | postponed | relevance | scope | start | state | timezone | title | updated | aviation_rank | brand_safe | entities | local_rank | phq_attendance | predicted_end | private | rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | observances | US | None | New Year's Eve is the last day of the year in the Gregorian calendar. Many parties to welcome th... | None | 86399 | 2020-12-31 23:59:59+00:00 | 2017-01-04 23:07:03+00:00 | BJmqLY9kQNqw | [holiday, observance] | [-95.712891, 37.09024] | [[6295630, 6255149, 6252001]] | None | 1.0 | country | 2020-12-31 00:00:00+00:00 | active | None | New Year's Eve | 2021-03-03 01:47:16+00:00 | 0 | None | [{'entity_id': 'huZ4EThvNDrFWjjAbDcLy4', 'name': 'New Year's Eve', 'type': 'event-group', 'forma... | None | None | None | False | 90 |
1 | None | observances | US | None | Kwanzaa is a week-long holiday honoring the culture and traditions of African people and their d... | None | 86399 | 2020-12-26 23:59:59+00:00 | 2017-01-04 23:06:54+00:00 | QDPqxb7ll7mM | [holiday, observance] | [-95.712891, 37.09024] | [[6295630, 6255149, 6252001]] | None | 1.0 | country | 2020-12-26 00:00:00+00:00 | active | None | Kwanzaa (first day) | 2021-03-03 01:17:03+00:00 | 0 | None | [{'entity_id': 'hwGSnBtjr2YsYVWTAjVUy4', 'name': 'Kwanzaa (first day)', 'type': 'event-group', '... | None | None | None | False | 50 |
2 | None | public-holidays | US | None | Christmas Day celebrates Jesus Christ's birth. | None | 86399 | 2020-12-25 23:59:59+00:00 | 2017-01-04 23:06:52+00:00 | b3xEqLza0Nz0 | [holiday, holiday-christian, holiday-national, holiday-religious] | [-95.712891, 37.09024] | [[6295630, 6255149, 6252001]] | None | 1.0 | country | 2020-12-25 00:00:00+00:00 | active | None | Christmas Day | 2021-03-03 01:41:34+00:00 | 100 | None | [{'entity_id': 'huZZt9gWmDyBzwcHEmNiCc', 'name': 'Christmas Day', 'type': 'event-group', 'format... | None | None | None | False | 90 |
3 | None | observances | US | None | Christmas Eve in the United States is on December 24 each year. | None | 86399 | 2020-12-24 23:59:59+00:00 | 2017-01-04 23:06:51+00:00 | lljqM1zoRGVk | [holiday, holiday-christian, holiday-religious, observance] | [-95.712891, 37.09024] | [[6295630, 6255149, 6252001]] | None | 1.0 | country | 2020-12-24 00:00:00+00:00 | active | None | Christmas Eve | 2021-03-03 01:42:51+00:00 | 0 | None | [{'entity_id': 'hnmD6GjKBwXLdLKVBxtmcc', 'name': 'Christmas Eve', 'type': 'event-group', 'format... | None | None | None | False | 50 |
4 | None | public-holidays | US | None | Christmas Eve in the United States is on December 24 each year. | None | 86399 | 2020-12-24 23:59:59+00:00 | 2020-12-23 00:01:21+00:00 | 2GzBLK5LKTFQqdMj7n | [holiday, holiday-national] | [-95.712891, 37.09024] | [[6295630, 6255149, 6252001]] | None | 1.0 | country | 2020-12-24 00:00:00+00:00 | active | None | Christmas Eve | 2021-03-03 01:48:01+00:00 | 100 | None | [{'entity_id': 'hnmD6GjKBwXLdLKVBxtmcc', 'name': 'Christmas Eve', 'type': 'event-group', 'format... | None | None | None | False | 90 |
It is important to understand the output data. The most useful fields are the following:
id - The unique id of each event.
title - The title of each event.
description - The description of each event.
start - The start time of each event.
end - The end time of each event.
duration - Duration of event in seconds.
category - Category of events, e.g. school-holidays, public-holidays, observances.
labels - Labels of each event.
country - Country of each event.
rank - PHQ rank of each event.
aviation_rank - Aviation rank of each event.
location - Longitude and latitude of each event (in that order in the sample output above).
place_hierarchies - The place id hierarchies of each event.
scope - The scope of each event.
first_seen - The time when we received this event.
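To make a few of these fields concrete, the sketch below works through one record by hand. The values are copied from the first row of the sample output above (New Year's Eve); the record is a plain dictionary rather than a DataFrame row so the example runs without any dependencies.

```python
from datetime import datetime

# Values copied from the first row of the sample output above.
event = {
    "title": "New Year's Eve",
    "start": "2020-12-31 00:00:00+00:00",
    "end": "2020-12-31 23:59:59+00:00",
    "duration": 86399,
    "location": [-95.712891, 37.09024],
    "place_hierarchies": [[6295630, 6255149, 6252001]],
}

start = datetime.fromisoformat(event["start"])
end = datetime.fromisoformat(event["end"])

# duration is the number of seconds between start and end.
computed_duration = int((end - start).total_seconds())

# In this sample the first value can only be a longitude (it is below -90).
longitude, latitude = event["location"]

# The last id in a hierarchy is its most specific place.
most_specific_place_id = event["place_hierarchies"][0][-1]

print(computed_duration == event["duration"])  # True
print(latitude, longitude)
print(most_specific_place_id)  # 6252001
```

Here the most specific place id is 6252001, which matches the country scope shown for this event.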
# Selecting the target fields
event_df = event_df[
    [
        "id",
        "title",
        "description",
        "start",
        "end",
        "duration",
        "category",
        "labels",
        "country",
        "rank",
        "aviation_rank",
        "location",
        "place_hierarchies",
        "scope",
        "first_seen",
    ]
]
# Creating a filename for our DataFrame and saving our final DataFrame as a CSV file
if "within" in parameters:
    file_name = (
        f"radius{radius}{radius_unit}_{latitude}_{longitude}_{start_time}_{end_time}"
    )
else:
    # place_ids holds integers, so convert them to strings before joining
    place_ids_str = "_".join(str(place_id) for place_id in place_ids)
    file_name = f"place_ids_{place_ids_str}_{start_time}_{end_time}"
event_df.to_csv(f"data/event_data/{file_name}.csv", index=False)
print(f"DataFrame saved to data/event_data/{file_name}.csv")
DataFrame saved to data/event_data/radius10km_34.07_-118.25_2020-01-01_2020-12-31.csv
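Saving fails if the data/event_data directory does not exist yet, which can happen when the notebook is run outside the cloned repository. The sketch below shows one way to derive the file name from the parameters dictionary and create the directory first. build_file_name and output_path are hypothetical helpers, and a temporary directory stands in for the project root.

```python
import tempfile
from pathlib import Path


def build_file_name(parameters, start_time, end_time):
    """Build the CSV file name from whichever location filter is set."""
    if "within" in parameters:
        # '10km@34.07,-118.25' -> radius '10km', latitude '34.07', longitude '-118.25'
        radius, _, origin = parameters["within"].partition("@")
        latitude, longitude = origin.split(",")
        prefix = f"radius{radius}_{latitude}_{longitude}"
    else:
        ids = "_".join(str(place_id) for place_id in parameters["place__scope"])
        prefix = f"place_ids_{ids}"
    return f"{prefix}_{start_time}_{end_time}"


def output_path(base_dir, file_name):
    """Return the CSV path, creating data/event_data under base_dir if needed."""
    directory = Path(base_dir) / "data" / "event_data"
    directory.mkdir(parents=True, exist_ok=True)
    return directory / f"{file_name}.csv"


name = build_file_name({"within": "10km@34.07,-118.25"}, "2020-01-01", "2020-12-31")
with tempfile.TemporaryDirectory() as base_dir:
    path = output_path(base_dir, name)
    directory_exists = path.parent.is_dir()
print(name)
```

The resulting path can then be passed to event_df.to_csv(path, index=False).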
Appendix: Finding place_id¶
Here is a guide on how to link store locations to a place_id. Here the location could be a city, a state, a country or a continent.
- Query place_id based on location
- Query place_hierarchies based on latitude, longitude
- Query location based on place_id
The full list of parameters that you could use in your query is documented on our Places API page.
PredictHQ uses the GeoNames places convention: https://www.geonames.org/
1) Query place_id based on location¶
By using the PredictHQ Places API, you can find the place_id for a specific location. When you call the API with q set to the location, it returns the matching places with the most relevant first, so taking the top result gives the most relevant place_id for that location.
# Example locations.
locations = ["Los Angeles", "California", "United States", "North America"]
top_results = []
for location in locations:
    response = requests.get(
        url="https://api.predicthq.com/v1/places/",
        headers={
            "Authorization": "Bearer {}".format(ACCESS_TOKEN),
            "Accept": "application/json",
        },
        params={"q": location},
    )
    data = response.json()
    df = pd.json_normalize(data["results"])
    # Keep only the top (most relevant) result for each location
    top_results.append(df.iloc[[0]])
# DataFrame.append was removed in pandas 2.0, so combine the rows with pd.concat
place_id_lookup = pd.concat(top_results, ignore_index=True)
place_id_lookup[["id", "name", "type"]]
id | name | type | |
---|---|---|---|
0 | 5368361 | Los Angeles | locality |
1 | 5332921 | California | region |
2 | 6252001 | United States | country |
3 | 6255149 | North America | continent |
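Taking the first row of the results assumes that the search returned at least one place. As a minimal sketch, the selection can be written so that an empty result is handled explicitly. The response shape (a results list with id, name and type keys) follows the code above; the sample payloads are hand-written for illustration and the second entry is entirely hypothetical.

```python
def top_place(response_json):
    """Return id, name and type of the first result, or None if there is none."""
    results = response_json.get("results", [])
    if not results:
        return None
    first = results[0]
    return {"id": first["id"], "name": first["name"], "type": first["type"]}


# Hand-written sample payloads; only the first entry mirrors the table above.
sample_response = {
    "results": [
        {"id": "5368361", "name": "Los Angeles", "type": "locality"},
        {"id": "0000000", "name": "Hypothetical lower-ranked match", "type": "locality"},
    ]
}
empty_response = {"results": []}

print(top_place(sample_response))
print(top_place(empty_response))  # None
```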
2) Query place_hierarchies based on latitude, longitude¶
By using the PredictHQ Places Hierarchies API, you can find the place_hierarchies for a specific latitude, longitude. When you call the API with location.origin set to the latitude, longitude, it returns the most relevant place_hierarchies.
# Example locations.
latitude_longitudes = [[34.07, -118.25]]
hierarchy_frames = []
for latitude_longitude in latitude_longitudes:
    latitude, longitude = latitude_longitude
    response = requests.get(
        url="https://api.predicthq.com/v1/places/hierarchies",
        headers={
            "Authorization": "Bearer {}".format(ACCESS_TOKEN),
            "Accept": "application/json",
        },
        params={"location.origin": f"{latitude},{longitude}"},
    )
    data = response.json()
    df = pd.DataFrame(data)
    df["latitude"] = latitude
    df["longitude"] = longitude
    hierarchy_frames.append(df)
# DataFrame.append was removed in pandas 2.0, so combine the frames with pd.concat
place_hierarchies_lookup = pd.concat(hierarchy_frames, ignore_index=True)
place_hierarchies_lookup
place_hierarchies | latitude | longitude | |
---|---|---|---|
0 | [6295630, 6255149, 6252001, 5332921, 5368381, 5376252] | 34.07 | -118.25 |
1 | [6295630, 6255149, 6252001, 5332921, 5368381, 5368361] | 34.07 | -118.25 |
For each latitude, longitude pair, the response might include more than one hierarchy. This is because we match the closest place's hierarchy but also include the hierarchy of the closest major city within a radius of 50km. This only applies if the level is below region. If it exists, the major city's hierarchy will always be the second row of the DataFrame.
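The two rows above can be separated in code by relying on the ordering just described. This is an illustrative sketch only: the function names are hypothetical and the hierarchies are copied from the output table above.

```python
def split_hierarchies(place_hierarchies):
    """Return (closest place hierarchy, major city hierarchy or None)."""
    closest = place_hierarchies[0] if place_hierarchies else None
    major_city = place_hierarchies[1] if len(place_hierarchies) > 1 else None
    return closest, major_city


def shared_ancestors(first, second):
    """Return the leading place ids that two hierarchies have in common."""
    shared = []
    for left, right in zip(first, second):
        if left != right:
            break
        shared.append(left)
    return shared


# Hierarchies copied from the output table above.
hierarchies = [
    [6295630, 6255149, 6252001, 5332921, 5368381, 5376252],
    [6295630, 6255149, 6252001, 5332921, 5368381, 5368361],
]

closest, major_city = split_hierarchies(hierarchies)
print(closest[-1])     # 5376252, the closest place
print(major_city[-1])  # 5368361, Los Angeles in the place_id table earlier
print(shared_ancestors(closest, major_city))
```

Both hierarchies share every level except the last, so either can be used for filters at those shared levels.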
3) Query location based on place_id¶
By using the PredictHQ Places API, you can find the location for a specific place_id. When you call the API with id set to the place_id, it returns the matching place, and taking the top result gives the location that the place_id refers to.
# Example place_ids.
place_ids = ["6295630", "6255148", "2510769", "2513413"]
top_results = []
for place_id in place_ids:
    response = requests.get(
        url="https://api.predicthq.com/v1/places/",
        headers={
            "Authorization": "Bearer {}".format(ACCESS_TOKEN),
            "Accept": "application/json",
        },
        # The id could be a comma-separated list of place_ids. In this example, the
        # places are queried one place_id at a time.
        params={"id": place_id},
    )
    data = response.json()
    df = pd.json_normalize(data["results"])
    top_results.append(df.iloc[[0]])
# DataFrame.append was removed in pandas 2.0, so combine the rows with pd.concat
location_lookup = pd.concat(top_results, ignore_index=True)
location_lookup[["id", "name", "type"]]
id | name | type | |
---|---|---|---|
0 | 6295630 | Earth | planet |
1 | 6255148 | Europe | continent |
2 | 2510769 | Spain | country |
3 | 2513413 | Murcia | region |
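Since the id parameter can take a comma-separated list of place_ids, several lookups can be combined into fewer requests. The sketch below only builds the parameter strings. batch_ids is a hypothetical helper, and the batch size is an arbitrary illustrative value, not a documented API limit.

```python
def batch_ids(place_ids, batch_size=10):
    """Group place ids into comma-separated strings of at most batch_size ids."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ids = [str(place_id) for place_id in place_ids]
    return [
        ",".join(ids[start:start + batch_size])
        for start in range(0, len(ids), batch_size)
    ]


example_place_ids = ["6295630", "6255148", "2510769", "2513413"]
batches = batch_ids(example_place_ids, batch_size=3)
for batch in batches:
    # Each string could be passed as params={"id": batch} in a request.
    print(batch)
```

With batching, each response may contain several results, so every row of the results would be kept rather than only the first.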