Loading Event Data into a Data Warehouse
Take PredictHQ API data and load it into a data warehouse, using GCP BigQuery as the example.
This guide outlines the process for integrating PredictHQ's Events data into your data lake. It is common for customers to want to get data from our APIs and store it in their data lake, so we have created this guide to help with the integration process.
We use Google Cloud Platform (GCP) as our primary example, but the methods for structuring the data table and making PredictHQ Events API calls can be applied to various data management systems. This guide is designed for users with a basic understanding of GCP or similar data warehousing solutions and the necessary permissions to use them.
Please note that this guide focuses exclusively on integrating PredictHQ’s Events API data and does not cover the Features API.
If you are using Snowflake or AWS Data Exchange (ADX) for your data lake, PredictHQ integrates with those products. See Receive Data via Snowflake and Receive Data via AWS Data Exchange for details.
Overview
This guide details the data structure of our Events API within a data warehouse environment and outlines two methodologies for creating an Events data table in GCP.
Throughout this guide, we will use the fictional example from our Filtering Guide, "Tom’s Pizzeria". This scenario will give us a practical illustration of the methods described and demonstrate how to tailor API queries to specific business needs. For an understanding of how parameters were selected for this example, refer to the Filtering Guide.
Requirements:
Access to PredictHQ Data
JSONL: Requires a PredictHQ account. Sign up here if you don’t already have one.
API: An API Access Token is necessary for accessing the data programmatically.
GCP permissions:
Ensure you have "BigQuery Data Owner" and "BigQuery User" permissions over the BigQuery environment.
A GCP BigQuery Service Account and its corresponding JSON key are required. See the GCP documentation for details.
Scenario: Tom’s Pizzeria
In this guide, we'll explore a hypothetical use case for “Tom’s Pizzeria”, a chain of restaurants with locations throughout the US, headquartered in Seattle, Washington. Tom is interested in understanding how local events might influence his business operations and customer flow. Tom’s inventory system, website, and tools run on GCP. Tom wants to use event data in his staffing and inventory management systems to anticipate the demand caused by events, and to show upcoming events near each store to his staff. To do this he needs to load events into his data lake.
For comprehensive details on selecting appropriate filters for this scenario, refer to our Filtering Guide. It explains which events could impact Tom's business and how to configure the data queries accordingly. For our load into GCP, we want to bring through a larger set of events and then filter down to a specific pizzeria location in BigQuery once loaded. See the Querying the Loaded Data section for more.
Tom's Data Parameters:
Attended Categories: community, conferences, concerts, expos, festivals, performing-arts, sports.
Date Range: Events active within the range June 1, 2024, to June 30, 2024.
Event Rank: Events with a rank greater than 30, indicating a significant likelihood of impacting local traffic and attendance.
Event Status: Both ‘active’ and ‘predicted’ events to ensure a comprehensive overview.
Location: Bring through all of Seattle first, and Tom can filter for his locations once it’s in BigQuery using our Suggested Radius API.
For the purposes of this guide, we have limited the example load to a single city for Tom to filter on. In an actual load, you can bring through as much data as you have access to or require. Data warehouse customers often pull down all the data they have access to into their data warehouse and then query it for the locations and data relevant to their applications.
Data Load Methods
There are several methods available for integrating PredictHQ data with GCP BigQuery or other data warehouse solutions. This guide outlines two primary approaches, both of which produce the same data structure and can be used interchangeably. Regardless of the method chosen for the initial data load, ongoing updates will require API code.
JSONL File Upload Method: This method provides a straightforward, code-free approach to data upload, by exporting data from PredictHQ’s Control Center web application. We recommend using JSONL uploads for the initial population of your data lake, especially in cases where there is a substantial volume of data, such as multiple years of historical data. Subsequent updates should be managed through API calls to ensure the data remains current.
API Connection Method: Connect to the PredictHQ Events API to download the latest data into your data lake. This guide walks through the Python API connection method for GCP, but similar steps apply to other programming languages. Events data is dynamic: events are canceled, postponed, shifted to new locations, or otherwise updated, and new events are announced all the time. We recommend having a process that updates your data on a regular frequency, such as daily.
This guide will walk you through the initial data load, providing you with the tools and understanding necessary to create a robust connection to GCP. You'll learn how to structure your data effectively within your warehouse, setting the stage for potential automation and regular updates that you can implement as needed.
Table Data Structure
Regardless of the method chosen for initial data creation and loading, the table structure remains consistent. This ensures that both methods are interchangeable, delivering data in a uniform format. The data structure for the table is detailed below:
Column | Type | Mode
id | STRING | REQUIRED
parent_event | JSON | NULLABLE
title | STRING | NULLABLE
alternate_titles | JSON | NULLABLE
description | STRING | NULLABLE
category | STRING | NULLABLE
labels | JSON | NULLABLE
phq_labels | JSON | NULLABLE
rank | INTEGER | NULLABLE
local_rank | INTEGER | NULLABLE
phq_attendance | INTEGER | NULLABLE
entities | JSON | NULLABLE
duration | INTEGER | NULLABLE
start | TIMESTAMP | NULLABLE
start_local | TIMESTAMP | NULLABLE
end | TIMESTAMP | NULLABLE
end_local | TIMESTAMP | NULLABLE
predicted_end | TIMESTAMP | NULLABLE
predicted_end_local | TIMESTAMP | NULLABLE
updated | TIMESTAMP | NULLABLE
first_seen | TIMESTAMP | NULLABLE
timezone | STRING | NULLABLE
location | JSON | NULLABLE
geo | JSON | NULLABLE
scope | STRING | NULLABLE
country | STRING | NULLABLE
place_hierarchies | JSON | NULLABLE
state | STRING | NULLABLE
private | BOOLEAN | NULLABLE
impact_patterns | JSON | NULLABLE
predicted_event_spend | INTEGER | NULLABLE
predicted_event_spend_industries | JSON | NULLABLE
JSONL File Upload Method
This method is recommended for large initial uploads, as it handles large volumes of data more efficiently than direct API calls.
Search Control Center and Export JSONL
To locate and export the relevant event data into a JSONL file, we utilize the PredictHQ Control Center Search. This tool allows for precise querying of events based on specific criteria, ensuring that you retrieve only the most relevant information for your needs. For detailed instructions on optimizing your search with the Control Center, please refer to this guide on Control Center search capabilities.
Typically, you would download all the data you have access to into your data warehouse. In that case, run a search for all events and export them.
In the full scenario for Tom's Pizzeria, Tom would download events for all of the US and then query for specific locations. Many customers bulk load all the data they have access to by exporting it all and then importing it into their data lake. To keep this example small, we’ll download events only for Seattle.
To do that, we searched for Seattle in the Control Center for the relevant period, status, and attended categories.
After configuring your filters and executing the search, select the Export option. For more details on exporting, see the CSV Export Guide, but select the JSONL file format instead of CSV. The resulting JSONL file can then be uploaded directly to your BigQuery setup, as detailed in the next section.
Create a Table via JSONL Upload
Setting up a BigQuery table with a JSONL file is a straightforward process, provided you have the necessary permissions on GCP. Before beginning, ensure you are clear about which dataset will host the data. Here are the steps to create the table once you have found and highlighted the dataset in GCP BigQuery:
Click Create Table: Navigate to the dataset you wish to create the table in, click the hamburger menu and select “Create Table”.
Select the File Location: Select the JSONL export file that you downloaded to your computer.
Name the Table: Give the new table a suitable name.
Manually Define Schema: This step involves specifying the schema details manually. You must accurately define each column, ensuring that the datatype and column names precisely match those in the Table Data Structure. Any discrepancies in spelling or datatype will lead to errors during the upload process. While you have the flexibility to modify the schema by adding or removing columns based on your specific data requirements, this guide focuses on the fields we recommend including.
Advanced Options: Expand the Advanced Options and tick the “Unknown Values” checkbox. This setting allows the system to gracefully handle missing information in specific columns of some records, ensuring that rows with incomplete data are not rejected and do not throw errors during the upload process.
Create the Table: Click the "Create Table" button to finalize the creation.
This method initializes your BigQuery table from a JSONL export and is well suited to bulk data loads. However, it does not support ongoing data refreshes; see the Keep Event Data Updated section for advice on setting up a regularly updated table after initialization. If you prefer to script the upload rather than use the console, see the sketch below.
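As a minimal sketch, the google-cloud-bigquery Python client can load the same export file programmatically. The project, dataset, table, and file names below are placeholders, and the table is assumed to already exist with the schema above:

```python
from google.cloud import bigquery

# Sketch: load an exported JSONL file into an existing BigQuery table.
client = bigquery.Client.from_service_account_json("service_account.json")
table_ref = f"{client.project}.predicthq.events"  # placeholder dataset and table names

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    ignore_unknown_values=True,  # mirrors the "Unknown Values" checkbox in the console
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

with open("predicthq_events_export.jsonl", "rb") as jsonl_file:
    load_job = client.load_table_from_file(jsonl_file, table_ref, job_config=job_config)

load_job.result()  # wait for the load job to complete
print(f"Loaded {load_job.output_rows} rows into {table_ref}")
```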
API Connection Method
This method works well for smaller initial loads and is required for continuous updates to the table, as recommended in our Keep Event Data Updated section.
Table Creation Code
To establish the required data structure in BigQuery, you can utilize the following Python script. This script explicitly defines the columns and data types as laid out in our Table Data Structure, configuring them precisely as needed for your BigQuery table. Before executing this script, ensure you have the following prerequisites:
SERVICE_ACCOUNT_JSON: This is your service account JSON key, which is typically stored in a secure file. If your organization uses a different method to handle service account keys, please modify the code accordingly.
dataset_id: Specify whether this is a new dataset or an existing one in which you want to place this table.
table_id: Determine a name for your new PredictHQ data table in BigQuery.
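A minimal sketch of such a table-creation script, assuming the google-cloud-bigquery client library (the dataset and table names below are placeholders; adjust them and the key handling to your environment):

```python
from google.cloud import bigquery

SERVICE_ACCOUNT_JSON = "service_account.json"  # path to your service account JSON key
dataset_id = "predicthq"                       # new or existing dataset
table_id = "events"                            # name for the PredictHQ events table

client = bigquery.Client.from_service_account_json(SERVICE_ACCOUNT_JSON)

# Column definitions matching the Table Data Structure above (mode defaults to NULLABLE).
schema = [
    bigquery.SchemaField("id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("parent_event", "JSON"),
    bigquery.SchemaField("title", "STRING"),
    bigquery.SchemaField("alternate_titles", "JSON"),
    bigquery.SchemaField("description", "STRING"),
    bigquery.SchemaField("category", "STRING"),
    bigquery.SchemaField("labels", "JSON"),
    bigquery.SchemaField("phq_labels", "JSON"),
    bigquery.SchemaField("rank", "INTEGER"),
    bigquery.SchemaField("local_rank", "INTEGER"),
    bigquery.SchemaField("phq_attendance", "INTEGER"),
    bigquery.SchemaField("entities", "JSON"),
    bigquery.SchemaField("duration", "INTEGER"),
    bigquery.SchemaField("start", "TIMESTAMP"),
    bigquery.SchemaField("start_local", "TIMESTAMP"),
    bigquery.SchemaField("end", "TIMESTAMP"),
    bigquery.SchemaField("end_local", "TIMESTAMP"),
    bigquery.SchemaField("predicted_end", "TIMESTAMP"),
    bigquery.SchemaField("predicted_end_local", "TIMESTAMP"),
    bigquery.SchemaField("updated", "TIMESTAMP"),
    bigquery.SchemaField("first_seen", "TIMESTAMP"),
    bigquery.SchemaField("timezone", "STRING"),
    bigquery.SchemaField("location", "JSON"),
    bigquery.SchemaField("geo", "JSON"),
    bigquery.SchemaField("scope", "STRING"),
    bigquery.SchemaField("country", "STRING"),
    bigquery.SchemaField("place_hierarchies", "JSON"),
    bigquery.SchemaField("state", "STRING"),
    bigquery.SchemaField("private", "BOOLEAN"),
    bigquery.SchemaField("impact_patterns", "JSON"),
    bigquery.SchemaField("predicted_event_spend", "INTEGER"),
    bigquery.SchemaField("predicted_event_spend_industries", "JSON"),
]

# Create the dataset if needed, then the table with the schema above.
client.create_dataset(f"{client.project}.{dataset_id}", exists_ok=True)
table = bigquery.Table(f"{client.project}.{dataset_id}.{table_id}", schema=schema)
client.create_table(table, exists_ok=True)
print(f"Created table {client.project}.{dataset_id}.{table_id}")
```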
It is advisable to manage this table creation code separately from other data processing scripts to maintain clarity and ease of updates.
This script sets up the initial table structure within BigQuery, providing a foundation for subsequent data loads.
Extract from Events API
This section outlines the process of querying the PredictHQ Events API using Python, using the example for Tom's Pizzeria. This approach is designed to ensure the data extracted is directly relevant to Tom’s operational needs. The methodology and rationale behind the data extraction parameters used below are explained in the Scenario section of this guide.
Before initiating the script, ensure you have configured the following prerequisites:
phq_access_token: Insert your PredictHQ access token here. For enhanced security, consider referencing the token indirectly if you prefer not to hard-code it into your script.
SERVICE_ACCOUNT_JSON: This is your Google Cloud Platform service account JSON key, which should be securely stored. Adjust the access method in the script if your organization's practices differ.
dataset_id: Designate whether this is a new or existing dataset in which you plan to house the PredictHQ data.
table_id: Assign a name to your new PredictHQ data table in BigQuery.
params: Modify these parameters as needed to align with the data you intend to extract from PredictHQ.
This Python script fetches the necessary data from the Events API, looping through the paginated response and pulling all results and columns. The next section covers transforming this data before it is loaded. For more details on any of the parameters used in the code below, see our Events API documentation, keeping in mind that our Scenario pulls attended events in Seattle for Tom.
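A minimal sketch of this extraction loop, assuming the requests library; the Seattle place ID and the exact parameter values shown are illustrative assumptions to adjust against the Events API documentation and your own filters:

```python
import requests

PHQ_ACCESS_TOKEN = "REPLACE_WITH_YOUR_ACCESS_TOKEN"
EVENTS_URL = "https://api.predicthq.com/v1/events/"

HEADERS = {
    "Authorization": f"Bearer {PHQ_ACCESS_TOKEN}",
    "Accept": "application/json",
}

# Query parameters for the Tom's Pizzeria scenario. The Seattle place ID
# (Geonames 5809844) and the exact parameter choices are assumptions for
# illustration; adjust them to match your own filters.
params = {
    "category": "community,conferences,concerts,expos,festivals,performing-arts,sports",
    "active.gte": "2024-06-01",
    "active.lte": "2024-06-30",
    "rank.gte": 30,
    "state": "active,predicted",
    "place.scope": "5809844",
    "limit": 100,  # page size
}

def fetch_all_events(query):
    """Loop through the paginated Events API response and collect all results."""
    events = []
    url, current_params = EVENTS_URL, query
    while url:
        response = requests.get(url, headers=HEADERS, params=current_params)
        response.raise_for_status()
        payload = response.json()
        events.extend(payload["results"])
        # The "next" link already carries the query parameters for the next page.
        url, current_params = payload.get("next"), None
    return events

raw_events = fetch_all_events(params)
print(f"Fetched {len(raw_events)} events")
```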
Transform API Responses
To ensure seamless compatibility with the BigQuery table structure, we apply a transformation function to the data retrieved from the Events API in the previous step. The function performs essential adjustments, including:
Handling null or empty responses to maintain data integrity.
Restricting output to only the columns defined in our table structure, for consistency and relevance.
Formatting complex fields to JSON, suitable for BigQuery ingestion.
Append this transformation code to the extraction code from above. After successful data extraction and transformation, the processed data is ready for loading into BigQuery.
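A sketch of what such a transformation function might look like, assuming the raw_events list from the extraction sketch above; the exact field handling in the original script may differ:

```python
# Columns defined in the Table Data Structure.
TABLE_COLUMNS = [
    "id", "parent_event", "title", "alternate_titles", "description", "category",
    "labels", "phq_labels", "rank", "local_rank", "phq_attendance", "entities",
    "duration", "start", "start_local", "end", "end_local", "predicted_end",
    "predicted_end_local", "updated", "first_seen", "timezone", "location", "geo",
    "scope", "country", "place_hierarchies", "state", "private", "impact_patterns",
    "predicted_event_spend", "predicted_event_spend_industries",
]

def transform_events(events):
    """Restrict each event to the table's columns and normalize empty values."""
    rows = []
    for event in events:
        row = {}
        for column in TABLE_COLUMNS:
            value = event.get(column)
            # Treat missing or empty values as NULL to keep the data consistent.
            row[column] = None if value in (None, "", [], {}) else value
        rows.append(row)
    return rows

# Nested dicts and lists (entities, geo, impact_patterns, and so on) are left as
# Python objects here; the load step serializes each row to newline-delimited
# JSON, which places them into the table's JSON columns.
rows_to_load = transform_events(raw_events)
```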
Load Data into the Table
Once the data has been successfully extracted from the API and transformed to meet our schema requirements, the next step involves loading this data into the previously established BigQuery table. This process utilizes Python code integrated with the BigQuery API to load the data.
Below is the code block to append to the end of the extraction and transformation script. It includes a basic retry mechanism to handle occasional upload failures, which are common in network-related operations. Depending on your requirements for reliability and data integrity, you might consider implementing more advanced retry logic.
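A minimal sketch of this load step with a simple retry loop, assuming the google-cloud-bigquery client, the rows_to_load output from the transformation sketch, and the placeholder dataset and table names used earlier:

```python
import time
from google.cloud import bigquery

SERVICE_ACCOUNT_JSON = "service_account.json"
dataset_id = "predicthq"
table_id = "events"

client = bigquery.Client.from_service_account_json(SERVICE_ACCOUNT_JSON)
table_ref = f"{client.project}.{dataset_id}.{table_id}"

# Reuse the existing table's schema so the load job does not try to autodetect one.
job_config = bigquery.LoadJobConfig(
    schema=client.get_table(table_ref).schema,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Basic retry loop to handle occasional transient upload failures.
max_attempts = 3
for attempt in range(1, max_attempts + 1):
    try:
        load_job = client.load_table_from_json(rows_to_load, table_ref, job_config=job_config)
        load_job.result()  # wait for the load job to finish
        print(f"Loaded {load_job.output_rows} rows into {table_ref}")
        break
    except Exception as error:
        if attempt == max_attempts:
            raise
        print(f"Load attempt {attempt} failed ({error}); retrying...")
        time.sleep(5 * attempt)
```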
With this step completed, the data from PredictHQ Events API is now populated into your BigQuery table and is ready for analytical querying. This setup initially caters to a single load of data; however, to maintain the relevance and timeliness of your data, consider adapting this script to periodically update the dataset based on changes reflected in the "updated" timestamp column of the source data. See the section below on updating your data.
Below is the full code, combining all three parts into one executable script.
Keep Event Data Updated
Event data is dynamic and events can change frequently, as events are canceled, postponed, or have details updated. Also, PredictHQ’s pipeline is constantly fetching new events, so new future events are always being added and can be downloaded via the API.
To keep your data updated, see Keep Data Updated via API. Use code similar to that above with the ‘updated’ parameter to filter for recently changed events; this will extract all new events and all updates to existing events. Check for events updated since your last table update using the ‘updated’ timestamp column. You will need to write code to update and replace the data in BigQuery according to your preferred data update standards, but the structure will be the same as outlined above.
We recommend running a daily update process (such as a cron job) that calls the PredictHQ API and updates the data in your data lake.
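For example, a daily job might reuse the extraction and transformation sketches above with an updated.gte filter; the parameter name and merge approach shown here are assumptions to confirm against the Events API documentation and your own update standards:

```python
from datetime import datetime, timedelta, timezone

# Fetch only events added or changed since the last run (here: the last 24 hours).
since = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%S")

update_params = dict(params)            # reuse the base filters from the extraction sketch
update_params["updated.gte"] = since    # only recently added or changed events
# Remember to move the active.gte/active.lte date window forward for ongoing runs.

changed_events = fetch_all_events(update_params)
changed_rows = transform_events(changed_events)

# Load changed_rows into a staging table, then MERGE into the main table on the
# "id" column (or delete and re-insert matching ids) according to your own
# data update standards.
print(f"{len(changed_rows)} events added or updated since {since}")
```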
Querying the Loaded Data
Once the data is successfully loaded into BigQuery, you can begin querying it to derive insights relevant to your use case. This section provides an example of a BigQuery SQL query tailored to Tom’s scenario as outlined earlier. While the initial data load might have utilized specific filters via the Control Center or API parameters, it's often useful to perform additional queries directly within BigQuery. This capability is particularly valuable if you have loaded a broader dataset and need to perform dynamic or complex filtering post-load.
Some common fields to query are:
category: filter down to the relevant categories most likely impacting your industry
start and end: filter on events whose start and end dates fall within a specified time period, such as the next week, month, or longer.
rank: commonly used to filter out smaller events. Filter where rank is greater than or equal to a chosen threshold.
geo: This field contains GeoJSON data (see here for more details) on the event's location. For attended events, this field typically holds the latitude and longitude of the point at which the event is occurring. It can also hold polygon information for events that cover a wide area, like marathons or severe weather events; for marathons, the polygon shows the route. Query on this field to find all events within an area, such as a radius around a location.
See the Filtering Guide for more examples. The SQL example below shows how to query these fields in the database.
Below is a sample BigQuery SQL query that aligns with the parameters specified for our example. This query filters events based on the categories, date range, event rank, and geographical proximity to Tom’s location.
This type of query finds all events within a specified radius of a business location, such as a restaurant, hotel, store, or parking garage. Once you have the data loaded into GCP, you will want to understand how events are impacting your locations; run a query like the one below for each location to get all the events around it.
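A sketch of such a query, run here through the BigQuery Python client. The pizzeria coordinates, the 2-mile radius, and the JSON path into the geo field are illustrative assumptions, and point-geometry events are assumed for the distance check:

```python
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("service_account.json")

# Placeholder coordinates for one Seattle pizzeria and a radius that would come
# from the Suggested Radius API.
sql = f"""
SELECT id, title, category, rank, phq_attendance, start_local, end_local
FROM `{client.project}.predicthq.events`
WHERE category IN ('community', 'conferences', 'concerts', 'expos',
                   'festivals', 'performing-arts', 'sports')
  AND `start` <= TIMESTAMP('2024-06-30 23:59:59')
  AND `end` >= TIMESTAMP('2024-06-01 00:00:00')
  AND rank >= 30
  AND ST_DWITHIN(
        ST_GEOGPOINT(
          CAST(JSON_VALUE(geo, '$.geometry.coordinates[0]') AS FLOAT64),  -- longitude
          CAST(JSON_VALUE(geo, '$.geometry.coordinates[1]') AS FLOAT64)   -- latitude
        ),
        ST_GEOGPOINT(-122.3321, 47.6062),  -- pizzeria longitude, latitude
        2 * 1609.34                        -- 2-mile radius converted to meters
      )
ORDER BY rank DESC
"""

for row in client.query(sql).result():
    print(row.id, row.title, row.category, row.rank)
```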
This query will retrieve records that meet all the specified criteria, allowing Tom to identify events that could potentially influence the operations and traffic at his pizzeria in Seattle. Modify the query to fit the specific fields and data types of your table if they differ from this example, and fill in the latitude and longitude of your locations with a radius suggested by our Suggested Radius API.
Visually, this type of query pulls all the events within a radius, as shown in the image below:
For example, customers often look at events occurring in the next 1 to 3 months and display this information in their application, a BI tool, or other types of products and tools. A common approach is to keep a table of your business locations with the latitude and longitude of each, call the Suggested Radius API for each location to store its radius, and then look up your store locations in that table. For example, you may have a table of locations like the one below:
location | latitude | longitude | radius | radius_unit | start_date | end_date
store1-chicago | 41.8131 | -87.6586 | 4.11 | mi | 2023-07-01 | 2023-12-31
Hyde Park | 51.50736 | -0.16411 | 2.06 | mi | 2024-01-01 | 2024-03-31
store10-new-yor | 40.73061 | -73.93524 | ... | ... | ... | ...
You can look up location details from this table (to find the lat/long for a store for example) and then use it to find events with SQL like that shown in the example above. See our Snowflake Data Science Guide for an example of doing this in Snowflake.
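A sketch of populating such a locations table by calling the Suggested Radius API for each store; the endpoint parameters and industry value shown are assumptions to confirm against the Suggested Radius API documentation, and the store list is a placeholder for your own locations table:

```python
import requests

PHQ_ACCESS_TOKEN = "REPLACE_WITH_YOUR_ACCESS_TOKEN"
SUGGESTED_RADIUS_URL = "https://api.predicthq.com/v1/suggested-radius/"
HEADERS = {"Authorization": f"Bearer {PHQ_ACCESS_TOKEN}", "Accept": "application/json"}

# Placeholder store locations; in practice these would come from your own
# locations table in BigQuery.
stores = [
    {"name": "store1-chicago", "lat": 41.8131, "lon": -87.6586},
    {"name": "store2-seattle", "lat": 47.6062, "lon": -122.3321},
]

for store in stores:
    response = requests.get(
        SUGGESTED_RADIUS_URL,
        headers=HEADERS,
        params={
            "location.origin": f"{store['lat']},{store['lon']}",
            "radius_unit": "mi",
            "industry": "restaurants",  # assumed value; check the API docs for valid options
        },
    )
    response.raise_for_status()
    suggestion = response.json()
    # Store the suggested radius and unit alongside the location, then use them
    # in the radius query shown earlier.
    print(store["name"], suggestion.get("radius"), suggestion.get("radius_unit"))
```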
Using BigQuery for these queries ensures that you leverage powerful, scalable SQL analytics over large datasets, enabling responsive decision-making based on the latest event data available in your system.
Utilizing PredictHQ Data in Your Data Warehouse
Having integrated PredictHQ's rich events data into your data warehouse, the opportunities to leverage this data are extensive. By now, you've successfully set up your data structure within Google Cloud Platform's BigQuery and have a solid understanding of the JSONL File Upload and API Connection methods. Here’s how you can maximize the value of PredictHQ data within your organization:
1. Cross-referencing with Internal Datasets
Enhance the granularity and relevance of your internal analytics by cross-referencing PredictHQ events data with your own datasets. For instance, you can correlate sales data with event occurrences to analyze the impact of local events on sales performance. This cross-analysis can be crucial for demand forecasting and strategic planning.
Customers sometimes use fields like Placekey to join events data and other location data.
2. Demand Analysis and Forecasting
PredictHQ data can significantly enhance your demand forecasting models, especially for businesses that are impacted by local events, such as retail, hospitality, and transportation. By understanding when significant events are happening, you can better predict and prepare for attendance surges or declines.
You can use Event data from your data warehouse in your demand forecast to improve forecast accuracy. See our Snowflake Data Science Guide for an example of how you can implement ML features for demand forecasting in a data warehouse. Although that example shows how to do this in Snowflake, a similar approach applies to other data warehouses. See also Improving Demand Forecasting Models with Event Features.
3. Building Tailored Reports
Use the data within BigQuery to create detailed reports and dashboards that monitor the effects of events on your business operations. These reports can provide actionable insights to business users across your organization, from marketing teams planning campaigns around major events to supply chain management preparing for increased activity.
See Use Events Data in Power BI for an example of building reports in Power BI. You can connect Power BI or other BI tools to your database to build dashboards and reports. See also How to Connect PowerBI to Google BigQuery.
4. Enhancing Customer Experience
Inform your customers about local events that might impact their experience with your service or product. For example, a transportation company could provide passengers with real-time updates about events that might affect travel times or service availability.
5. Event-Driven Marketing
Plan and execute marketing campaigns that align with upcoming events to capitalize on increased foot traffic or digital engagement. This targeted approach can improve marketing ROI by reaching audiences when they are most receptive.
By tapping into the detailed and predictive insights provided by PredictHQ, your business can not only anticipate the impact of external events but also strategize proactively to harness potential opportunities or mitigate risks. Whether through enhancing predictive analytics, refining customer engagement strategies, or driving operational efficiencies, PredictHQ's events data serves as a powerful tool in your data-driven decision-making arsenal.