Snowflake Data Science Guide
Transforming Event Data into ML-Ready Features in Snowflake
This guide and the Features API
This guide shows how to generate machine-learning-ready features from PredictHQ's intelligent event data in Snowflake, similar to the output of the Features API. It is not intended to match the Features API's comprehensive results: where possible, our primary recommendation is to use the Features API. For more information on the Features API, please go to this page, and for a more detailed guide on using it for machine learning, please see the Get ML Features guide.
If you don't know which is the best for you, please reach out!
Overview
This guide assumes you have access to PredictHQ’s events data in Snowflake via a Snowflake data share. It shows how to generate aggregations similar to those documented in the Features API list of available features. These features provide daily aggregated data for all events happening around a location - for example, the total number of people predicted to attend events near that location on a given day. The goal is to generate aggregated features that can be used in demand forecasting.
Snowflake's ease of use and integrations have made it a popular cloud data warehouse, and it is one of the ways PredictHQ's intelligent event data can be accessed for various use cases. In this guide, machine-learning-ready, daily-level aggregated event data is generated on a per-location basis, designed to resemble the data provided by the PredictHQ Features API and to be ready to add to a training set as quickly as possible.
Requirements:
1. A list of locations with their respective latitude & longitude.
2. Access to PredictHQ's intelligent event data via Snowflake.
3. An understanding of SQL and Snowflake's SQL interface.
The process to follow:
1. Get the Suggested Radius for each location
2. Create the input table filled with locations
3. Set the date range that you want to retrieve data for
4. Choose which method to follow
5. Use the output in machine learning demand forecasting models or for other applications
Get the Suggested Radius for each Location
When querying events at the location level, a common way to retrieve them is with a latitude, longitude, and radius, returning the events within a given area. But a common gap is knowing what radius to use when searching for events. Enter the Suggested Radius API.
The Suggested Radius API returns a radius that can be used to find attended events around a given location and takes into account a number of different factors like population density, the surrounding street network, and the industry vertical of the location.
As a first step, we recommend getting the suggested radius for each location before moving on with the guide. Below is an example of how to query the Suggested Radius API for a list of locations using the PredictHQ Python SDK. Note: this code will need to be run in a separate environment from Snowflake.
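As a sketch of what that lookup can look like, the snippet below calls the Suggested Radius endpoint directly with Python's standard library rather than the SDK. The endpoint URL and parameter names (`location.origin`, `radius_unit`, `industry`) follow PredictHQ's public documentation, but treat them as assumptions to verify against your account; `ACCESS_TOKEN` and the `industry` value are placeholders.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.predicthq.com/v1/suggested-radius/"

def build_radius_query(lat: float, lon: float, radius_unit: str = "mi",
                       industry: str = "retail") -> str:
    """Build the query string for the Suggested Radius API.

    Parameter names follow PredictHQ's public docs; the industry
    value is illustrative -- use the vertical that fits your locations.
    """
    return urllib.parse.urlencode({
        "location.origin": f"{lat},{lon}",
        "radius_unit": radius_unit,
        "industry": industry,
    })

def fetch_suggested_radius(token: str, lat: float, lon: float,
                           radius_unit: str = "mi") -> dict:
    """Call the API and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{API_URL}?{build_radius_query(lat, lon, radius_unit)}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example locations from the input table further below.
locations = [("store1-chicago", 41.81310, -87.65860),
             ("Hyde Park", 51.50736, -0.16411)]

# ACCESS_TOKEN is a placeholder for your PredictHQ API token:
# for name, lat, lon in locations:
#     result = fetch_suggested_radius("ACCESS_TOKEN", lat, lon)
#     print(name, result.get("radius"), result.get("radius_unit"))
```

The same loop works with the PredictHQ Python SDK if you prefer it; only the request-building differs.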
For more information on the Suggested Radius API, visit our documentation.
Create Input table used in both methods
To be able to reference the locations later in the guide, a table of locations (called SAVED_LOCATIONS) is required as input. Fill this table with your locations in the following format:
| location | latitude | longitude | radius | radius_unit | date_start | date_end |
| --- | --- | --- | --- | --- | --- | --- |
| store1-chicago | 41.81310 | -87.65860 | … | mi | 2023-07-01 | 2023-12-31 |
| Hyde Park | 51.50736 | -0.16411 | … | mi | 2024-01-01 | 2024-03-31 |
| store10-new-york | 40.73061 | -73.93524 | … | … | … | … |
location: a unique identifier for the location.
latitude/longitude: it is recommended to include 5 decimal places.
radius: the value returned from the Suggested Radius API.
radius_unit: coded for either “km” (kilometers) or “mi” (miles).
date_start/date_end: the date range for the data to be returned. Can be changed.
The table above is the input used throughout this guide. Note the datatypes of each column when creating the table.
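One possible DDL for this input table is sketched below. The datatypes are a reasonable choice rather than a requirement, and the radius values in the INSERT are placeholders - use the values returned by the Suggested Radius API for your locations.

```sql
-- A possible definition for the SAVED_LOCATIONS input table.
CREATE OR REPLACE TABLE SAVED_LOCATIONS (
    LOCATION     VARCHAR,       -- unique identifier for the location
    LATITUDE     NUMBER(10, 5), -- 5 decimal places recommended
    LONGITUDE    NUMBER(10, 5),
    RADIUS       FLOAT,         -- value from the Suggested Radius API
    RADIUS_UNIT  VARCHAR(2),    -- 'km' or 'mi'
    DATE_START   DATE,
    DATE_END     DATE
);

-- Illustrative rows; the radius values (2.0) are placeholders.
INSERT INTO SAVED_LOCATIONS VALUES
    ('store1-chicago', 41.81310, -87.65860, 2.0, 'mi', '2023-07-01', '2023-12-31'),
    ('Hyde Park',      51.50736, -0.16411,  2.0, 'mi', '2024-01-01', '2024-03-31');
```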
By default, 3 months of historical data is returned. If the model is being trained, we recommend, at minimum, two years of historical data, but this can be changed as needed. If you are forecasting for a future period then the date range should reflect the period you are forecasting for - e.g. the next 2 weeks.
Once the input table is in the format above, the below code reshapes it into a day-by-day version of the input called SAVED_LOCATIONS_DAILY:
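One way to do this reshaping in Snowflake is with a row generator, as sketched below. The ROWCOUNT of 3650 (roughly ten years) is an arbitrary upper bound on any single date range, and the output column name LOCAL_DATE is an assumption - adjust both to your needs.

```sql
-- Expand SAVED_LOCATIONS into one row per location per day.
CREATE OR REPLACE TABLE SAVED_LOCATIONS_DAILY AS
SELECT
    l.LOCATION,
    l.LATITUDE,
    l.LONGITUDE,
    l.RADIUS,
    l.RADIUS_UNIT,
    DATEADD('day', g.N, l.DATE_START) AS LOCAL_DATE
FROM SAVED_LOCATIONS l
JOIN (
    -- Generate offsets 0..3649; raise ROWCOUNT for longer ranges.
    SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1 AS N
    FROM TABLE(GENERATOR(ROWCOUNT => 3650))
) g
  ON DATEADD('day', g.N, l.DATE_START) <= l.DATE_END;
```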
Choose which Method to use
There are two methods for generating the features:
1. Call the Features API with Python and save the output ML features into Snowflake.
2. Use SQL in Snowflake to run over the events table and create the ML features.
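To illustrate the shape of the SQL method, the sketch below sums predicted attendance for events within each location's radius, per day. The events table and its column names (EVENTS, PHQ_ATTENDANCE, EVENT_START_DATE, and so on) are assumptions - match them to the schema of your Snowflake data share. Snowflake's HAVERSINE function returns kilometers, so radii in miles are converted first.

```sql
-- Sketch only: table and column names are assumptions to verify
-- against your data share's schema.
CREATE OR REPLACE TABLE ML_FEATURES_FOR_LOCATIONS AS
SELECT
    d.LOCATION,
    d.LOCAL_DATE,
    SUM(e.PHQ_ATTENDANCE) AS PHQ_ATTENDANCE_SUM
FROM SAVED_LOCATIONS_DAILY d
JOIN EVENTS e
  ON d.LOCAL_DATE BETWEEN e.EVENT_START_DATE AND e.EVENT_END_DATE
 AND HAVERSINE(d.LATITUDE, d.LONGITUDE, e.LATITUDE, e.LONGITUDE)
     -- HAVERSINE returns km; convert a radius given in miles.
     <= IFF(d.RADIUS_UNIT = 'mi', d.RADIUS * 1.60934, d.RADIUS)
GROUP BY d.LOCATION, d.LOCAL_DATE;
```

The full method also splits the sum by event category and rank band; this sketch shows only the core join-and-aggregate pattern.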
Integrating PredictHQ Features into Your Demand Forecasting Model
The ML_FEATURES_FOR_LOCATIONS table offers ready-to-use features for forecasting models. Merge these features with current demand data using the location and date as keys. Train models using historical data from this table and use future data for forecasting.
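The merge itself is a join on those two keys, sketched below. DAILY_DEMAND and its columns are hypothetical - substitute your own demand table - and the feature column names depend on which method you followed.

```sql
-- Join event features onto a (hypothetical) daily demand table.
SELECT
    d.LOCATION,
    d.DEMAND_DATE,
    d.UNITS_SOLD,
    f.*                    -- the generated event features
FROM DAILY_DEMAND d
LEFT JOIN ML_FEATURES_FOR_LOCATIONS f
  ON f.LOCATION   = d.LOCATION
 AND f.LOCAL_DATE = d.DEMAND_DATE;
```

A LEFT JOIN keeps demand rows for days with no nearby events; fill the resulting NULL feature values with zero before training.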
For more information on creating machine learning-ready features from PredictHQ’s intelligent event data, check out the Get ML Features guide or this blog on Enhancing Demand Forecasting with PredictHQ and PowerBI: A Technical Exploration. While these resources discuss the Features API, remember to use the ML_FEATURES_FOR_LOCATIONS table instead.