SQL Method Guide

Transforming Event Data into ML-Ready Features using SQL

Using Snowflake SQL to recreate the Features API

This guide uses a publicly available PredictHQ event sample table called PREDICTHQ_EVENTS_RETAIL_LONDON. Replace this table name in all instances below with the name of the events data table that has been provisioned by PredictHQ via the Snowflake Secure Data Share.

The rest of the guide uses temporary tables, but these can be made permanent as needed.

Once SAVED_LOCATIONS has been created as per the parent page of this guide, the following steps are required, each broken out in a block below:

  1. Modify the input table format for use with the code in this guide

  2. Generate daily aggregated statistics for each location:

    • attendance-based features

    • rank-based features

    • impact-based features

    • The date range in the following code examples should be updated based on the use case:

      • For training a machine learning model, update the dates to cover the historical period needed for the locations

      • If running a model in production and forecasting future demand, update the dates to the visibility window of the forecast - e.g. the next week, month, or months

  3. Join all features together in a single output table

Step 1: Modify the input table format

Once the SAVED_LOCATIONS input table is created, the code below reshapes it into a day-by-day table called SAVED_LOCATIONS_DAILY.
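
The following is a minimal sketch of this step, assuming SAVED_LOCATIONS has the columns LOCATION, LAT, LON, and RADIUS_KM (adjust these names to match your table) and using an illustrative 2024 date range:

CREATE OR REPLACE TEMPORARY TABLE SAVED_LOCATIONS_DAILY AS
WITH date_spine AS (
    -- One row per day; adjust the start date and ROWCOUNT for your date range
    SELECT DATEADD(DAY, ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1, '2024-01-01'::DATE) AS DATE
    FROM TABLE(GENERATOR(ROWCOUNT => 366))
)
SELECT
    d.DATE,
    l.LOCATION,
    l.LAT,
    l.LON,
    l.RADIUS_KM
FROM SAVED_LOCATIONS l
CROSS JOIN date_spine d;

Cross-joining the locations against a generated date spine yields one row per location per day, which the feature queries in Step 2 aggregate against.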

Step 2: Calculating Daily Aggregated ML Features with SQL in Snowflake

Each feature set is calculated in its own block; see the column names in each code block below for the features that can be generated.

PHQ Attendance Features

Values are calculated as the sum of predicted attendance for the day at a given location within the defined radius.
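
The following is a sketch of one way to build these columns, using conditional SUMs over events that overlap each day and fall within each location's radius. The events-table column names (CATEGORY, PHQ_ATTENDANCE, EVENT_START, EVENT_END, LAT, LON) and the list of categories shown are assumptions; check them against your provisioned schema and PredictHQ's attendance-based categories:

CREATE OR REPLACE TEMPORARY TABLE PHQ_ATTENDANCE_FEATURES AS
SELECT
    d.DATE,
    d.LOCATION,
    -- One column per attendance-based category (illustrative list)
    SUM(CASE WHEN e.CATEGORY = 'community'       THEN e.PHQ_ATTENDANCE ELSE 0 END) AS PHQ_ATTENDANCE_COMMUNITY,
    SUM(CASE WHEN e.CATEGORY = 'concerts'        THEN e.PHQ_ATTENDANCE ELSE 0 END) AS PHQ_ATTENDANCE_CONCERTS,
    SUM(CASE WHEN e.CATEGORY = 'conferences'     THEN e.PHQ_ATTENDANCE ELSE 0 END) AS PHQ_ATTENDANCE_CONFERENCES,
    SUM(CASE WHEN e.CATEGORY = 'expos'           THEN e.PHQ_ATTENDANCE ELSE 0 END) AS PHQ_ATTENDANCE_EXPOS,
    SUM(CASE WHEN e.CATEGORY = 'festivals'       THEN e.PHQ_ATTENDANCE ELSE 0 END) AS PHQ_ATTENDANCE_FESTIVALS,
    SUM(CASE WHEN e.CATEGORY = 'performing-arts' THEN e.PHQ_ATTENDANCE ELSE 0 END) AS PHQ_ATTENDANCE_PERFORMING_ARTS,
    SUM(CASE WHEN e.CATEGORY = 'sports'          THEN e.PHQ_ATTENDANCE ELSE 0 END) AS PHQ_ATTENDANCE_SPORTS
FROM SAVED_LOCATIONS_DAILY d
LEFT JOIN PREDICTHQ_EVENTS_RETAIL_LONDON e
    ON d.DATE BETWEEN e.EVENT_START::DATE AND e.EVENT_END::DATE
    -- HAVERSINE returns distance in km; keep only events inside the location's radius
    AND HAVERSINE(d.LAT, d.LON, e.LAT, e.LON) <= d.RADIUS_KM
GROUP BY d.DATE, d.LOCATION;

The LEFT JOIN keeps days with no events, which the ELSE 0 turns into an attendance of 0.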

If metrics other than SUM are desired, use the code below as a template for each column. The category name in each column's code (defaulted to ‘community’ in these examples) will need to be replaced with the PHQ Attendance feature category you intend to call. Refer to the column code above for the available feature categories.
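
For example, MAX and AVG variants of the ‘community’ column might look like the following sketch (the _MAX and _AVG suffixes are illustrative naming choices; the ELSE 0 is dropped so days with no matching events return NULL rather than skewing the aggregate):

    -- Swap SUM for the desired aggregate and 'community' for the desired category
    MAX(CASE WHEN e.CATEGORY = 'community' THEN e.PHQ_ATTENDANCE END) AS PHQ_ATTENDANCE_COMMUNITY_MAX,
    AVG(CASE WHEN e.CATEGORY = 'community' THEN e.PHQ_ATTENDANCE END) AS PHQ_ATTENDANCE_COMMUNITY_AVG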

PHQ Rank Features

Values are calculated as a count of events occurring at each rank level, per day, per location. If an event spans multiple days, it is counted on each day until it ends. Each rank level is returned as its own column.
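
A sketch using Snowflake's COUNT_IF, assuming PHQ_RANK holds the 0-100 rank value and that the five rank levels correspond to 20-point bands (verify the band boundaries against PredictHQ's rank-level definitions):

CREATE OR REPLACE TEMPORARY TABLE PHQ_RANK_FEATURES AS
SELECT
    d.DATE,
    d.LOCATION,
    -- One column per rank level; band boundaries are assumed
    COUNT_IF(e.PHQ_RANK BETWEEN  0 AND 20)  AS PHQ_RANK_LEVEL_1,
    COUNT_IF(e.PHQ_RANK BETWEEN 21 AND 40)  AS PHQ_RANK_LEVEL_2,
    COUNT_IF(e.PHQ_RANK BETWEEN 41 AND 60)  AS PHQ_RANK_LEVEL_3,
    COUNT_IF(e.PHQ_RANK BETWEEN 61 AND 80)  AS PHQ_RANK_LEVEL_4,
    COUNT_IF(e.PHQ_RANK BETWEEN 81 AND 100) AS PHQ_RANK_LEVEL_5
FROM SAVED_LOCATIONS_DAILY d
LEFT JOIN PREDICTHQ_EVENTS_RETAIL_LONDON e
    ON d.DATE BETWEEN e.EVENT_START::DATE AND e.EVENT_END::DATE
    AND HAVERSINE(d.LAT, d.LON, e.LAT, e.LON) <= d.RADIUS_KM
GROUP BY d.DATE, d.LOCATION;

Because the join matches every day between EVENT_START and EVENT_END, a multi-day event is counted on each day it runs.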

PHQ Impact Features

Values are calculated as the MAX of the ranks of events occurring on each day, showing the highest-ranked Severe Weather event of each type per day.
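
A sketch, assuming severe weather events carry CATEGORY = 'severe-weather' and a PHQ_LABELS array holding the weather sub-type; both these column names and the sub-types shown are assumptions to check against your provisioned schema:

CREATE OR REPLACE TEMPORARY TABLE PHQ_IMPACT_FEATURES AS
SELECT
    d.DATE,
    d.LOCATION,
    -- One column per severe-weather type; NULL on days with no such event
    MAX(CASE WHEN ARRAY_CONTAINS('air-quality'::VARIANT, e.PHQ_LABELS) THEN e.PHQ_RANK END) AS PHQ_IMPACT_SEVERE_WEATHER_AIR_QUALITY,
    MAX(CASE WHEN ARRAY_CONTAINS('blizzard'::VARIANT, e.PHQ_LABELS)    THEN e.PHQ_RANK END) AS PHQ_IMPACT_SEVERE_WEATHER_BLIZZARD,
    MAX(CASE WHEN ARRAY_CONTAINS('flood'::VARIANT, e.PHQ_LABELS)       THEN e.PHQ_RANK END) AS PHQ_IMPACT_SEVERE_WEATHER_FLOOD
FROM SAVED_LOCATIONS_DAILY d
LEFT JOIN PREDICTHQ_EVENTS_RETAIL_LONDON e
    ON e.CATEGORY = 'severe-weather'
    AND d.DATE BETWEEN e.EVENT_START::DATE AND e.EVENT_END::DATE
    AND HAVERSINE(d.LAT, d.LON, e.LAT, e.LON) <= d.RADIUS_KM
GROUP BY d.DATE, d.LOCATION;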

If metrics other than MAX are desired, use the code below as a template for each column. The weather_category name in the code (defaulted to ‘air-quality’ in these examples) will need to be replaced with the feature you intend to call. Refer to the column code above for the available weather_category features.
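
For example, a COUNT variant of the ‘air-quality’ column might look like the following sketch (the _COUNT suffix is an illustrative naming choice):

    -- Swap MAX for the desired aggregate and 'air-quality' for the desired weather_category
    COUNT(CASE WHEN ARRAY_CONTAINS('air-quality'::VARIANT, e.PHQ_LABELS) THEN e.PHQ_RANK END) AS PHQ_IMPACT_SEVERE_WEATHER_AIR_QUALITY_COUNT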

Step 3: Final Select for all Features

The following code will pull all of the features generated above into a single table called ML_FEATURES_FOR_LOCATIONS. This output is intended to be used directly by machine learning models. If unsure which features to use, it is recommended to create a Beam analysis for the locations and leverage the category importance results with the “View ML Features” option (see here).
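
A sketch of the final join, using the illustrative table names from the Step 2 sketches above (PHQ_ATTENDANCE_FEATURES, PHQ_RANK_FEATURES, PHQ_IMPACT_FEATURES):

CREATE OR REPLACE TABLE ML_FEATURES_FOR_LOCATIONS AS
SELECT
    a.*,
    -- EXCLUDE drops the duplicated join keys from the other feature tables
    r.* EXCLUDE (DATE, LOCATION),
    i.* EXCLUDE (DATE, LOCATION)
FROM PHQ_ATTENDANCE_FEATURES a
JOIN PHQ_RANK_FEATURES r
    ON a.DATE = r.DATE AND a.LOCATION = r.LOCATION
JOIN PHQ_IMPACT_FEATURES i
    ON a.DATE = i.DATE AND a.LOCATION = i.LOCATION;

Because every feature table was built from the same SAVED_LOCATIONS_DAILY spine, inner joins on DATE and LOCATION keep exactly one row per location per day.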

The output table should look like this (Note: the example below only shows the first 3 feature columns):

DATE        LOCATION   PHQ_ATTENDANCE_COMMUNITY  PHQ_ATTENDANCE_CONCERTS  PHQ_ATTENDANCE_CONFERENCES
2024-01-01  Hyde Park  68                        1,839                    1,578
2024-01-02  Hyde Park  0                         469                      126
2024-01-03  Hyde Park  200                       346                      139
2024-01-04  Hyde Park  0                         2,029                    324
2024-01-05  Hyde Park  120                       691                      238

Refer Back to Main Guide
