To experiment with and run the code in our guides, you can download the Jupyter notebook version of the guides from our GitHub repository.

Correlating events to demand - a guide for Data Scientists

Introduction

This guide is for data scientists and technical teams working on correlation scenarios.

It maps out the steps you will need to take to correlate event impact data with your demand data so you can identify how events drive your demand. This enables you to factor these events into your forecasts in advance, leading to more profitable and efficient operations.

This guide builds on concepts that we covered in our getting started guide for data scientists. We recommend you read that guide first.

This guide is written in Jupyter notebook format so you can download and run the code samples to get up and running quickly. We provide sample data files with the guide, along with Python code samples for every topic covered, giving you a step-by-step resource you can adapt to your own data and correlation scenarios.

After running through this guide you should know what our Aggregate Event Impact data is and how to use it. You should have an understanding of an approach for using time series modeling to extract incremental demand. You will be able to correlate aggregate impact data with the incremental demand you have extracted.

In this guide, we use the following open source technologies: pandas, Plotly, and STL (seasonal-trend decomposition, via the rstl package).

See also our YouTube video on correlating events and demand.

Now let's dive in and get started!

Introducing Aggregate Event Impact

The Aggregate Event Impact endpoint is a feature that allows customers to aggregate event impacts for a location in a given time period. Because PHQ Rank™ and Aviation Rank™ are log scale numerical values, simply summing up all the rank values does not deliver a meaningful result.

PredictHQ’s Aggregate Event Impact tool does this calculation for you. It can be used with PHQ Rank™ or Aviation Rank™. It results in a numerical output that can be used directly in modelling.

How our ranks work at a glance:

  • PHQ Rank™ uses factors like attendance information, venue size and related factors to calculate impact. The impact value for the PHQ Rank™ based Aggregate Event Impact reflects the approximate attendance of the events on a given day.
  • Aviation Rank™ identifies an event’s impact based on airline demand. The impact value for the Aviation Rank™ based Aggregate Event Impact is a proxy for passenger numbers travelling for events on a given day.
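
Because both ranks are on a log scale, the point above about not simply summing rank values can be made concrete with a toy example. This is purely illustrative and is not PredictHQ's actual formula:

```python
import math

# Toy illustration (NOT PredictHQ's formula): ranks on a log scale cannot
# simply be summed, because the scale is not linear in attendance.
attendances = [100, 10_000]

# A hypothetical log10-based "rank" for each event.
ranks = [math.log10(a) for a in attendances]

naive_rank_sum = sum(ranks)        # 2.0 + 4.0 = 6.0 -- not meaningful
attendance_sum = sum(attendances)  # 10,100 -- a meaningful aggregate

print(naive_rank_sum, attendance_sum)
```

Aggregate Event Impact performs this kind of aggregation on an attendance-like scale for you, so you never need to do the conversion yourself.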

Aggregate Event Impact is useful for businesses seeking to correlate their demand with verified event data. Once you have established a correlation, you can use it to predict the future impact of events on your business.

The location value for Aggregate Event Impact could be a city, or a latitude and longitude and radius. Use latitude and longitude and a radius around a specific point for finding the impact for a branch of your business like a store, hotel or warehouse. You can use city level or a larger radius if you want to look at impact across multiple physical locations.

To use Aggregate Event Impact:

  • Begin by identifying your location. If you are using a physical bricks and mortar building, you can use the latitude and longitude of the building and a radius around it to look at events. For example, if you wanted to identify event-driven demand for a hotel in Seattle, you might create a 20 mile radius around it and use that as your location.
  • Next, identify the timeframe you want to investigate - whether that’s a day, week, month or more.
  • Choose which rank you want to use to calculate the impact value - PHQ Rank™ or Aviation Rank™.
  • Finally, call our Aggregate Event Impact endpoint to deliver you the values for event impact in that area and timeframe.
  • Remember you can use our data exporter to export a list of events around a location and look at their impact on your business. See our getting started guide for more details on how to do that.
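
The steps above amount to assembling a small set of request inputs before calling the endpoint. The sketch below shows one way to collect them; the parameter names and the location string format are assumptions for illustration, so check PredictHQ's API documentation for the exact request schema:

```python
# Illustrative request inputs for an Aggregate Event Impact query.
# Parameter names and formats are assumptions -- consult the API docs.
params = {
    # 20 mile radius around a point in central Seattle
    "location": "20mi@47.6062,-122.3321",
    # the timeframe to investigate
    "date_range": {"gte": "2019-01-01", "lte": "2019-12-31"},
    # which rank drives the impact value: "phq" or "aviation"
    "rank_type": "phq",
}

print(params)
```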

Let's take an example of how Aggregate Event Impact works:

  • On May 28th, 2019, there were a number of events in Seattle. For example the Northwest Folklife Festival, The Seattle International Film Festival, a Texas Rangers vs Seattle Mariners baseball game, The Georgetown Orbits concert, as well as the Annual Meeting of the Consortium of Multiple Sclerosis Centers and many more.
  • The Aggregate Impact value for the whole of Seattle using PHQ Rank™ on that day is 38,253. This represents the total combined impact of all the events on that day. Please note this is an indicative number, not the exact total of expected attendees.
  • Using this value once it has been correlated with your demand, you will be able to identify how much additional demand to prepare for.

Note, you can choose your location and radius and then filter the aggregate impact down to events around a specific location. You can also filter aggregate impact by values like rank (for example, to include only higher ranked events) or by specific categories, based on what has the most impact on your business.
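
As a minimal sketch of what "correlating with your demand" looks like in code, the snippet below computes a Pearson correlation between an impact series and a demand series. The demand data here is synthetic and constructed to track impact, so the strong correlation is by design; with your own demand data the result will reflect your real relationship to events:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for daily impact and demand (illustrative only).
rng = np.random.default_rng(42)
dates = pd.date_range("2019-01-01", periods=90, freq="D")
impact = pd.Series(rng.uniform(0, 40_000, size=len(dates)), index=dates)

# Demand built from impact plus noise, so correlation is high by construction.
demand = 500 + 0.01 * impact + rng.normal(0, 20, size=len(dates))

correlation = impact.corr(demand)  # Pearson correlation by default
print(f"Pearson correlation: {correlation:.2f}")
```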

Aggregate Event Impact Data

PredictHQ's Aggregate Event Impact endpoint not only includes the overall daily impact for the attendance-based event categories, but also impact and rank levels for individual categories.

Here is a summary of the fields returned by the PHQ Rank™ based Aggregate Event Impact endpoint.

Field Description
date Date of aggregate event impact
impact The total impact of all active events on a given day
count The number of active events on a specific date
categories A map of the categories supported by the endpoint and the count for each category
categories_impact A map of the categories supported by the endpoint and the impact for each category
rank_levels A map of the count of events per rank level returned for a given day
rank_levels_impact A map of the total impact per rank level returned for a given day

More information on aggregate event impact is on PredictHQ's technical documentation.
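
To make the field list above concrete, here is a hypothetical per-day record using those fields, flattened into a pandas DataFrame. The values and the exact response shape are illustrative assumptions (only the total impact of 38,253 comes from the Seattle example in this guide); see PredictHQ's technical documentation for the real schema:

```python
import pandas as pd

# Hypothetical per-day record with the fields described above.
records = [
    {
        "date": "2019-05-28",
        "impact": 38_253,
        "count": 5,
        "categories": {"festivals": 2, "sports": 1, "concerts": 1, "conferences": 1},
        "categories_impact": {"festivals": 20_000, "sports": 10_000,
                              "concerts": 5_000, "conferences": 3_253},
    },
]

# json_normalize flattens the nested maps into dot-separated columns,
# e.g. "categories.sports" and "categories_impact.festivals".
df = pd.json_normalize(records)
df["date"] = pd.to_datetime(df["date"])
print(df.columns.tolist())
```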

To get you up and running quickly, we have exported Aggregate Event Impact (AEI) data into CSV files for use in this guide. These files contain data for Seattle in 2019: we took a geographical point in the center of Seattle and retrieved events within a 20 mile radius of that point. The CSV file used below contains the PHQ Rank™ based AEI.

If you want to query aggregate impact data directly for your own testing you can sign up for a developer plan and use our data exporter or API to retrieve the data. Our data exporter allows you to export data into CSV or JSON format.

In [1]:
import pandas as pd
from rstl import STL

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
In [2]:
%matplotlib inline

Load AEI data for 2019.

In [3]:
# Load AEI data
aei_df = pd.read_csv('https://raw.githubusercontent.com/predicthq/phq-data-science-docs/master/correlation/aei.csv', index_col=0)
aei_df.index = pd.to_datetime(aei_df.index)

In this guide, we will be using the overall daily aggregated event impact across categories.

The data should look like the following:

In [4]:
aei_df.head()
Out[4]:
count impact community_count concerts_count festivals_count performing-arts_count sports_count conferences_count expos_count public-holidays_count ... community_impact concerts_impact festivals_impact performing-arts_impact sports_impact conferences_impact expos_impact public-holidays_impact school-holidays_impact observances_impact
2019-01-01 4.0 12000.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 6000.0 0.0 6000.0 0.0 0.0 0.0 0.0 0.0
2019-01-02 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-03 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-04 2.0 6000.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 6000.0 0.0 0.0 0.0
2019-01-05 3.0 12000.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 6000.0 0.0 6000.0 0.0 0.0 0.0

5 rows × 22 columns

In [5]:
aei_df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 365 entries, 2019-01-01 to 2019-12-31
Data columns (total 22 columns):
count                     365 non-null float64
impact                    365 non-null float64
community_count           365 non-null float64
concerts_count            365 non-null float64
festivals_count           365 non-null float64
performing-arts_count     365 non-null float64
sports_count              365 non-null float64
conferences_count         365 non-null float64
expos_count               365 non-null float64
public-holidays_count     365 non-null float64
school-holidays_count     365 non-null float64
observances_count         365 non-null float64
community_impact          365 non-null float64
concerts_impact           365 non-null float64
festivals_impact          365 non-null float64
performing-arts_impact    365 non-null float64
sports_impact             365 non-null float64
conferences_impact        365 non-null float64
expos_impact              365 non-null float64
public-holidays_impact    365 non-null float64
school-holidays_impact    365 non-null float64
observances_impact        365 non-null float64
dtypes: float64(22)
memory usage: 65.6 KB

Let's visualize the daily aggregate event impact data.

In [6]:
fig = go.Figure([go.Scatter(x=aei_df.index, y=aei_df['impact'], name='aei', line_color='#f2327e')])
fig.update_layout(title='Figure 1: Aggregate Event Impact')
fig.show()
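
Daily AEI series like the one in Figure 1 are often spiky, so it can help to overlay a smoothed version. The sketch below applies a 7-day rolling mean (the window size is an illustrative choice, not a PredictHQ recommendation) to a short synthetic series standing in for aei_df['impact']:

```python
import pandas as pd

# Short synthetic stand-in for aei_df['impact'] (illustrative values).
impact = pd.Series(
    [0, 7_000, 0, 0, 14_000, 0, 0, 21_000],
    index=pd.date_range("2019-01-01", periods=8, freq="D"),
)

# 7-day rolling mean; min_periods=1 keeps the first days from being NaN.
smoothed = impact.rolling(window=7, min_periods=1).mean()
print(smoothed.iloc[-1])
```

The same rolling(...).mean() call can be applied to aei_df['impact'] and added to Figure 1 as a second go.Scatter trace to make the underlying trend easier to see.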