To experiment with and run the code in our guides, you can download the Jupyter notebook versions from our GitHub repository.

Getting started with PredictHQ's data - a guide for Data Scientists

The Getting Started Guide introduces Data Scientists to our intelligent events data and how to start working with it. Many companies work towards identifying the causes of their demand so they can forecast demand more accurately. Demand intelligence is a powerful tool businesses can use to improve their forecasting by factoring in the impact that events have.

This guide focuses on using PredictHQ's verified event data with standard Data Science tools. Note that the Getting Started Guide will not get into the details of using our ranks, correlation and demand forecasting. These topics will be covered in other data science guides. See the frequently asked questions section for more details on our event data, coverage, categories, ranking and other topics.

PredictHQ Event Data

PredictHQ’s event data goes through an extensive process to remove duplicates, spam and bad data so it is as clean as possible for Data Scientists. Our pipeline aggregates, enriches and cleanses the data to provide a high-quality, consistent source of intelligent events data. The following table describes some of the fields in our events data. Only a subset of the event fields was selected for this guide. For more fields see the API documentation.

Field       Description
id          Unique event id
title       Title of the event
start       Start date of the event
end         End date of the event
timezone    Time zone of the event
category    Category of the event
rank        A log scale numerical value between 0 and 100 that represents the potential impact of an event
local_rank  A log scale numerical value between 0 and 100 that represents the potential impact of an event in its geographical location
location    Geo location of the event in GeoJSON order [lon, lat]
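
When events are loaded from a CSV file, the location field may arrive as a string such as "[-122.031746, 47.613601]" rather than as a list. The following is a minimal sketch, assuming that string form, of splitting it into separate columns. The column names lon and lat are illustrative choices and not part of the PredictHQ schema.

```python
import json

import pandas as pd

# Illustrative rows with location stored as text, as a CSV export would hold it.
sample = pd.DataFrame({
    "id": ["22D8XxrLJAZdDGALmJ", "22zjRLwNhD7yVuK9dN"],
    "location": ["[-122.031746, 47.613601]", "[-122.057285, 47.331313]"],
})

# json.loads turns the "[lon, lat]" text into a Python list of two floats.
coords = sample["location"].apply(json.loads)

# GeoJSON order is [lon, lat], so the first value is longitude, the second latitude.
lon_lat = pd.DataFrame(coords.tolist(), columns=["lon", "lat"], index=sample.index)
sample = sample.join(lon_lat)
```

Keeping longitude and latitude as numeric columns makes distance calculations and map plots simpler than working with the raw string.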

This guide uses Python 3 and the data manipulation is performed using Pandas.

Let's get started.

In [1]:
import pandas as pd
import numpy as np
import datetime
import plotly.express as px

To get you up and running quickly, we exported a sample of events into CSV files and linked them to this guide. These files contain data for Seattle. We took a geographical point in the center of Seattle and retrieved events within a specified radius. One CSV file contains all events within 30 miles of the center of Seattle (for PHQ Rank™) and the other contains events within 5 miles of the center (for Local Rank™).
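
The radius selection described above can be approximated locally with the haversine great-circle distance. This is only a sketch: the center coordinates, the helper name haversine_miles and the Earth radius constant are assumptions made for illustration, and the radius search used to produce the CSV files may compute distance differently.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance in miles between points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


# Assumed center of Seattle (hypothetical; the guide does not state the point used).
center_lon, center_lat = -122.3321, 47.6062

points = pd.DataFrame({
    "title": ["Seattle Data Analytics Lunch and Learn", "Noises Off"],
    "lon": [-122.337424, -122.057285],
    "lat": [47.623017, 47.331313],
})

points["miles_from_center"] = haversine_miles(
    points["lon"], points["lat"], center_lon, center_lat
)

# Keep the events that fall inside each radius.
within_5 = points[points["miles_from_center"] <= 5]
within_30 = points[points["miles_from_center"] <= 30]
```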

We will start with the PHQ Rank™ example:

In [2]:
df_events_data = pd.read_csv("https://raw.githubusercontent.com/predicthq/phq-data-science-docs/master/getting-started/PHQ_Rank.csv")

Note: If you wish to try the samples in this guide with different data, you will need a PredictHQ account and an access token to access our events data. You can sign up for a Developer Plan on the PredictHQ website and create an access token by following the Quickstart Guide in the Developer Documentation. Also see the documentation on using Control Center to help you get up and running quickly. The data can be accessed via the API or the Data Exporter Tool.

The data should look like this:

In [3]:
df_events_data.head()
Out[3]:
   id                  title                                   start            end              timezone             category         rank  local_rank  location
0  22D8XxrLJAZdDGALmJ  My Fair Lady                            23/11/2019 0:00  23/11/2019 0:00  America/Los_Angeles  performing-arts  16    42          [-122.031746, 47.613601]
1  22zjRLwNhD7yVuK9dN  Noises Off                              14/12/2019 0:00  14/12/2019 0:00  America/Los_Angeles  performing-arts  16    46          [-122.057285, 47.331313]
2  22zsx4msWs9bfMj5EA  Seattle Data Analytics Lunch and Learn  5/01/2020 0:00   5/01/2020 0:00   America/Los_Angeles  community        11    23          [-122.337424, 47.623017]
3  232QmpoTc2cFMetxwV  Debtors Anonymous                       10/01/2020 0:00  10/01/2020 0:00  America/Los_Angeles  community        11    38          [-122.643811, 47.734308]
4  23No3yBrTzEx22oCo9  PMP Exam Prep Boot Camp in Seattle      28/01/2020 0:00  29/01/2020 0:00  America/Los_Angeles  community        22    32          [-122.347585, 47.611565]

The first step is to look at the general information in the events data.

In [4]:
df_events_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5907 entries, 0 to 5906
Data columns (total 9 columns):
id            5907 non-null object
title         5907 non-null object
start         5907 non-null object
end           5907 non-null object
timezone      5907 non-null object
category      5907 non-null object
rank          5907 non-null int64
local_rank    5907 non-null int64
location      5907 non-null object
dtypes: int64(2), object(7)
memory usage: 415.5+ KB

There are 5907 events, each with a PHQ Rank™ (rank) and a Local Rank™ (local_rank).
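
In the info() output above, start and end are object columns, which means pandas read them as plain strings. The sample rows suggest a day-first layout (for example 23/11/2019 0:00), so a sketch of the conversion, assuming that layout holds for the whole file, could look like this. The duration_days column is a hypothetical name added for illustration.

```python
import pandas as pd

# Rows copied from the sample output, with dates still stored as text.
sample = pd.DataFrame({
    "title": [
        "My Fair Lady",
        "Seattle Data Analytics Lunch and Learn",
        "PMP Exam Prep Boot Camp in Seattle",
    ],
    "start": ["23/11/2019 0:00", "5/01/2020 0:00", "28/01/2020 0:00"],
    "end": ["23/11/2019 0:00", "5/01/2020 0:00", "29/01/2020 0:00"],
})

# An explicit day-first format avoids 5/01/2020 being read as May 1st.
DATE_FORMAT = "%d/%m/%Y %H:%M"
for col in ["start", "end"]:
    sample[col] = pd.to_datetime(sample[col], format=DATE_FORMAT)

# Whole days between start and end; 0 means a single-day event.
sample["duration_days"] = (sample["end"] - sample["start"]).dt.days
```

Passing an explicit format is a deliberate choice here: leaving pandas to guess can silently swap day and month for dates where both readings are valid.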

Next, we will show some statistics about the events:

In [5]:
df_events_data.describe()
Out[5]:
       rank         local_rank
count  5907.000000  5907.000000
mean   26.963772    44.784493
std    14.903086    13.937239
min    11.000000    21.000000
25%    16.000000    34.000000
50%    17.000000    42.000000
75%    38.000000    54.000000
max    90.000000    100.000000
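
Because both ranks run from 0 to 100, it can be convenient to group them into bands before plotting or aggregating. The sketch below uses pandas.cut with five equal-width bands. The band edges and labels are assumptions chosen for illustration and should be checked against the PredictHQ documentation before being relied on.

```python
import pandas as pd

# A few rank values taken from the sample rows, plus the maximum from describe().
ranks = pd.Series([16, 16, 11, 11, 22, 90], name="rank")

# Illustrative band edges (an assumption, not an official PredictHQ definition).
edges = [0, 20, 40, 60, 80, 100]
labels = ["0-20", "21-40", "41-60", "61-80", "81-100"]

# Intervals are closed on the right; include_lowest also admits a rank of 0.
bands = pd.cut(ranks, bins=edges, labels=labels, include_lowest=True)
band_counts = bands.value_counts().sort_index()
```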

Exploratory Analysis of the Events Data

Category

PredictHQ classifies events into 16 different categories that can be broadly divided into two main groups: scheduled and unscheduled events. The major scheduled categories include concerts, conferences, expos, festivals and sports, while the unscheduled categories include airport delays, severe weather and disasters. For our demand forecasting tools (which will be covered in other guides), the event categories we currently focus on are concerts, performing arts, sports, expos, conferences, community and festivals.
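
If you only need the categories used by the demand forecasting tools, a membership filter is enough. This is a sketch: the list name FORECAST_CATEGORIES is hypothetical, the labels follow the hyphenated spelling seen in the sample data, and 'severe-weather' is an assumed label used only to show a row being excluded.

```python
import pandas as pd

# Categories named in the text above, in the spelling used by the sample data.
FORECAST_CATEGORIES = [
    "concerts", "performing-arts", "sports", "expos",
    "conferences", "community", "festivals",
]

events = pd.DataFrame({
    "title": ["My Fair Lady", "Debtors Anonymous", "Hypothetical storm warning"],
    # 'severe-weather' is an assumed label for an unscheduled category.
    "category": ["performing-arts", "community", "severe-weather"],
})

# Keep only the rows whose category is in the forecasting list.
forecast_events = events[events["category"].isin(FORECAST_CATEGORIES)]
```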

To obtain a list of categories in the data use the following call:

In [6]:
df_events_data.category.unique()
Out[6]:
array(['performing-arts', 'community', 'expos', 'concerts', 'festivals',
       'sports', 'conferences'], dtype=object)

The number of events for each category is calculated as follows:

In [7]:
df_category = pd.DataFrame(df_events_data.groupby('category')['id'].count()).reset_index()
df_category.columns = ['category', 'number_of_events']
df_category
Out[7]:
   category         number_of_events
0  community        1644
1  concerts         1959
2  conferences      122
3  expos            58
4  festivals        122
5  performing-arts  1881
6  sports           121
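
The same counts can also be obtained with value_counts, which some readers find easier to scan than the groupby form. The sketch below compares the two on a small made-up frame; the ids and rows are hypothetical and serve only to show that both routes give the same table.

```python
import pandas as pd

# Hypothetical events used only to compare the two approaches.
events = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "category": ["concerts", "community", "concerts", "sports", "community"],
})

# Route 1: group by category and count ids, as in the cell above.
by_groupby = (
    events.groupby("category")["id"].count()
    .reset_index(name="number_of_events")
)

# Route 2: value_counts, sorted by label to match groupby's alphabetical order.
by_value_counts = (
    events["category"].value_counts().sort_index()
    .rename_axis("category")
    .reset_index(name="number_of_events")
)
```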

To show a bar plot of categories use the following code:

In [8]:
fig = px.bar(df_category, x='category', y='number_of_events',
             labels={'number_of_events': 'number of events'}, title='Figure 1')
fig.show()
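
A bar chart is often easier to read when the bars are ordered by size and the counts are also expressed as shares. The sketch below does this using the counts from Out[7]; the names df_counts, share_pct and df_sorted are illustrative.

```python
import pandas as pd

# Counts copied from Out[7] above.
df_counts = pd.DataFrame({
    "category": ["community", "concerts", "conferences", "expos",
                 "festivals", "performing-arts", "sports"],
    "number_of_events": [1644, 1959, 122, 58, 122, 1881, 121],
})

# Share of all events per category, as a percentage rounded to one decimal.
total = df_counts["number_of_events"].sum()
df_counts["share_pct"] = (100 * df_counts["number_of_events"] / total).round(1)

# Largest categories first.
df_sorted = (
    df_counts.sort_values("number_of_events", ascending=False)
    .reset_index(drop=True)
)
```

Passing df_sorted to px.bar in place of df_category should draw the bars in descending order, since the categories are plotted in the order they appear in the data.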