Handling multi-day and Umbrella events

Overview

PredictHQ’s events data includes events of different durations, from events that last less than an hour to events that last more than a week. For our 7 attended event categories, we expose the actual or predicted attendance in the phq_attendance field. This field works slightly differently depending on the category: for many categories it is the total attendance for an event over its full duration, while for others (like conferences) it reflects the daily attendance.

For example, the phq_attendance for a big event like the 2019 Tour de France is 12,000,000, which represents the total attendance for the full duration of 22 days, not the daily attendance. The daily attendance for that event is closer to 545,000 people (12,000,000 / 22). To avoid overcounting attendees for multi-day events, phq_attendance must be adjusted to the daily level if it is not already daily.

PredictHQ also handles cases where one event (child) belongs to another (parent). This type of event is called an Umbrella event. Umbrella events are often multi-day events but can also be single-day events with multiple sessions if the same attendees are expected, for example the games of a rugby sevens tournament. When looking at daily attendance or event counts it’s important to use either parent events or child events, but not both.

Handling attendance for multi-day events

The Features API has advanced logic for handling multi-day events. For some categories, phq_attendance is the daily attendance. For categories that have multi-day events, such as festivals, community events, expos, and sports, there is additional logic for how phq_attendance is distributed to each day. We encourage customers to use the Features API to find aggregations on attendance like the sum of daily attendance.

Below is an example of how phq_attendance might be distributed for a golf tournament. This is a multi-day sports event, so the phq_attendance of 63,000 is the total attendance across the full duration. The daily attendance is not evenly distributed across the week, as higher attendance is expected on the weekend. The Features API distributes attendance across each day and takes uneven distributions into account.

However, if you still need to calculate daily attendance manually, see the details in the section below. For example, if you download events data into a data lake and calculate features on top of it, you will need to handle attendance for multi-day events yourself.

Using phq_attendance for multi-day events

We define multi-day events as events that overlap more than one calendar day. Single-day events take place within one day, i.e. they start and end on the same calendar day.
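
This definition can be checked in code. The sketch below is a minimal illustration, assuming you already have an event’s start and end as datetime values expressed in the same timezone. The helper name is_multi_day and the example times are hypothetical.

```python
from datetime import datetime


def is_multi_day(start: datetime, end: datetime) -> bool:
    """Return True if the event overlaps more than one calendar day."""
    return end.date() > start.date()


# A three-hour event on one day is a single-day event (times are illustrative).
single = is_multi_day(datetime(2019, 11, 3, 13, 0), datetime(2019, 11, 3, 16, 0))

# An event running from the 1st to the 3rd of November is a multi-day event.
multi = is_multi_day(datetime(2019, 11, 1, 9, 0), datetime(2019, 11, 3, 18, 0))

print(single, multi)  # False True
```

Note that an event ending exactly at midnight touches the next calendar day under this check. Adjust the comparison if that does not match how you want to count days.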

For the categories below, dividing phq_attendance by the total number of days is the simplest logic to adopt. A more refined approach is to weight each day to account for days with more expected attendance (typically Thursday to Sunday).

For example, take an event with a phq_attendance of 63,000 and a 7-day duration. Dividing the attendance equally gives a daily attendance of 9,000 (63,000 / 7) for each day.
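
The equal split can be sketched in a few lines of Python. The function name even_daily_attendance is illustrative and not part of any PredictHQ library.

```python
from typing import List


def even_daily_attendance(phq_attendance: float, num_days: int) -> List[float]:
    """Split a total attendance figure equally across the days of an event."""
    if num_days < 1:
        raise ValueError("num_days must be at least 1")
    return [phq_attendance / num_days] * num_days


daily = even_daily_attendance(63_000, 7)
print(daily[0])  # 9000.0
```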

Category          Using phq_attendance
Concerts          Reflects daily attendance. These events tend to be 1 day or less.
Performing arts   Reflects daily attendance. These events tend to be 1 day or less.
Conferences       Reflects daily attendance, not total attendance. Use the same phq_attendance for each day.
Expos             Divide phq_attendance by the total number of days. Friday, Saturday, and Sunday have greater attendance and can be given a slightly higher weight.
Sports            Divide phq_attendance by the total number of days. Friday, Saturday, and Sunday as well as the final day have greater attendance and can be given a slightly higher weight.
Festivals         Divide phq_attendance by the total number of days. Friday, Saturday, and Sunday have greater attendance and can be given a slightly higher weight.
Community         Divide phq_attendance by the total number of days. Friday, Saturday, and Sunday have greater attendance and can be given a slightly higher weight.
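
The per-category rules above could be sketched as a single function. This is an illustration only. The category labels used as keys are assumed to match the category values in your data, the weights are made-up numbers rather than values used by PredictHQ, and the conference attendance of 500 is a hypothetical figure.

```python
from typing import List, Optional, Sequence

# Assumed labels: categories whose phq_attendance already reflects daily attendance.
DAILY_CATEGORIES = {"concerts", "performing-arts", "conferences"}
# Assumed labels: categories whose phq_attendance is a total over the full duration.
TOTAL_CATEGORIES = {"expos", "sports", "festivals", "community"}


def daily_attendance(
    category: str,
    phq_attendance: float,
    num_days: int,
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Return one attendance figure per day of the event."""
    if num_days < 1:
        raise ValueError("num_days must be at least 1")
    if category in DAILY_CATEGORIES:
        # Use the same phq_attendance for each day.
        return [float(phq_attendance)] * num_days
    if category not in TOTAL_CATEGORIES:
        raise ValueError(f"no attendance rule for category {category!r}")
    if weights is None:
        weights = [1.0] * num_days  # simplest logic: equal split
    if len(weights) != num_days:
        raise ValueError("weights must have one entry per day")
    total_weight = sum(weights)
    # Each day receives its share of the total in proportion to its weight.
    return [phq_attendance * w / total_weight for w in weights]


# Hypothetical weights for a Monday-to-Sunday event: Fri, Sat, Sun count 1.5x.
week_weights = [1, 1, 1, 1, 1.5, 1.5, 1.5]
weighted = daily_attendance("sports", 63_000, 7, week_weights)
conference = daily_attendance("conferences", 500, 3)
```

Normalizing by the sum of the weights keeps the daily figures adding up to the original phq_attendance, whatever weights you choose.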

Umbrella Events (beta)

Umbrella events refer to the case where we have a parent event that contains one or more child events. For example, the United States Formula 1 Grand Prix in 2019 has child events for the qualification, 3 practice events, a concert that occurs at the Grand Prix, and the actual race event (there are 12 child events in total). The parent event is for the entire Grand Prix that runs from the 1st of November to the 3rd of November 2019. Both the parent and child events are part of the wider Umbrella event.

Note

Umbrella events are currently in beta. PredictHQ currently has coverage for high-impact Umbrella events only. Up to the end of 2021, we will be increasing our coverage of Umbrella events. At this time we encourage our customers to start using them and to pull through updates as we update the events (see staying updated).

Child events are indicated by the presence of the parent_event field. Child events will have a parent_event_id in this field indicating the id of the parent event. For example, the Formula 1 race child event is 5uRg7CqGu7DTtu4Rfk and the Formula 1 parent event is w7dYyrFwTUQGYE6euv. The Formula 1 race child event has the following parent event info:

    "relevance": 0.0,
    "id": "5uRg7CqGu7DTtu4Rfk",
    "parent_event": {
        "parent_event_id": "w7dYyrFwTUQGYE6euv"
    },
    "title": "Formula 1 2019 - United States Grand Prix 2019 - Race",
    "description": "",
    "category": "sports",

Note

In the August 2021 release of Umbrella events, you cannot yet find the child event IDs of a parent event via the API. This feature will be supported in a future release.
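
If you already hold a local copy of events (for example, an export), the parent_event field described above is enough to tell child events apart and to group them by parent. The sketch below is an illustration: the helper names are hypothetical, and it assumes that events without a parent either omit the parent_event field or have it set to null. It can only find child events that are present in your copy.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def get_parent_event_id(event: dict) -> Optional[str]:
    """Return the parent event id for a child event, or None otherwise."""
    parent = event.get("parent_event") or {}
    return parent.get("parent_event_id")


def group_children_by_parent(events: List[dict]) -> Dict[str, List[str]]:
    """Map each parent event id to the ids of its child events."""
    children: Dict[str, List[str]] = defaultdict(list)
    for event in events:
        parent_id = get_parent_event_id(event)
        if parent_id is not None:
            children[parent_id].append(event["id"])
    return dict(children)


events = [
    {"id": "w7dYyrFwTUQGYE6euv"},
    {
        "id": "5uRg7CqGu7DTtu4Rfk",
        "parent_event": {"parent_event_id": "w7dYyrFwTUQGYE6euv"},
    },
]
print(group_children_by_parent(events))
# {'w7dYyrFwTUQGYE6euv': ['5uRg7CqGu7DTtu4Rfk']}
```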

Why Umbrella events matter

When using attended events in demand forecasting, a common approach is to look at the total attendance for events around a location per day. If you don’t take Umbrella events into account, you can end up counting both the attendance of the parent events and the attendance of the child events, which leads to inflated attendance figures and can reduce the accuracy of your forecasts.

Looking at the earlier 2019 US F1 Grand Prix example, the parent event spanning 3 days has a phq_attendance of 258,000. Dividing the attendance of the parent event by 3 gives a daily attendance of 86,000. The actual race event, running for around 3 hours on the 3rd of November, has a phq_attendance of 120,000. If you count both the race event and the daily attendance for the parent event, you get 86,000 + 120,000 = 206,000, which is more than the real attendance on the 3rd, so you overcount.
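
The same arithmetic can be written as a small sketch. The daily_attendance key is a hypothetical field holding each event’s attendance for the 3rd of November; it is not a PredictHQ field.

```python
events_on_nov_3 = [
    # Parent event: total attendance split equally over its 3 days.
    {"id": "w7dYyrFwTUQGYE6euv", "daily_attendance": 258_000 / 3},
    # Race child event: takes place entirely on the 3rd of November.
    {
        "id": "5uRg7CqGu7DTtu4Rfk",
        "daily_attendance": 120_000,
        "parent_event": {"parent_event_id": "w7dYyrFwTUQGYE6euv"},
    },
]

# Naive sum counts the parent and the child event together.
naive = sum(e["daily_attendance"] for e in events_on_nov_3)

# Parent-only sum skips every event that has a parent_event field.
parent_only = sum(
    e["daily_attendance"] for e in events_on_nov_3 if not e.get("parent_event")
)

print(naive)        # 206000.0
print(parent_only)  # 86000.0
```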

Note

In this example the child event attendance is different from the parent attendance divided by the number of days. Child event attendance may sometimes reflect more detailed attendance on the individual days of an event.

Another example of why you should use Umbrella events can be seen when looking at the daily attendance for events in Las Vegas in 2019. In the example below there is the World Rugby Sevens tournament from the 1st of March 2019 to the 3rd of March 2019. The parent event is for the entire tournament and there are many child events for individual games and rounds in the tournament. By not accounting for Umbrella events you get a massive spike in attendance at that time. A peak of 1.4 million is seen around the 2nd of March because both the parent event and child events are being counted.

Once you take into account Umbrella events and remove double counting, the real attendance on that day is closer to 400,000.

How to use Umbrella events to avoid inflating attendance figures

When you use the Features API to get attendance-based aggregations like the sum of attendance, the API takes Umbrella events into account and avoids overcounting attendance. If you are downloading a copy of events data and calculating attendance yourself, you need to remove child events and use parent events for calculating attendance. If the parent event is a multi-day event, apply the logic described in the multi-day section above.

For example, if you export a JSONL file in Control Center, the Python code below removes the child events.

# The example below takes a file exported using the data exporter
# in Control Center and removes all the child events.

import json

filepath_to_jsonl = "your-export-file.jsonl"

# Each non-empty line of a JSONL file is one JSON-encoded event.
with open(filepath_to_jsonl, encoding="utf-8") as f:
    events = [json.loads(line) for line in f if line.strip()]

# Child events carry a parent_event field; keep only events without one.
events_without_parent = [event for event in events if not event.get("parent_event")]
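
Going one step further, the filtering and the multi-day logic can be combined into per-day totals. The sketch below is a rough illustration under several assumptions: each event dictionary carries start and end as ISO 8601 strings plus category and phq_attendance, the category labels match your data, totals are split equally across days, and calendar days are taken directly from the timestamps without converting to the event’s local timezone. The sample timestamps and the parent event’s category are illustrative values.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta
from typing import Dict, List

# Assumed labels: categories whose phq_attendance is a total over the full duration.
TOTAL_CATEGORIES = {"expos", "sports", "festivals", "community"}


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def event_days(event: dict) -> List[date]:
    """List every calendar day the event overlaps."""
    first = parse_ts(event["start"]).date()
    last = parse_ts(event["end"]).date()
    span = max((last - first).days, 0)  # guard against end before start
    return [first + timedelta(days=i) for i in range(span + 1)]


def daily_totals(events: List[dict]) -> Dict[date, float]:
    """Sum attendance per day, skipping child events."""
    totals: Dict[date, float] = defaultdict(float)
    for event in events:
        if event.get("parent_event"):
            continue  # child event: its parent is counted instead
        attendance = event.get("phq_attendance")
        if attendance is None:
            continue  # no attendance figure to add
        days = event_days(event)
        if event.get("category") in TOTAL_CATEGORIES:
            attendance = attendance / len(days)  # equal split of the total
        for day in days:
            totals[day] += attendance
    return dict(totals)


sample = [
    {"id": "w7dYyrFwTUQGYE6euv", "category": "sports", "phq_attendance": 258000,
     "start": "2019-11-01T00:00:00Z", "end": "2019-11-03T23:59:59Z"},
    {"id": "5uRg7CqGu7DTtu4Rfk", "category": "sports", "phq_attendance": 120000,
     "start": "2019-11-03T13:00:00Z", "end": "2019-11-03T16:00:00Z",
     "parent_event": {"parent_event_id": "w7dYyrFwTUQGYE6euv"}},
]
totals = daily_totals(sample)
```

With the sample above, each of the three days receives 86,000 from the parent event and the race child event is ignored.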

An alternative is to use child events and filter out parent events. However, we do not currently recommend this approach. PredictHQ has a very large volume of events data from hundreds of sources, and that data changes frequently: events are canceled or postponed, locations shift, and other details are updated. Because of this, we cannot guarantee that we will have a complete set of child events for every parent event. For example, some parent events for a conference could have a child event for each day of the conference while others may not.

For this reason, we suggest using the parent event and filtering out the child events when aggregating attendance or counts of events or creating other features that could double count parent and child events.

Definitions

  • Parent event - Spans the full duration of an event and may have child events as part of it. Many parent events are multi-day events such as the Olympics, a Formula 1 weekend, or a multi-day festival. For example, there is a parent event for the entire 2020 Olympic Games in Tokyo. Other examples include an event for the entire US Formula 1 Grand Prix or a rugby sevens tournament.

  • Child events - Individual events that are part of a parent event. Examples include day 1 of the 2020 Olympic Games, the “Men’s 100m finals” in the Olympic Games, and the Formula 1 qualification and practice events.