View Jupyter notebook on the GitHub.

Regressors and exogenous data#

20ab965eea8c45c798580514edbba834

This notebook serves as a tutorial for:

Loading regressors to TSDataset
Training and using a model with regressors

Table of Contents

What is regressor?
- What is exogenous data?
Dataset
- Loading Dataset
- EDA
Forecast with regressors

1. What is regressor?#

In previous tutorials, we have shown how to work with target time series.

Target time series is a time series we want to forecast.

But imagine that you have information about the future that can help model with forecasting target time series. It may be information about holidays, weather, recurring events, marketing campaigns, etc. We will call it regressor.

Regressor is a time series that we are not interested in forecasting, however, it may help to forecast the target time series.

To apply an ML model that uses regressors to make more accurate forecasts, we need to know how regressors affected the target time series in the past and information about their values in the future.

What is additional data?#

There is also data that we don’t know in advance. However using it still allows us to make more accurate forecasts. This data we will call additional data. For example, if many users bought a new phone few weeks ago we should expect more support requests on this product.

In order to use additional data in ML models we should create regressors out of them. For example, it could be done with LagTransform or TrendTransform.

In this tutorial we will not look at additional data and will focus on regressors.

2. Dataset#

ETNA allows working with regressor as convenient as with target time series.

We are going to forecast the time series from Tabular Playground Series - Jan 2022. The dataset contains daily merchandise sales – mugs, hats, and stickers – at two imaginary store chains across three Scandinavian countries. As exogenous data, we will use Finland, Norway, and Sweden Weather Data 2015-2019 dataset containing daily country average precipitation, snow depth and air temperature data.

2.1 Loading Dataset#

First, let’s load the data.

[1]:

import pandas as pd
import warnings

warnings.filterwarnings("ignore")

target_df = pd.read_csv("data/nordic_merch_sales.csv")
regressor_df = pd.read_csv("data/nordics_weather.csv")

The next step is converting the data into the ETNA format. Code that allows us to do that is identical for target time series and exogenous data.

[2]:

from etna.datasets import TSDataset

target_df = TSDataset.to_dataset(target_df)
target_df.tail()

[2]:

segment	Finland_KaggleMart_Kaggle Hat	Finland_KaggleMart_Kaggle Mug	Finland_KaggleMart_Kaggle Sticker	Finland_KaggleRama_Kaggle Hat	Finland_KaggleRama_Kaggle Mug	Finland_KaggleRama_Kaggle Sticker	Norway_KaggleMart_Kaggle Hat	Norway_KaggleMart_Kaggle Mug	Norway_KaggleMart_Kaggle Sticker	Norway_KaggleRama_Kaggle Hat	Norway_KaggleRama_Kaggle Mug	Norway_KaggleRama_Kaggle Sticker	Sweden_KaggleMart_Kaggle Hat	Sweden_KaggleMart_Kaggle Mug	Sweden_KaggleMart_Kaggle Sticker	Sweden_KaggleRama_Kaggle Hat	Sweden_KaggleRama_Kaggle Mug	Sweden_KaggleRama_Kaggle Sticker
feature	target	target	target	target	target	target	target	target	target	target	target	target	target	target	target	target	target	target
timestamp
2018-12-27	573	414	177	1068	652	308	898	568	270	1604	1108	436	672	420	196	1127	745	319
2018-12-28	841	499	223	1398	895	431	1162	731	361	2178	1333	662	874	555	260	1540	990	441
2018-12-29	1107	774	296	1895	1398	559	1650	1113	518	2884	1816	874	1106	720	348	2169	1438	596
2018-12-30	1113	757	326	1878	1241	554	1809	1052	500	2851	1935	833	1133	730	336	2138	1303	587
2018-12-31	822	469	238	1231	831	360	1124	728	351	2128	1383	561	823	570	250	1441	1004	388

As you can see, the target ends in 2018, and the exogenous data ends in 2019, so we have prior information a year ahead. This implies that our exogenous data contains only regressors.

[3]:

regressor_df = TSDataset.to_dataset(regressor_df)
regressor_df.tail()

[3]:

segment	Finland_KaggleMart_Kaggle Hat					Finland_KaggleMart_Kaggle Mug					...	Sweden_KaggleRama_Kaggle Mug					Sweden_KaggleRama_Kaggle Sticker
feature	precipitation	snow_depth	tavg	tmax	tmin	precipitation	snow_depth	tavg	tmax	tmin	...	precipitation	snow_depth	tavg	tmax	tmin	precipitation	snow_depth	tavg	tmax	tmin
timestamp
2019-12-27	0.028249	109.550000	-8.529630	-3.161039	-10.895425	0.028249	109.550000	-8.529630	-3.161039	-10.895425	...	0.105079	141.220930	-4.277778	-2.391204	-8.993458	0.105079	141.220930	-4.277778	-2.391204	-8.993458
2019-12-28	0.789266	116.421053	-9.107407	-4.703947	-15.288889	0.789266	116.421053	-9.107407	-4.703947	-15.288889	...	1.117688	142.955224	-3.866667	-3.006542	-11.593056	1.117688	142.955224	-3.866667	-3.006542	-11.593056
2019-12-29	4.976966	117.117647	-0.418519	1.264052	-7.722078	4.976966	117.117647	-0.418519	1.264052	-7.722078	...	1.758669	136.725146	1.755556	3.692056	-4.516204	1.758669	136.725146	1.755556	3.692056	-4.516204
2019-12-30	1.229775	160.500000	2.292593	3.344156	-0.202632	1.229775	160.500000	2.292593	3.344156	-0.202632	...	0.561996	120.740741	4.900000	6.135648	1.859070	0.561996	120.740741	4.900000	6.135648	1.859070
2019-12-31	0.225281	124.647059	-2.859259	1.580519	-6.921569	0.225281	124.647059	-2.859259	1.580519	-6.921569	...	0.848161	131.583333	1.722222	4.376606	-2.290278	0.848161	131.583333	1.722222	4.376606	-2.290278

5 rows × 90 columns

Then we have to create TSDataset with both target time series and exogenous data. TSDataset expects us to put target time series in df argument and exogenous data in df_exog. We should do it because regressors contain information about the target’s future. TSDataset ensures we don’t mix them.

In order to specify the columns of df_exog, which contains regressors, we need to use the known_future parameter. This allows TSDataset to determine which columns are regressors and which columns are additional data.

[4]:

ts = TSDataset(df=target_df, freq="D", df_exog=regressor_df, known_future="all")
ts.head()

[4]:

segment	Finland_KaggleMart_Kaggle Hat						Finland_KaggleMart_Kaggle Mug				...	Sweden_KaggleRama_Kaggle Mug				Sweden_KaggleRama_Kaggle Sticker
feature	precipitation	snow_depth	target	tavg	tmax	tmin	precipitation	snow_depth	target	tavg	...	target	tavg	tmax	tmin	precipitation	snow_depth	target	tavg	tmax	tmin
timestamp
2015-01-01	1.714141	284.545455	520.0	1.428571	2.912739	-1.015287	1.714141	284.545455	329.0	1.428571	...	706.0	3.47	5.415354	0.221569	3.642278	84.924623	324.0	3.47	5.415354	0.221569
2015-01-02	10.016667	195.000000	493.0	0.553571	2.358599	-0.998718	10.016667	195.000000	318.0	0.553571	...	653.0	3.80	5.097244	0.294882	2.414665	67.043702	293.0	3.80	5.097244	0.294882
2015-01-03	3.956061	284.294118	535.0	-1.739286	0.820382	-3.463871	3.956061	284.294118	360.0	-1.739286	...	734.0	1.61	2.140392	-1.776680	0.212793	79.945946	319.0	1.61	2.140392	-1.776680
2015-01-04	0.246193	260.772727	544.0	-7.035714	-3.110828	-9.502581	0.246193	260.772727	332.0	-7.035714	...	657.0	-1.35	-0.648425	-5.173123	0.226833	78.997290	300.0	-1.35	-0.648425	-5.173123
2015-01-05	0.036364	236.900000	378.0	-17.164286	-8.727564	-19.004487	0.036364	236.900000	243.0	-17.164286	...	512.0	-4.27	-3.027451	-9.544488	0.515601	79.736148	227.0	-4.27	-3.027451	-9.544488

5 rows × 108 columns

2.2 EDA#

TSDataset joins exogenous data and the target time series, so we can visualize and analyze exogenous data in the same way as target time series. More information in EDA notebook.

[5]:

ts.plot(column="snow_depth", n_segments=2)

../_images/tutorials_exogenous_data_12_0.png

[6]:

ts.plot(column="precipitation", n_segments=2)

../_images/tutorials_exogenous_data_13_0.png

[7]:

ts.plot(column="target", n_segments=2)

../_images/tutorials_exogenous_data_14_0.png

3. Forecast with regressors#

We will use LinearPerSegmentModel. It is a simple model that works with regressors.

Note: some models do not work with regressors. In this case, they will warn you about it.

We should forecast merchandise sales a year ahead using regressors with information about weather.

[8]:

from etna.models import LinearPerSegmentModel

HORIZON = 365
model = LinearPerSegmentModel()

ETNA allows to configure the transforms to work with exogenous data the same way as they work with the time series. In addition to this, transforms will automatically update information about regressors in TSDataset.

[9]:

from etna.transforms import FilterFeaturesTransform

from etna.transforms import MeanTransform  # math
from etna.transforms import DateFlagsTransform, HolidayTransform  # datetime
from etna.transforms import LagTransform  # lags

transforms = [
    LagTransform(
        in_column="target",
        lags=list(range(HORIZON, HORIZON + 28)),
        out_column="target_lag",
    ),
    LagTransform(in_column="tavg", lags=list(range(1, 3)), out_column="tavg_lag"),
    MeanTransform(in_column="tavg", window=7, out_column="tavg_mean"),
    MeanTransform(
        in_column="target_lag_365",
        out_column="target_mean",
        window=104,
        seasonality=7,
    ),
    DateFlagsTransform(
        day_number_in_week=True,
        day_number_in_month=True,
        is_weekend=True,
        special_days_in_week=[4],
        out_column="date_flag",
    ),
    HolidayTransform(iso_code="SWE", out_column="SWE_holidays"),
    HolidayTransform(iso_code="NOR", out_column="NOR_holidays"),
    HolidayTransform(iso_code="FIN", out_column="FIN_holidays"),
    LagTransform(
        in_column="SWE_holidays",
        lags=list(range(2, 6)),
        out_column="SWE_holidays_lag",
    ),
    LagTransform(
        in_column="NOR_holidays",
        lags=list(range(2, 6)),
        out_column="NOR_holidays_lag",
    ),
    LagTransform(
        in_column="FIN_holidays",
        lags=list(range(2, 6)),
        out_column="FIN_holidays_lag",
    ),
    FilterFeaturesTransform(exclude=["precipitation", "snow_depth", "tmin", "tmax"]),
]

The next steps are literally identical to the situation when we work with target time series only.

[10]:

from etna.pipeline import Pipeline

pipeline = Pipeline(model=model, transforms=transforms, horizon=HORIZON)

[11]:

from etna.metrics import SMAPE

metrics, forecasts, _ = pipeline.backtest(ts, metrics=[SMAPE()], aggregate_metrics=True, n_folds=2)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.5s finished

[12]:

metrics

[12]:

	segment	SMAPE
0	Finland_KaggleMart_Kaggle Hat	6.809976
1	Finland_KaggleMart_Kaggle Mug	7.897876
2	Finland_KaggleMart_Kaggle Sticker	7.566816
3	Finland_KaggleRama_Kaggle Hat	6.714908
4	Finland_KaggleRama_Kaggle Mug	7.443409
5	Finland_KaggleRama_Kaggle Sticker	7.540571
6	Norway_KaggleMart_Kaggle Hat	9.335215
7	Norway_KaggleMart_Kaggle Mug	11.929340
8	Norway_KaggleMart_Kaggle Sticker	11.455042
9	Norway_KaggleRama_Kaggle Hat	8.976252
10	Norway_KaggleRama_Kaggle Mug	11.691445
11	Norway_KaggleRama_Kaggle Sticker	11.594758
12	Sweden_KaggleMart_Kaggle Hat	6.837174
13	Sweden_KaggleMart_Kaggle Mug	7.319936
14	Sweden_KaggleMart_Kaggle Sticker	6.973164
15	Sweden_KaggleRama_Kaggle Hat	6.366672
16	Sweden_KaggleRama_Kaggle Mug	6.994042
17	Sweden_KaggleRama_Kaggle Sticker	7.081337

[13]:

from etna.analysis import plot_backtest

plot_backtest(forecasts, ts)

../_images/tutorials_exogenous_data_23_0.png

Supporting more work strategies for regressors and additional data is a future feature on the ETNA development roadmap.