Regressors and exogenous data

This notebook serves as a tutorial for:

  • Loading regressors to TSDataset

  • Training and using a model with regressors

Table of Contents

  • What is regressor?

    • What is additional data?

  • Dataset

    • Loading Dataset

    • EDA

  • Forecast with regressors

1. What is regressor?

In previous tutorials, we have shown how to work with target time series.

A target time series is a time series we want to forecast.

But imagine that you have information about the future that can help the model forecast the target time series. It may be information about holidays, weather, recurring events, marketing campaigns, etc. We will call such information a regressor.

A regressor is a time series that we are not interested in forecasting but that may help to forecast the target time series.

To apply an ML model that uses regressors to make more accurate forecasts, we need to know how the regressors affected the target time series in the past, and we need their values in the future.

What is additional data?

There is also data that we don’t know in advance. However, using it still allows us to make more accurate forecasts. We will call this data additional data. For example, if many users bought a new phone a few weeks ago, we should expect more support requests for this product.

To use additional data in ML models, we should create regressors out of it. For example, this can be done with LagTransform or TrendTransform.
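
As a rough illustration of the idea behind a lag-based regressor, the sketch below uses plain pandas rather than ETNA. The series name `phone_sales`, its values, and the horizon of 2 days are made up for the example. Shifting the series by at least the forecast horizon produces a column whose values are already known for every future timestamp we want to predict.

```python
import pandas as pd

HORIZON = 2  # hypothetical forecast horizon, in days

# Additional data: observed only up to the last known day (2024-01-05).
history_index = pd.date_range("2024-01-01", periods=5, freq="D")
phone_sales = pd.Series([10, 12, 9, 15, 20], index=history_index, name="phone_sales")

# Extend the index over the forecast horizon; the future values are unknown (NaN).
full_index = pd.date_range("2024-01-01", periods=5 + HORIZON, freq="D")
extended = phone_sales.reindex(full_index)

# Lagging by HORIZON moves known past values onto the future timestamps.
phone_sales_lag = extended.shift(HORIZON)

# The lagged column has no gaps over the horizon, so it can act as a regressor.
future = phone_sales_lag.loc[full_index[-HORIZON:]]
print(future)
```

A lag smaller than the horizon would leave NaN values on some of the future timestamps, so the resulting column could not be used as a regressor for them.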

In this tutorial we will not look at additional data and will focus on regressors.

2. Dataset

ETNA makes working with regressors as convenient as working with target time series.

We are going to forecast the time series from Tabular Playground Series - Jan 2022. The dataset contains daily merchandise sales (mugs, hats, and stickers) at two imaginary store chains across three Scandinavian countries. As exogenous data, we will use the Finland, Norway, and Sweden Weather Data 2015-2019 dataset, which contains daily country-average precipitation, snow depth, and air temperature.

2.1 Loading Dataset

First, let’s load the data.

[1]:
import pandas as pd
import warnings

warnings.filterwarnings("ignore")

target_df = pd.read_csv("data/nordic_merch_sales.csv")
regressor_df = pd.read_csv("data/nordics_weather.csv")

The next step is converting the data into the ETNA format. The code that does this is identical for target time series and exogenous data.

[2]:
from etna.datasets import TSDataset

target_df = TSDataset.to_dataset(target_df)
target_df.tail()
[2]:
segment Finland_KaggleMart_Kaggle Hat Finland_KaggleMart_Kaggle Mug Finland_KaggleMart_Kaggle Sticker Finland_KaggleRama_Kaggle Hat Finland_KaggleRama_Kaggle Mug Finland_KaggleRama_Kaggle Sticker Norway_KaggleMart_Kaggle Hat Norway_KaggleMart_Kaggle Mug Norway_KaggleMart_Kaggle Sticker Norway_KaggleRama_Kaggle Hat Norway_KaggleRama_Kaggle Mug Norway_KaggleRama_Kaggle Sticker Sweden_KaggleMart_Kaggle Hat Sweden_KaggleMart_Kaggle Mug Sweden_KaggleMart_Kaggle Sticker Sweden_KaggleRama_Kaggle Hat Sweden_KaggleRama_Kaggle Mug Sweden_KaggleRama_Kaggle Sticker
feature target target target target target target target target target target target target target target target target target target
timestamp
2018-12-27 573 414 177 1068 652 308 898 568 270 1604 1108 436 672 420 196 1127 745 319
2018-12-28 841 499 223 1398 895 431 1162 731 361 2178 1333 662 874 555 260 1540 990 441
2018-12-29 1107 774 296 1895 1398 559 1650 1113 518 2884 1816 874 1106 720 348 2169 1438 596
2018-12-30 1113 757 326 1878 1241 554 1809 1052 500 2851 1935 833 1133 730 336 2138 1303 587
2018-12-31 822 469 238 1231 831 360 1124 728 351 2128 1383 561 823 570 250 1441 1004 388

The target ends in 2018, while the exogenous data shown below ends in 2019, so we know its values a year ahead. This implies that our exogenous data contains only regressors.

[3]:
regressor_df = TSDataset.to_dataset(regressor_df)
regressor_df.tail()
[3]:
segment Finland_KaggleMart_Kaggle Hat Finland_KaggleMart_Kaggle Mug ... Sweden_KaggleRama_Kaggle Mug Sweden_KaggleRama_Kaggle Sticker
feature precipitation snow_depth tavg tmax tmin precipitation snow_depth tavg tmax tmin ... precipitation snow_depth tavg tmax tmin precipitation snow_depth tavg tmax tmin
timestamp
2019-12-27 0.028249 109.550000 -8.529630 -3.161039 -10.895425 0.028249 109.550000 -8.529630 -3.161039 -10.895425 ... 0.105079 141.220930 -4.277778 -2.391204 -8.993458 0.105079 141.220930 -4.277778 -2.391204 -8.993458
2019-12-28 0.789266 116.421053 -9.107407 -4.703947 -15.288889 0.789266 116.421053 -9.107407 -4.703947 -15.288889 ... 1.117688 142.955224 -3.866667 -3.006542 -11.593056 1.117688 142.955224 -3.866667 -3.006542 -11.593056
2019-12-29 4.976966 117.117647 -0.418519 1.264052 -7.722078 4.976966 117.117647 -0.418519 1.264052 -7.722078 ... 1.758669 136.725146 1.755556 3.692056 -4.516204 1.758669 136.725146 1.755556 3.692056 -4.516204
2019-12-30 1.229775 160.500000 2.292593 3.344156 -0.202632 1.229775 160.500000 2.292593 3.344156 -0.202632 ... 0.561996 120.740741 4.900000 6.135648 1.859070 0.561996 120.740741 4.900000 6.135648 1.859070
2019-12-31 0.225281 124.647059 -2.859259 1.580519 -6.921569 0.225281 124.647059 -2.859259 1.580519 -6.921569 ... 0.848161 131.583333 1.722222 4.376606 -2.290278 0.848161 131.583333 1.722222 4.376606 -2.290278

5 rows × 90 columns
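
The reshaping done by TSDataset.to_dataset can be approximated in plain pandas. The sketch below is not ETNA's implementation. It only assumes that the input is in long format, with one row per (timestamp, segment) pair and `timestamp` and `segment` columns, and that the result is a wide frame whose columns form a (segment, feature) hierarchy like the one printed above. The segment names and values are shortened, made-up stand-ins.

```python
import pandas as pd

# Long format: one row per (timestamp, segment) pair, features as columns.
long_df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2015-01-01", "2015-01-02"] * 2),
        "segment": ["Finland", "Finland", "Sweden", "Sweden"],
        "tavg": [1.4, 0.6, 3.5, 3.8],
        "tmin": [-1.0, -1.0, 0.2, 0.3],
    }
)

# pivot() yields (feature, segment) columns; swap the levels and sort
# so that the columns are grouped by segment first.
wide_df = long_df.pivot(index="timestamp", columns="segment")
wide_df = wide_df.swaplevel(axis=1).sort_index(axis=1)
wide_df.columns.names = ["segment", "feature"]

print(wide_df)
```

Each segment now has its own block of feature columns, indexed by timestamp, which matches the layout of the two tables above.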

Then we have to create a TSDataset with both the target time series and the exogenous data. TSDataset expects the target time series in the df argument and the exogenous data in df_exog. They are passed separately because regressors contain information about the target’s future, and TSDataset ensures we don’t mix them.

To specify which columns of df_exog contain regressors, we use the known_future parameter. This lets TSDataset determine which columns are regressors and which are additional data. Here known_future="all" marks every column of df_exog as a regressor.

[4]:
ts = TSDataset(df=target_df, freq="D", df_exog=regressor_df, known_future="all")
ts.head()
[4]:
segment Finland_KaggleMart_Kaggle Hat Finland_KaggleMart_Kaggle Mug ... Sweden_KaggleRama_Kaggle Mug Sweden_KaggleRama_Kaggle Sticker
feature precipitation snow_depth target tavg tmax tmin precipitation snow_depth target tavg ... target tavg tmax tmin precipitation snow_depth target tavg tmax tmin
timestamp
2015-01-01 1.714141 284.545455 520.0 1.428571 2.912739 -1.015287 1.714141 284.545455 329.0 1.428571 ... 706.0 3.47 5.415354 0.221569 3.642278 84.924623 324.0 3.47 5.415354 0.221569
2015-01-02 10.016667 195.000000 493.0 0.553571 2.358599 -0.998718 10.016667 195.000000 318.0 0.553571 ... 653.0 3.80 5.097244 0.294882 2.414665 67.043702 293.0 3.80 5.097244 0.294882
2015-01-03 3.956061 284.294118 535.0 -1.739286 0.820382 -3.463871 3.956061 284.294118 360.0 -1.739286 ... 734.0 1.61 2.140392 -1.776680 0.212793 79.945946 319.0 1.61 2.140392 -1.776680
2015-01-04 0.246193 260.772727 544.0 -7.035714 -3.110828 -9.502581 0.246193 260.772727 332.0 -7.035714 ... 657.0 -1.35 -0.648425 -5.173123 0.226833 78.997290 300.0 -1.35 -0.648425 -5.173123
2015-01-05 0.036364 236.900000 378.0 -17.164286 -8.727564 -19.004487 0.036364 236.900000 243.0 -17.164286 ... 512.0 -4.27 -3.027451 -9.544488 0.515601 79.736148 227.0 -4.27 -3.027451 -9.544488

5 rows × 108 columns
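
When only some exogenous columns are known in advance, it can help to check which ones actually cover the forecast horizon before deciding what to pass as known_future. The following is a minimal pandas sketch of such a check, not part of ETNA. The column `support_requests`, the dates, and the horizon of 3 days are hypothetical.

```python
import pandas as pd

HORIZON = 3  # hypothetical forecast horizon, in days

target_index = pd.date_range("2018-12-25", "2018-12-31", freq="D")
exog_index = pd.date_range("2018-12-25", "2019-01-05", freq="D")
n_future = len(exog_index) - len(target_index)

exog = pd.DataFrame(
    {
        # Known for every date, including dates after the target ends.
        "tavg": list(range(len(exog_index))),
        # Observed only while the target is observed; unknown afterwards.
        "support_requests": [5.0] * len(target_index) + [float("nan")] * n_future,
    },
    index=exog_index,
)

# Timestamps that the forecast will cover.
future_index = pd.date_range(
    target_index[-1] + pd.Timedelta(days=1), periods=HORIZON, freq="D"
)

# A column qualifies as a regressor only if it has no gaps over the horizon.
known_future = [
    col for col in exog.columns if exog.loc[future_index, col].notna().all()
]
print(known_future)  # ['tavg']
```

A list of column names such as this one is the kind of value that would be passed as known_future instead of "all"; the remaining columns would then be treated as additional data.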

2.2 EDA

TSDataset joins exogenous data and the target time series, so we can visualize and analyze exogenous data in the same way as the target time series. More information is available in the EDA notebook.

[5]:
ts.plot(column="snow_depth", n_segments=2)
../_images/tutorials_exogenous_data_12_0.png
[6]:
ts.plot(column="precipitation", n_segments=2)
../_images/tutorials_exogenous_data_13_0.png
[7]:
ts.plot(column="target", n_segments=2)
../_images/tutorials_exogenous_data_14_0.png

3. Forecast with regressors

We will use LinearPerSegmentModel. It is a simple model that works with regressors.

Note: some models do not work with regressors. In this case, they will warn you about it.

We will forecast merchandise sales a year ahead using regressors that describe the weather.

[8]:
from etna.models import LinearPerSegmentModel

HORIZON = 365
model = LinearPerSegmentModel()

ETNA lets us configure transforms to work with exogenous data the same way they work with the target time series. In addition, transforms automatically update the information about regressors in TSDataset.

[9]:
from etna.transforms import FilterFeaturesTransform

from etna.transforms import MeanTransform  # math
from etna.transforms import DateFlagsTransform, HolidayTransform  # datetime
from etna.transforms import LagTransform  # lags

transforms = [
    LagTransform(
        in_column="target",
        lags=list(range(HORIZON, HORIZON + 28)),
        out_column="target_lag",
    ),
    LagTransform(in_column="tavg", lags=list(range(1, 3)), out_column="tavg_lag"),
    MeanTransform(in_column="tavg", window=7, out_column="tavg_mean"),
    MeanTransform(
        in_column="target_lag_365",
        out_column="target_mean",
        window=104,
        seasonality=7,
    ),
    DateFlagsTransform(
        day_number_in_week=True,
        day_number_in_month=True,
        is_weekend=True,
        special_days_in_week=[4],
        out_column="date_flag",
    ),
    HolidayTransform(iso_code="SWE", out_column="SWE_holidays"),
    HolidayTransform(iso_code="NOR", out_column="NOR_holidays"),
    HolidayTransform(iso_code="FIN", out_column="FIN_holidays"),
    LagTransform(
        in_column="SWE_holidays",
        lags=list(range(2, 6)),
        out_column="SWE_holidays_lag",
    ),
    LagTransform(
        in_column="NOR_holidays",
        lags=list(range(2, 6)),
        out_column="NOR_holidays_lag",
    ),
    LagTransform(
        in_column="FIN_holidays",
        lags=list(range(2, 6)),
        out_column="FIN_holidays_lag",
    ),
    FilterFeaturesTransform(exclude=["precipitation", "snow_depth", "tmin", "tmax"]),
]

The next steps are identical to the case when we work with the target time series only.

[10]:
from etna.pipeline import Pipeline

pipeline = Pipeline(model=model, transforms=transforms, horizon=HORIZON)
[11]:
from etna.metrics import SMAPE

metrics, forecasts, _ = pipeline.backtest(ts, metrics=[SMAPE()], aggregate_metrics=True, n_folds=2)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.5s finished
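
To make the two backtest folds more concrete, the sketch below computes the date ranges they would cover. It is an illustration only and does not call ETNA: it assumes that folds are counted back from the end of the series, that each test window is HORIZON days long, and that every training window starts at the first timestamp (an expanding window). The helper name `expanding_folds` is hypothetical.

```python
from datetime import date, timedelta


def expanding_folds(start, end, horizon, n_folds):
    """Return (train_start, train_end, test_start, test_end) for each fold.

    Assumes daily data, test windows aligned to the end of the series,
    and training windows that always begin at `start`.
    """
    folds = []
    for fold in range(n_folds):
        # Earlier folds end further back from the last timestamp.
        offset = (n_folds - 1 - fold) * horizon
        test_end = end - timedelta(days=offset)
        test_start = test_end - timedelta(days=horizon - 1)
        train_end = test_start - timedelta(days=1)
        folds.append((start, train_end, test_start, test_end))
    return folds


folds = expanding_folds(date(2015, 1, 1), date(2018, 12, 31), horizon=365, n_folds=2)
for train_start, train_end, test_start, test_end in folds:
    print(f"train {train_start}..{train_end}  test {test_start}..{test_end}")
```

Under these assumptions the first fold is evaluated on 2017 and the second on 2018, with the second fold training on one more year of history.
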
[12]:
metrics
[12]:
segment SMAPE
0 Finland_KaggleMart_Kaggle Hat 6.809976
1 Finland_KaggleMart_Kaggle Mug 7.897876
2 Finland_KaggleMart_Kaggle Sticker 7.566816
3 Finland_KaggleRama_Kaggle Hat 6.714908
4 Finland_KaggleRama_Kaggle Mug 7.443409
5 Finland_KaggleRama_Kaggle Sticker 7.540571
6 Norway_KaggleMart_Kaggle Hat 9.335215
7 Norway_KaggleMart_Kaggle Mug 11.929340
8 Norway_KaggleMart_Kaggle Sticker 11.455042
9 Norway_KaggleRama_Kaggle Hat 8.976252
10 Norway_KaggleRama_Kaggle Mug 11.691445
11 Norway_KaggleRama_Kaggle Sticker 11.594758
12 Sweden_KaggleMart_Kaggle Hat 6.837174
13 Sweden_KaggleMart_Kaggle Mug 7.319936
14 Sweden_KaggleMart_Kaggle Sticker 6.973164
15 Sweden_KaggleRama_Kaggle Hat 6.366672
16 Sweden_KaggleRama_Kaggle Mug 6.994042
17 Sweden_KaggleRama_Kaggle Sticker 7.081337
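
For readers unfamiliar with the metric, here is a minimal pure-Python sketch of SMAPE under the common percentage definition, where the score ranges from 0 to 200 and lower is better. It is meant to show how values like those in the table can be read, and it is an assumption that it mirrors the definition used by the library; the sample numbers are made up.

```python
def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        denominator = abs(actual) + abs(predicted)
        # A point where both values are zero contributes no error.
        if denominator != 0:
            total += abs(actual - predicted) / denominator
    return 200.0 * total / len(y_true)


actual = [100.0, 200.0, 300.0]
forecast = [110.0, 180.0, 300.0]
score = smape(actual, forecast)
print(round(score, 2))  # roughly 6.68
```

A score near 7, as for most Finnish and Swedish segments above, therefore corresponds to forecasts that are typically off by several percent of the combined magnitude of the actual and predicted values.
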
[13]:
from etna.analysis import plot_backtest

plot_backtest(forecasts, ts)
../_images/tutorials_exogenous_data_23_0.png

Support for more strategies for working with regressors and additional data is on the ETNA development roadmap.