# TimePartitioning
ViEWS uses a time-partitioning scheme that splits the available data into three partitions/periods: training, calibration, and testing/forecasting. The time periods for these partitions are defined by the time stamps of the observed outcomes. The approach is described in depth in Appendix A of Hegre et al. (2021).
τ refers to calendar time, but we add subscripts to identify when the partitions start and end. Because the partitions differ between evaluation and true forecasting, we also add the superscript e to all notation for the evaluation partitions. The periodization table below shows the partitioning of data for estimating model weights, hyper-parameter tuning, evaluation, and forecasting.
- The "forecast" periodization is for actual forecasting.
- The "evaluation" periodization is for testing models and ensembles.
- The training periods are used to train the models.
- The calibration periods are used for hyper-parameter tuning and to estimate model weights.
After EBMA calibration and hyper-parameter tuning, we retrain our models using both the training and calibration partitions.
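The retraining step can be sketched as follows; the month ranges here are purely illustrative, and `train`, `calib`, and `retrain` are hypothetical names, not part of any ViEWS API:

```python
# Illustrative month ranges only (not the actual ViEWS partitions).
train = range(121, 397)    # training partition: months 121-396
calib = range(397, 433)    # calibration partition: months 397-432

# After tuning on the calibration partition, the final models are
# refit on the union of the two partitions.
retrain = range(train.start, calib.stop)

assert list(retrain) == list(train) + list(calib)
```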
1. Define the partitioning scheme:

```python
# Assuming these modules come from the views_partitioning package.
from views_partitioning import data_partitioner, legacy

partitioner = data_partitioner.DataPartitioner.from_legacy_periods([
    legacy.Period("A",
        train_start=121, train_end=396,
        predict_start=397, predict_end=432)
])
```
2. Apply the partitioner:

```python
# hh_data is a (time, unit)-indexed DataFrame of features and outcomes.
training_a = partitioner("A", "train", hh_data)
print(training_a.index.get_level_values(0)[[0, -1]])
```
3. Train the model:

```python
from sklearn.ensemble import RandomForestRegressor
from stepshift import views

mdl = views.StepshiftedModels(
    RandomForestRegressor(),
    [*range(1, 4)],       # steps 1, 2, and 3
    "ln_ged_sb_dep")      # dependent variable
mdl.fit(training_a)
```
4. Generate the predictions:

```python
predictions = mdl.predict(partitioner("A", "predict", hh_data))
```
The resulting object contains rows starting at predict_start = 397 and ending at predict_end = 432:
time | unit | step_pred_1 | step_pred_2 | step_pred_3 | step_combined |
---|---|---|---|---|---|
397 | 530 | 3.248208 | 0.068794 | 0.071361 | 3.248208 |
398 | 530 | 0.060043 | 3.139737 | 0.071361 | 3.139737 |
399 | 530 | 0.060043 | 0.068794 | 3.100956 | 3.100956 |
400 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
401 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
402 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
403 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
404 | 530 | 1.534400 | 0.068794 | 0.071361 | NaN |
405 | 530 | 0.060043 | 1.540886 | 0.071361 | NaN |
406 | 530 | 0.060043 | 0.068794 | 1.649546 | NaN |
407 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
408 | 530 | 3.075878 | 0.068794 | 0.071361 | NaN |
409 | 530 | 3.255441 | 2.440132 | 0.071361 | NaN |
410 | 530 | 0.060043 | 3.509400 | 2.962537 | NaN |
411 | 530 | 0.060043 | 0.068794 | 3.711153 | NaN |
412 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
413 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
414 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
415 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
416 | 530 | 1.748024 | 0.068794 | 0.071361 | NaN |
417 | 530 | 0.060043 | 1.963898 | 0.071361 | NaN |
418 | 530 | 0.060043 | 0.068794 | 1.900073 | NaN |
419 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
420 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
421 | 530 | 2.670264 | 0.068794 | 0.071361 | NaN |
422 | 530 | 0.060043 | 2.285300 | 0.071361 | NaN |
423 | 530 | 0.060043 | 0.068794 | 2.428119 | NaN |
424 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
425 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
426 | 530 | 0.969150 | 0.068794 | 0.071361 | NaN |
427 | 530 | 0.060043 | 1.005730 | 0.071361 | NaN |
428 | 530 | 0.060043 | 0.068794 | 1.022231 | NaN |
429 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
430 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
431 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
432 | 530 | 0.060043 | 0.068794 | 0.071361 | NaN |
To stay within the naming conventions we have established, the columns currently called step_pred_1, step_pred_2, etc. should be called ss_1, ss_2, etc., or, if preferred, step_spec_1, step_spec_2. The column called step_combined is in line with convention (although it was called sc in views2). The advantage of the two-letter abbreviation is that when we generate ensembles there will be a large number of columns with different model-name prefixes followed by _ss_1 or _step_spec_1, and the shorter suffix keeps those names readable.
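A rename along these lines could be applied directly to the prediction frame; the mapping below is a sketch of the proposed convention, not an implemented feature:

```python
import pandas as pd

# Toy frame with the current column names.
preds = pd.DataFrame(
    {"step_pred_1": [0.1], "step_pred_2": [0.2], "step_combined": [0.1]})

def to_short_name(col: str) -> str:
    # Proposed convention: step_pred_<k> -> ss_<k>, step_combined -> sc.
    if col.startswith("step_pred_"):
        return col.replace("step_pred_", "ss_")
    return "sc" if col == "step_combined" else col

renamed = preds.rename(columns=to_short_name)
print(list(renamed.columns))  # ['ss_1', 'ss_2', 'sc']
```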
Throughout, the partition is defined in terms of the month of the actuals we are targeting, not in terms of the last month with data or the last month in the training set. The partial exception is step_combined, which is defined both in terms of the last month in the training set and the month of the actual. Accordingly, the step_combined series starts at predict_start and ends at predict_start plus the largest step passed to StepshiftedModels, minus one (397 through 399 in the example above).
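The diagonal structure of step_combined can be reproduced from the step-specific columns. This is a sketch of the selection rule with made-up prediction values, not the stepshift internals:

```python
import numpy as np
import pandas as pd

predict_start, steps = 397, [1, 2, 3]
times = range(predict_start, predict_start + 6)  # months 397-402
# Made-up step-specific predictions for a single unit.
preds = pd.DataFrame(
    {f"step_pred_{k}": [float(t * 10 + k) for t in times] for k in steps},
    index=times)

# step_combined at time t takes step_pred_k with k = t - predict_start + 1,
# so it is only defined for the first max(steps) months of the window.
combined = pd.Series(
    [preds.loc[t, f"step_pred_{t - predict_start + 1}"]
     if t - predict_start + 1 <= max(steps) else np.nan
     for t in times],
    index=times, name="step_combined")
```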
The figures below are illustrations of the process from Hegre et al. (2021).
The following diagram shows predictions from a step = 1 model, which explains why there are leading missing values for predictions and lagging missing values for the independent variables.
The following diagram shows how stepshifting is used to predict into the future.
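The step-shifting itself can be illustrated with a plain pandas shift; the frame and column names here are hypothetical:

```python
import pandas as pd

# To train a step-s model, features at time t are aligned with the
# outcome at time t + s, i.e. the outcome column is shifted s steps back.
s = 1
df = pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0],
                   "outcome": [10.0, 20.0, 30.0, 40.0]},
                  index=[1, 2, 3, 4])
df[f"outcome_shifted_{s}"] = df["outcome"].shift(-s)

# The last s rows have no future actual to learn from (NaN after the
# shift), which is the source of the leading/lagging missing values
# shown in the diagrams.
print(df[f"outcome_shifted_{s}"].tolist())  # [20.0, 30.0, 40.0, nan]
```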
- Hegre H, Bell C, Colaresi M, et al. ViEWS2020: Revising and evaluating the ViEWS political Violence Early-Warning System. Journal of Peace Research. 2021;58(3):599-611. doi:10.1177/0022343320962157