---
title: "stackr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{stackr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r setup,echo=FALSE}
library("stackr")
```
# Introduction
The `stackr` package provides an easy way to combine predictions
from individual time series or panel data models into an
ensemble. `stackr` stacks models according to the Continuous Ranked Probability
Score (CRPS) over k-step-ahead predictions and is therefore especially
suited for time series and panel data. A function for
leave-one-out CRPS may be added in the future. Predictions need to be
predictive distributions represented by predictive samples. Usually, these will
be sets of posterior predictive simulation draws generated by an MCMC
algorithm.
Given training data with true observed values as well as predictive samples
generated from different models, `crps_weights` finds the optimal (in the sense of
minimizing expected cross-validation predictive error) weights
to form an ensemble from these models. Using these weights,
`mixture_from_samples` can then provide samples from the optimal
model mixture by drawing from the predictive samples
of the individual models in the correct proportions. This gives a mixture model
based solely on predictive samples and is in this regard superior to other
ensembling techniques such as Bayesian Model Averaging.
# Usage
Here is an example of how to use `stackr` with a toy data set. The data
contains true observed values as well as predictive samples generated by
different models.
### Load example data and split into train and test data
``` {r}
library("data.table")
splitdate <- as.Date("2020-03-28")
print(example_data)
```
``` {r}
traindata <- example_data[date <= splitdate]
testdata <- example_data[date > splitdate]
```
### Get weights and create mixture
Obtain weights based on the training data, then create the mixture based on the
predictive samples in the test data.
``` {r}
weights <- stackr::crps_weights(traindata)
test_mixture <- stackr::mixture_from_samples(testdata, weights = weights)
print(test_mixture)
```
### Score predictions
Score predictions using the `scoringutils` package (install from CRAN using
`install.packages("scoringutils")`).
``` {r eval=FALSE}
# combine data.frame with mixture with predictions from other models
score_df <- data.table::rbindlist(list(testdata, test_mixture), fill = TRUE)
# CRPS score for all predictions using scoringutils
score_df[, crps := scoringutils::crps(unique(observed), t(predicted)),
by = .(geography, model, date)
]
# summarise scores
score_df[, mean(crps), by = model][, data.table::setnames(.SD, "V1", "CRPS")]
```
Display other metrics:
``` {r eval=FALSE}
data.table::setnames(
score_df, c("date", "observed", "sample_id", "predicted"),
c("id", "true_values", "sample", "predictions")
)
scoringutils::eval_forecasts(score_df[geography == "Tatooine"])
scoringutils::eval_forecasts(score_df[geography == "Coruscant"])
```
# Methods
Given a probabilistic forecast whose cumulative distribution function (CDF) $F$ has a finite first moment,
the corresponding Continuous Ranked Probability Score for an observation $y$ can be written as
$$crps(F,y)=\mathbb{E}_X|X-y|- \frac{1}{2}\mathbb{E}_{X,X^\prime}|X-X^\prime|,\tag{1}$$
where $X$ and $X^\prime$ are independent random variables distributed according to $F$.
## CRPS for one model and one single observation
The CRPS can be used to stack different models to obtain one ensemble mixture model. Let us
assume we have data from $T$ time points $t = 1, \dots, T$ in $R$ regions $r = 1, \dots, R$.
Observations are denoted $y_{tr}$. Predictive samples are generated from $K$ different
models $k = 1, \dots, K$. For every observation $y_{tr}$, the $S$ predictive samples of model $k$
are denoted $x_{1ktr}, \dots, x_{Sktr}$.
Using these predictive draws, we can estimate the CRPS of the $k$-th model for the
observation $y_{tr}$ at time $t$ in region $r$ as
$$ \widehat{crps}_{ktr} = \widehat{crps}(x_{1ktr}, \dots, x_{Sktr}, y_{tr}) = \frac{1}{S} \sum_{s=1}^S |x_{sktr}-y_{tr}| -
\frac{1}{2S^2} \sum_{s, j=1}^S |x_{sktr}- x_{jktr}|. \tag{2}$$
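For illustration, equation (2) can be evaluated with a few lines of base R. This is a minimal sketch of the formula for a single observation with made-up toy samples, not the implementation used inside `stackr`:
```{r}
# Sample-based CRPS estimate from equation (2) for a single observation y and
# a vector of S predictive samples x. Illustrative sketch only.
crps_sample_estimate <- function(x, y) {
  S <- length(x)
  term1 <- mean(abs(x - y))                       # (1/S) * sum_s |x_s - y|
  term2 <- sum(abs(outer(x, x, "-"))) / (2 * S^2) # (1/(2S^2)) * sum_{s,j} |x_s - x_j|
  term1 - term2
}

# Toy example: 1000 predictive samples for an observed value of 1.5
set.seed(1)
x <- rnorm(1000, mean = 2, sd = 1)
crps_sample_estimate(x, y = 1.5)
```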
## CRPS for a mixture of all models and one single observation
Now we want to aggregate the predictions from these $K$ models. When the prediction is a mixture of the $K$
models with weights $w_1, \dots, w_K$, the CRPS can be expressed as
$$ \widehat{crps}_{agg, tr} (w_1, \dots, w_K) = \frac{1}{S} \sum_{k=1}^K w_k \sum_{s=1}^S |x_{sktr}-y_{tr}| -
\frac{1}{2S^2} \sum_{k, k'=1}^K w_k w_{k'} \sum_{s, j=1}^S |x_{sktr}- x_{jk'tr}|. \tag{3}$$
This expression is quadratic in $w$. We therefore only need to compute $\sum_{s=1}^S |x_{sktr}-y_{tr}|$ for all $k$ and
$\sum_{s, j=1}^S |x_{sktr}- x_{jk'tr}|$ for all pairs $k, k'$ once, and can reuse these sums for every weight vector considered during the optimization.
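The following sketch makes this precompute-then-evaluate idea concrete for a single observation. The function names `precompute_crps_terms()` and `mixture_crps()` are hypothetical and only used for illustration here; they are not part of the `stackr` API:
```{r}
# Precompute the sums appearing in equation (3) once for a single observation.
# x_list: a list of K numeric vectors, each holding S predictive samples from
# one model; y: the observed value.
precompute_crps_terms <- function(x_list, y) {
  S <- length(x_list[[1]])
  K <- length(x_list)
  # a_k = sum_s |x_sk - y|
  a <- vapply(x_list, function(x) sum(abs(x - y)), numeric(1))
  # B_{kk'} = sum_{s,j} |x_sk - x_jk'|
  B <- matrix(0, K, K)
  for (k in seq_len(K)) {
    for (kp in seq_len(K)) {
      B[k, kp] <- sum(abs(outer(x_list[[k]], x_list[[kp]], "-")))
    }
  }
  list(a = a, B = B, S = S)
}

# Evaluate the mixture CRPS of equation (3) for any weight vector w,
# reusing the stored sums.
mixture_crps <- function(w, terms) {
  sum(w * terms$a) / terms$S -
    c(w %*% terms$B %*% w) / (2 * terms$S^2)
}

# Toy example with two models and equal weights
set.seed(1)
x_list <- list(rnorm(500, 0, 1), rnorm(500, 1, 2))
terms <- precompute_crps_terms(x_list, y = 0.3)
mixture_crps(c(0.5, 0.5), terms)
```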
## Obtaining the weights that minimize CRPS
The CRPS of the mixture over all observations can simply be obtained by
summing the individual results from equation (3) over all regions and time points.
However, we can also weight predictions from different time points and regions differently.
This makes sense, for example, if we have very little data in the beginning, or if
older observations are less characteristic of the current and future
dynamics. In that case we might want to downweight the early phase in the final
model evaluation. Similarly, we might want to give more or less weight to
certain regions. Mathematically, we can introduce time-varying weights
$\lambda_1, \dots, \lambda_T$, e.g. $\lambda_t = 2-(1-t/T)^2$, to
penalize earlier estimates. Likewise, we can introduce region-specific
weights $\tau_r$.
To obtain the CRPS weights, we finally solve the quadratic optimization problem
\begin{align*}
&\min_{w_1, \dots, w_K} \sum_{t=1}^T \sum_{r=1}^R \lambda_t \tau_r \, \widehat{crps}_{agg, tr}(w), \\
&\text{s.t.} \quad 0 \leq w_1, \dots, w_K \leq 1, \quad \sum_{k=1}^K w_k = 1.
\end{align*}
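As an illustration of this weighted optimization, the sketch below reuses the hypothetical `precompute_crps_terms()` and `mixture_crps()` helpers from above on simulated toy data. It handles the simplex constraint with a softmax reparameterization and base R's `optim()`; this is only one possible approach and not the routine used inside `crps_weights()`:
```{r}
# Toy setup: K = 3 models, S = 200 samples, 10 time points, 2 regions.
# All names are illustrative; crps_weights() handles all of this internally.
set.seed(42)
K <- 3; S <- 200; n_time <- 10; n_region <- 2
obs <- matrix(rnorm(n_time * n_region), n_time, n_region)

terms_list <- list()
lam_tau <- numeric(0)
for (tt in seq_len(n_time)) {
  for (rr in seq_len(n_region)) {
    # predictive samples of the K toy models for observation y_{tr}
    x_list <- lapply(seq_len(K), function(k) {
      rnorm(S, mean = obs[tt, rr] + (k - 2) * 0.5, sd = k)
    })
    terms_list[[length(terms_list) + 1]] <- precompute_crps_terms(x_list, obs[tt, rr])
    # lambda_t * tau_r, here with lambda_t = 2 - (1 - t/T)^2 and tau_r = 1
    lam_tau <- c(lam_tau, (2 - (1 - tt / n_time)^2) * 1)
  }
}

# Weighted objective: sum over t and r of lambda_t * tau_r * crps_agg,tr(w)
objective <- function(theta) {
  w <- exp(theta) / sum(exp(theta)) # softmax keeps the weights on the simplex
  sum(lam_tau * vapply(terms_list, function(tm) mixture_crps(w, tm), numeric(1)))
}

fit <- optim(rep(0, K), objective, method = "BFGS")
w_hat <- exp(fit$par) / sum(exp(fit$par))
round(w_hat, 3)
```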
# References
- Using Stacking to Average Bayesian Predictive Distributions, Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman, 2018, Bayesian Analysis 13, Number 3, pp. 917–1003
- Strictly Proper Scoring Rules, Prediction, and Estimation,
Tilmann Gneiting and Adrian E. Raftery, 2007, Journal of the American
Statistical Association, Volume 102, Issue 477
- Comparing Bayes Model Averaging and Stacking When Model Approximation Error Cannot be Ignored,
Bertrand Clarke, 2003, Journal of Machine Learning Research 4
- Bayesian Model Weighting: The Many Faces of Model Averaging,
Marvin Höge, Anneli Guthke and Wolfgang Nowak, 2020, Water, doi:10.3390/w12020309
- Bayesian Stacking and Pseudo-BMA weights using the loo package,
Aki Vehtari and Jonah Gabry, 2019, https://mc-stan.org/loo/articles/loo2-weights.html