scoringutils
This vignette gives an overview of the default scoring rules made
available through the scoringutils
package. You can, of
course, also use your own scoring rules, provided they follow the same
format. If you want to obtain more detailed information about how the
package works, have a look at the revised
version of our scoringutils
paper.
We can distinguish two types of forecasts: point forecasts and probabilistic forecasts. A point forecast is a single number representing a single outcome. A probabilistic forecast is a full predictive probability distribution over multiple possible outcomes. In contrast to point forecasts, probabilistic forecasts incorporate uncertainty about different possible outcomes.
Scoring rules are functions that take a forecast and an observation as input and return a single numeric value. For point forecasts, they take the form S(ŷ, y), where ŷ is the forecast and y is the observation. For probabilistic forecasts, they usually take the form S(F, y), where F is the cumulative density function (CDF) of the predictive distribution and y is the observation. By convention, scoring rules are usually negatively oriented, meaning that smaller values are better (the best possible score is usually zero). In that sense, the score can be understood as a penalty.
Many scoring rules for probabilistic forecasts are so-called (strictly) proper scoring rules. Essentially, this means that they cannot be “cheated”: A forecaster evaluated by a strictly proper scoring rule is always incentivised to report her honest best belief about the future and cannot, in expectation, improve her score by reporting something else. A more formal definition is the following: Let G be the true, unobserved data-generating distribution. A scoring rule is said to be proper, if under G and for an ideal forecast F = G, there is no forecast F′ ≠ F that in expectation receives a better score than F. A scoring rule is considered strictly proper if, under G, no other forecast F′ in expectation receives a score that is better than or the same as that of F.
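The incentive property of proper scoring rules can be illustrated numerically. The following base-R sketch (not part of the scoringutils package; the variable names are purely illustrative) evaluates the expected Brier score for a grid of reported probabilities when the true probability of the outcome is 0.7. The expectation is minimised exactly at the true probability, so honest reporting is optimal:

```r
# Under the (strictly proper) Brier score, the expected score for
# reporting probability p when the true probability is p_true is
#   S(p) = p_true * (p - 1)^2 + (1 - p_true) * p^2,
# which is minimised at p = p_true.
p_true <- 0.7
candidates <- seq(0, 1, by = 0.01)
expected_brier <- p_true * (candidates - 1)^2 + (1 - p_true) * candidates^2
best <- candidates[which.min(expected_brier)]
best
#> [1] 0.7
```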
See a list of the default metrics for point forecasts by calling
get_metrics(example_point)
.
This is an overview of the input and output formats for point forecasts:
Scoring point forecasts can be a tricky business. Depending on the choice of scoring rule, a forecaster who is clearly worse than another might consistently receive better scores (see Gneiting (2011) for an illustrative example).
Every scoring rule for a point forecast is implicitly minimised by a specific aspect of the predictive distribution. The mean squared error, for example, is only a meaningful scoring rule if the forecaster actually reported the mean of their predictive distribution as a point forecast. If the forecaster reported the median, then the mean absolute error would be the appropriate scoring rule. If the scoring rule and the predictive task do not align, misleading results ensue. Consider the following example:
set.seed(123)
n <- 1000
observed <- rnorm(n, 5, 4)^2
predicted_mu <- mean(observed)
predicted_not_mu <- predicted_mu - rnorm(n, 10, 2)
mean(Metrics::ae(observed, predicted_mu))
#> [1] 34.45981
mean(Metrics::ae(observed, predicted_not_mu))
#> [1] 32.54821
mean(Metrics::se(observed, predicted_mu))
#> [1] 2171.089
mean(Metrics::se(observed, predicted_not_mu))
#> [1] 2290.155
Observation: y, a real number
Forecast: ŷ, a real number, the median of the forecaster’s predictive distribution.
The absolute error is the absolute difference between the predicted
and the observed values. See ?Metrics::ae
.
ae = |y − ŷ|
The absolute error is only an appropriate rule if ŷ corresponds to the median of the forecaster’s predictive distribution. Otherwise, results will be misleading (see Gneiting (2011)).
Observation: y, a real number
Forecast: ŷ, a real number, the mean of the forecaster’s predictive distribution.
The squared error is the squared difference between the predicted and
the observed values. See ?Metrics::se
.
se = (y − ŷ)²
The squared error is only an appropriate rule if ŷ corresponds to the mean of the forecaster’s predictive distribution. Otherwise, results will be misleading (see Gneiting (2011)).
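The correspondence between the squared error and the mean (and between the absolute error and the median) can be checked directly: for any sample, the mean minimises the average squared deviation and the median minimises the average absolute deviation. A minimal base-R sketch with an intentionally skewed sample (illustrative only, not part of the package):

```r
# For a right-skewed sample, mean > median, and the two point forecasts
# are optimal under different scoring rules.
set.seed(42)
observed <- rexp(1e5, rate = 0.2)  # right-skewed: mean approx. 5, median approx. 3.5
point_mean <- mean(observed)
point_median <- median(observed)

# The mean wins under the squared error ...
mean((observed - point_mean)^2) < mean((observed - point_median)^2)
#> [1] TRUE
# ... while the median wins under the absolute error.
mean(abs(observed - point_median)) < mean(abs(observed - point_mean))
#> [1] TRUE
```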
Observation: y, a real number
Forecast: ŷ, a real number
The absolute percentage error is the absolute percent difference
between the predicted and the observed values. See
?Metrics::ape
.
$$\text{ape} = \frac{|y - \hat{y}|}{|y|}$$
The absolute percentage error is only an appropriate rule if ŷ corresponds to the β-median of the forecaster’s predictive distribution with β = −1. The β-median, med(β)(F), is the median of a random variable whose density is proportional to yβf(y). The specific β-median that corresponds to the absolute percentage error is med(−1)(F). Otherwise, results will be misleading (see Gneiting (2011)).
See a list of the default metrics for binary forecasts by calling
get_metrics(example_binary)
.
This is an overview of the input and output formats for binary forecasts:
Observation: y, either 0 or 1
Forecast: p, a probability that the observed outcome will be 1.
The Brier score is a strictly proper scoring rule. It is computed as the mean squared error between the probabilistic prediction and the observed outcome.
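Because the Brier score is just the mean squared error between forecast probability and 0/1 outcome, it can be computed by hand in a few lines. The sketch below uses only base R (the helper name brier_by_hand is illustrative, not a package function):

```r
# The Brier score by hand: squared difference between the forecast
# probability and the binary (0/1) outcome, averaged over forecasts.
set.seed(1)
y <- rbinom(100, size = 1, prob = 0.7)  # observed 0/1 outcomes
p <- 0.7                                # forecast probability for outcome 1
brier_by_hand <- mean((p - y)^2)
```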
The Brier score and the logarithmic score (see below) differ in how they penalise over- and underconfidence (see Machete (2012)). The Brier score penalises overconfidence and underconfidence in probability space equally. Consider the following example:
n <- 1e6
p_true <- 0.7
observed <- factor(rbinom(n = n, size = 1, prob = p_true), levels = c(0, 1))
p_over <- p_true + 0.15
p_under <- p_true - 0.15
abs(mean(brier_score(observed, p_true)) - mean(brier_score(observed, p_over)))
#> [1] 0.0223866
abs(mean(brier_score(observed, p_true)) - mean(brier_score(observed, p_under)))
#> [1] 0.0226134
See ?brier_score()
for more information.
Observation: y, either 0 or 1
Forecast: p, a probability that the observed outcome will be 1.
The logarithmic score (or log score) is a strictly proper scoring rule. It is computed as the negative logarithm of the probability assigned to the observed outcome.
The log score penalises overconfidence more strongly than underconfidence (in probability space). Consider the following example:
abs(mean(logs_binary(observed, p_true)) - mean(logs_binary(observed, p_over)))
#> [1] 0.07169954
abs(mean(logs_binary(observed, p_true)) - mean(logs_binary(observed, p_under)))
#> [1] 0.04741833
See ?logs_binary()
for more information.
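The definition of the binary log score is short enough to write out directly. A minimal base-R sketch (the helper name log_score_binary is illustrative): the penalty is small when a confident forecast is right, and large when it is wrong.

```r
# Binary log score by hand: the negative log of the probability
# assigned to the outcome that actually occurred.
log_score_binary <- function(y, p) ifelse(y == 1, -log(p), -log(1 - p))

log_score_binary(1, 0.8)  # confident and right: small penalty
#> [1] 0.2231436
log_score_binary(0, 0.8)  # confident and wrong: much larger penalty
#> [1] 1.609438
```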
See a list of the default metrics for sample-based forecasts by
calling get_metrics(example_sample_continuous)
.
This is an overview of the input and output formats for sample-based forecasts:
Observation: y, a real number (or a discrete number).
Forecast: A continuous (F) or discrete (P) forecast.
The continuous ranked probability score (CRPS) is popular in fields such as meteorology and epidemiology. The CRPS is defined as $$ \text{CRPS}(F, y) = \int_{-\infty}^\infty \left(F(x) - \mathbf{1}(x \geq y)\right)^2 dx, $$ where y is the observed value and F the CDF of the predictive distribution.
For discrete forecasts, for example count data, the ranked probability score (RPS) can be used instead and is commonly defined as: $$ \text{RPS}(P, y) = \sum_{x = 0}^\infty (P(x) - 1(x \geq y))^2, $$ where P is the cumulative probability mass function (PMF) of the predictive distribution.
The CRPS can be understood as a generalisation of the absolute error to predictive distributions (Gneiting and Raftery 2007). It can also be understood as the integral over the Brier score for the binary probability forecasts implied by the CDF for all possible observed values. The CRPS is also related to the Cramér-distance between two distributions and equals the special case where one of the distributions is concentrated in a single point (see e.g. Ziel (2021)). The CRPS is a global scoring rule, meaning that the entire predictive distribution is taken into account when determining the quality of the forecast.
scoringutils
re-exports the crps_sample()
function from the scoringRules
package, which assumes that
the forecast is represented by a set of samples from the predictive
distribution. See ?crps_sample()
for more information.
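A sample-based CRPS estimate can also be sketched in base R using the well-known kernel representation CRPS(F, y) = E|X − y| − ½ E|X − X′|, where X and X′ are independent draws from F. The helper name crps_from_samples below is illustrative, not the package implementation (which is scoringRules::crps_sample()):

```r
# Sample-based CRPS estimate via CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|.
crps_from_samples <- function(y, x) {
  mean(abs(x - y)) - 0.5 * mean(abs(outer(x, x, "-")))
}
set.seed(123)
x <- rnorm(2000)         # predictive samples, here from N(0, 1)
crps_from_samples(0, x)  # close to 0.2337, the closed-form CRPS of N(0, 1) at y = 0
```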
The CRPS can be interpreted as a sum of a dispersion, an overprediction and an underprediction component. If m is the median forecast, then the dispersion component is CRPS(F, m), the overprediction component is $$ \begin{cases} \text{CRPS}(F, y) - \text{CRPS}(F, m) & \text{if } m > y\\ 0 & \text{if } m \leq y\\ \end{cases} $$ and the underprediction component is $$ \begin{cases} \text{CRPS}(F, y) - \text{CRPS}(F, m) & \text{if } m < y\\ 0 & \text{if } m \geq y\\ \end{cases} $$
These can be accessed via the dispersion_sample()
,
overprediction_sample()
and
underprediction_sample()
functions, respectively.
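By construction, the three components sum to the full CRPS. This can be verified numerically in base R (the sample-based estimator crps_est below is an illustrative helper, not the package implementation):

```r
# Check: dispersion + overprediction + underprediction = CRPS(F, y).
crps_est <- function(y, x) mean(abs(x - y)) - 0.5 * mean(abs(outer(x, x, "-")))
set.seed(1)
x <- rnorm(2000, mean = 2)  # predictive samples; median is above y
y <- 0
m <- median(x)
dispersion <- crps_est(m, x)
overprediction <- crps_est(y, x) - crps_est(m, x)  # m > y, so this is nonzero
underprediction <- 0                               # zero because m > y
dispersion + overprediction + underprediction      # equals crps_est(y, x)
```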
Observation: y, a real number (or a discrete number).
Forecast: A continuous (F) or discrete (P) forecast.
The logarithmic scoring rule is simply the negative logarithm of the density of the predictive distribution evaluated at the observed value:
log score(F, y) = −log f(y),
where f is the predictive probability density function (PDF) corresponding to the forecast F and y is the observed value.
For discrete forecasts, the log score can be computed as
log score(F, y) = −log py, where py is the probability assigned to the observed outcome y by the forecast F.
The logarithmic scoring rule can produce large penalties when the observed value takes on values for which f(y) (or py) is close to zero. It is therefore considered to be sensitive to outlier forecasts. This may be desirable in some applications, but it also means that scores can easily be dominated by a few extreme values. The logarithmic scoring rule is a local scoring rule, meaning that the score only depends on the probability that was assigned to the actual outcome. This is often regarded as a desirable property, for example in the context of Bayesian inference. It implies, for example, that the ranking between forecasters is invariant under monotone transformations of the predictive distribution and the target.
scoringutils
re-exports the logs_sample()
function from the scoringRules
package, which assumes that
the forecast is represented by a set of samples from the predictive
distribution. One implication of this is that it is currently not
advisable to use the log score for discrete forecasts. The reason for
this is that scoringRules::logs_sample()
estimates a
predictive density from the samples, which can be problematic for
discrete forecasts.
See ?logs_sample()
for more information.
Observation: y, a real number (or a discrete number).
Forecast: F. The predictive distribution with mean μ and standard deviation σ.
The Dawid-Sebastiani score is a proper scoring rule that only relies on the first moments of the predictive distribution and is therefore easy to compute. It is given as
$$\text{dss}(F, y) = \left( \frac{y - \mu}{\sigma} \right)^2 + 2 \cdot \log \sigma.$$
scoringutils
re-exports the implementation of the DSS
from the scoringRules
package. It assumes that the forecast
is represented by a set of samples drawn from the predictive
distribution. See ?dss_sample()
for more information.
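Since the DSS only needs the first two moments, it is straightforward to compute by hand from a set of predictive samples. A base-R sketch (the helper name dss_by_hand is illustrative; the sample mean and standard deviation stand in for μ and σ):

```r
# Dawid-Sebastiani score by hand from predictive samples:
# ((y - mu) / sigma)^2 + 2 * log(sigma), with moments estimated from samples.
dss_by_hand <- function(y, x) {
  mu <- mean(x)
  sigma <- sd(x)
  ((y - mu) / sigma)^2 + 2 * log(sigma)
}
set.seed(1)
x <- rnorm(1e4, mean = 0, sd = 1)
dss_by_hand(0, x)  # close to 0 for a standard normal forecast and y = 0
```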
Observation: Not required.
Forecast: F, the predictive distribution.
Dispersion (also called sharpness) is the ability to produce narrow forecasts. It is a feature of the forecasts only and does not depend on the observations. Dispersion is therefore only of interest conditional on calibration: a very precise forecast is not useful if it is clearly wrong.
One way to measure sharpness (as suggested by Funk et al. (2019)) is the normalised median absolute deviation about the median (MADN). It is computed as $$ S(F) = \frac{1}{0.675} \cdot \text{median}(|F - \text{median}(F)|). $$ If the forecast F follows a normal distribution, then sharpness will equal the standard deviation of F. For more details, see
?mad_sample()
.
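The normalisation constant can be seen at work in base R: stats::mad() uses the default constant 1.4826 ≈ 1/0.675, so for (approximately) normal samples the MADN recovers the standard deviation. A quick illustrative check:

```r
# Base R's mad() applies the constant 1.4826 (approx. 1 / 0.675), so for a
# normal forecast the normalised MAD estimates the standard deviation.
set.seed(123)
x <- rnorm(1e5, mean = 10, sd = 3)  # predictive samples with sd = 3
mad(x)                              # close to 3
```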
Observation: y, a real number (or a discrete number).
Forecast: A continuous (F) or discrete (P) forecast.
Bias is a measure of the tendency of a forecaster to over- or
underpredict. For continuous forecasts, the
scoringutils
implementation calculates bias as B(F, y) = 1 − 2 ⋅ F(y),
where F(y) is the
cumulative distribution function of the forecast.
For discrete forecasts, we calculate bias as B(P, y) = 1 − (P(y) + P(y + 1)). where P(y) is the cumulative probability assigned to all outcomes smaller or equal to y, i.e. the cumulative probability mass function.
Bias is bound between -1 and 1 and represents the tendency of forecasts to be biased rather than the absolute amount of over- and underprediction (which is, e.g., captured by the weighted interval score (WIS), see below).
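For sample-based forecasts, B(F, y) = 1 − 2 ⋅ F(y) can be sketched in base R with the empirical CDF of the predictive samples (the helper name bias_by_hand is illustrative, not the package's bias_sample()):

```r
# Bias by hand: 1 - 2 * F(y), with F estimated by the empirical CDF
# of the predictive samples.
bias_by_hand <- function(y, x) 1 - 2 * ecdf(x)(y)
set.seed(1)
x <- rnorm(1e4, mean = 5)  # predictive samples centred on 5
bias_by_hand(2, x)         # forecast far above the observation: close to 1
bias_by_hand(5, x)         # observation at the forecast centre: close to 0
```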
Observation: y, a real number (or a discrete number).
Forecast: A forecast F
ae_median(F, y) = |median(F) − y|. See section A note of caution or Gneiting (2011) for a discussion on the correspondence between the absolute error and the median.
Observation: y, a real number (or a discrete number).
Forecast: A forecast F
se_mean(F, y) = (mean(F) − y)². See section A note of caution or Gneiting (2011) for a discussion on the correspondence between the squared error and the mean.
See a list of the default metrics for quantile-based forecasts by
calling get_metrics(example_quantile)
.
This is an overview of the input and output formats for quantile forecasts:
Observation: y, a real number
Forecast: F. The CDF of the predictive distribution is represented by a set of quantiles. These quantiles form the lower (l) and upper (u) bounds of central prediction intervals.
The weighted interval score (WIS) is a strictly proper scoring rule and can be understood as an approximation of the CRPS for forecasts in a quantile format (which in turn represents a generalisation of the absolute error). Quantiles are assumed to be the lower and upper bounds of prediction intervals symmetric around the median. For a single interval, the interval score is
$$IS_\alpha(F,y) = \underbrace{(u-l)}_\text{dispersion} + \underbrace{\frac{2}{\alpha} \cdot (l-y) \cdot \mathbf{1}(y \leq l)}_{\text{overprediction}} + \underbrace{\frac{2}{\alpha} \cdot (y-u) \cdot \mathbf{1}(y \geq u)}_{\text{underprediction}}, $$
where 1() is the indicator function, and l and u are the $\frac{\alpha}{2}$ and $1 - \frac{\alpha}{2}$ quantiles of the predictive distribution F. l and u together form the prediction interval. The interval score can be understood as the sum of three components: dispersion, overprediction and underprediction.
For a set of K prediction intervals and the median m, the score is given as a weighted sum of individual interval scores, i.e.
$$WIS = \frac{1}{K + 0.5} \cdot \left(w_0 \cdot |y - m| + \sum_{k = 1}^{K} w_k \cdot IS_{\alpha_{k}}(F, y)\right),$$ where m is the median forecast and wk is a weight assigned to every interval. When the weights are set to $w_k = \frac{\alpha_k}{2}$ and w0 = 0.5, then the WIS converges to the CRPS for an increasing number of equally spaced quantiles.
See ?wis()
for more information.
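The interval score for a single central prediction interval can be written out directly from the formula above. A base-R sketch (the helper name interval_score is illustrative; scoringutils computes the WIS via wis()):

```r
# Interval score by hand for one central prediction interval with
# nominal coverage 1 - alpha: dispersion + overprediction + underprediction.
interval_score <- function(y, lower, upper, alpha) {
  (upper - lower) +
    (2 / alpha) * (lower - y) * (y <= lower) +
    (2 / alpha) * (y - upper) * (y >= upper)
}
# 50% central interval (alpha = 0.5) from the 0.25 and 0.75 quantiles:
interval_score(y = 12, lower = 2, upper = 10, alpha = 0.5)  # 8 + 4 * 2
#> [1] 16
interval_score(y = 5, lower = 2, upper = 10, alpha = 0.5)   # covered: dispersion only
#> [1] 8
```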
These are the individual components of the WIS. See
?overprediction_quantile()
,
?underprediction_quantile()
and
?dispersion_quantile()
for more information.
Observation: y, a real number
Forecast: F. The CDF of the predictive distribution is represented by a set of quantiles, Q.
Bias can be measured as
$$B(F, y) = (1 - 2 \cdot \max \{\alpha \,|\, q_\alpha \in Q, q_\alpha \leq y\}) \cdot \mathbf{1}(y \leq q_{0.5}) + (1 - 2 \cdot \min \{\alpha \,|\, q_\alpha \in Q, q_\alpha \geq y\}) \cdot \mathbf{1}(y \geq q_{0.5}),$$
where qα is the α-quantile of the predictive distribution. For consistency, we define Q (the set of quantiles that form the predictive distribution F) such that it always includes the element q0 = −∞ and q1 = ∞. In clearer terms: if the observed value lies below the median, bias is one minus twice the maximum quantile level whose quantile is still smaller than or equal to the observed value; if it lies above the median, bias is one minus twice the minimum quantile level whose quantile is still greater than or equal to the observed value.
Bias can assume values between -1 (underprediction) and 1 (overprediction) and is ideally 0 (i.e. unbiased).
For an increasing number of quantiles, the percentile rank will equal the proportion of predictive samples below the observed value, and the bias metric coincides with the one for continuous forecasts (see above).
See ?bias_quantile()
for more information.
Observation: y, a real number
Forecast: F. The CDF of the predictive distribution is represented by a set of quantiles. These quantiles form central prediction intervals.
Interval coverage for a given interval range is defined as the proportion of observations that fall within the corresponding central prediction intervals. Central prediction intervals are symmetric around the median and formed by two quantiles that denote the lower and upper bound. For example, the 50% central prediction interval is the interval between the 0.25 and 0.75 quantiles of the predictive distribution.
The interval coverage deviation is the difference between the observed interval coverage and the nominal interval coverage. For example, if the observed interval coverage for the 50% central prediction interval is 0.6, then the interval coverage deviation is 0.6 − 0.5 = 0.1.
interval coverage deviation = observed interval coverage − nominal interval coverage
Observation: y, a real number
Forecast: F. The CDF of the predictive distribution is represented by a set of quantiles.
The absolute error of the median is the absolute difference between the median of the predictive distribution and the observed value.
ae_median = |median(F) − y| See section A note of caution or Gneiting (2011) for a discussion on the correspondence between the absolute error and the median.
Observation: y, a real number
Forecast: F. The CDF of the predictive distribution is represented by a set of quantiles.
The quantile score, also called pinball loss, for a single quantile level τ is defined as
$$\text{QS}_\tau(F, y) = 2 \cdot \{\mathbf{1}(y \leq q_\tau) - \tau\} \cdot (q_\tau - y),$$
with qτ being the τ-quantile of the predictive distribution F, and 1(⋅) the indicator function.
The (unweighted) interval score (see above) for a 1 − α prediction interval can be computed from the quantile scores at levels α/2 and 1 − α/2 as
$$\text{IS}_\alpha(F, y) = \frac{\text{QS}_{\alpha/2}(F, y) + \text{QS}_{1 - \alpha/2}(F, y)}{\alpha}$$.
The weighted interval score can be obtained as a simple average of the quantile scores:
$$\text{WIS}_\alpha(F, y) = \frac{\text{QS}_{\alpha/2}(F, y) + \text{QS}_{1 - \alpha/2}(F, y)}{2}$$.
See ?quantile_score
and Bracher
et al. (2021) for more details.
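The relationship between the quantile score and the interval score can be checked numerically. The sketch below uses illustrative base-R helpers (quantile_score here is twice the pinball loss, matching the definition above, not the package function) and verifies the identity IS_α = (QS_{α/2} + QS_{1−α/2}) / α for one example:

```r
# Numerical check: the interval score equals the sum of the quantile
# scores at levels alpha/2 and 1 - alpha/2, divided by alpha.
quantile_score <- function(y, q, tau) 2 * ((y <= q) - tau) * (q - y)
interval_score <- function(y, lower, upper, alpha) {
  (upper - lower) +
    (2 / alpha) * (lower - y) * (y <= lower) +
    (2 / alpha) * (y - upper) * (y >= upper)
}
alpha <- 0.2                 # 80% central prediction interval
l <- 1; u <- 9; y <- 10.5    # lower/upper quantiles and an observation above u
qs_sum <- quantile_score(y, l, alpha / 2) + quantile_score(y, u, 1 - alpha / 2)
abs(qs_sum / alpha - interval_score(y, l, u, alpha)) < 1e-10
#> [1] TRUE
```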
Quantile coverage for a given quantile level is defined as the proportion of observed values that are smaller than the corresponding predictive quantiles. For example, the 0.5 quantile coverage is the proportion of observed values that are smaller than the 0.5-quantiles of the predictive distribution.