This vignette is partly based on this nowcasting example by Sebastian Funk and Sam Abbott.
In this case study we walk through a simple approach to jointly
estimating the effective reproduction number over time and the delay
from a positive test to this test being reported. For more on the
epinowcast
package see the documentation.
An epidemic is in progress. We want to know what the effective reproduction number is at the current time. This is the average number of secondary infections caused by a single infected individual at a given point in time and is a measure of the transmissibility of the infection in the population at that time.
Unfortunately, we can’t directly observe infections and the data we do observe suffers from reporting delays.
epinowcast
is still in development and so the following
case study has pain points and limitations that we are working on
improving. If you have any suggestions or feedback please get in touch
via the issues
page. We also welcome contributions via pull requests (see our contributing
guidelines.
See the package
documentation for guidance on installing the package and getting
setup with cmdstanr
(the backend used here for fitting
models).
As well as the epinowcast
package we will use the
following packages in this vignette.
data.table
for simple data
transformations but tools from the tidyverse or base R tools could just
as easily be used;purrr
to make use of the
partial()
function which allows us to partially specify a
function for later reuse. This is useful for specifying the model as we
can specify the parts of the model that are common to all models and
then specify the parts that are specific to each model when we call the
updated function;ggplot2
for plotting.In this case study, we use data sourced from the Robert Koch Institute via the Germany Nowcasting hub. The data represent hospitalisation counts by date of positive test and date of test report in Germany. These data have been used extensively for nowcasting and forecasting COVID-19 hospitalisations in Germany and are described in detail in Wolffram et al. (2023). We use these data as they are a good example of the type of data that are available in many settings. We first summarise the data,
summary(germany_covid19_hosp)
#> reference_date location age_group confirm
#> Min. :2021-04-06 DE : 90405 00-04:219555 Min. : 0.00
#> 1st Qu.:2021-05-15 DE-BB : 90405 00+ :219555 1st Qu.: 0.00
#> Median :2021-06-23 DE-BE : 90405 05-14:219555 Median : 1.00
#> Mean :2021-06-25 DE-BW : 90405 15-34:219555 Mean : 11.44
#> 3rd Qu.:2021-08-02 DE-BY : 90405 35-59:219555 3rd Qu.: 6.00
#> Max. :2021-10-20 DE-HB : 90405 60-79:219555 Max. :1552.00
#> (Other):994455 80+ :219555
#> report_date
#> Min. :2021-04-06
#> 1st Qu.:2021-06-24
#> Median :2021-08-03
#> Mean :2021-07-31
#> 3rd Qu.:2021-09-11
#> Max. :2021-10-20
#>
In this case study we only consider national level data without
stratification by age group. We can filter these data using
data.table
as follows,
germany_hosp <- germany_covid19_hosp[location == "DE"][age_group == "00+"]
germany_hosp <- germany_hosp[, .(reference_date, report_date, confirm)]
germany_hosp
#> reference_date report_date confirm
#> <IDat> <Date> <int>
#> 1: 2021-04-06 2021-04-06 149
#> 2: 2021-04-07 2021-04-07 312
#> 3: 2021-04-08 2021-04-08 424
#> 4: 2021-04-09 2021-04-09 288
#> 5: 2021-04-10 2021-04-10 273
#> ---
#> 12911: 2021-07-27 2021-10-16 81
#> 12912: 2021-07-28 2021-10-17 159
#> 12913: 2021-07-29 2021-10-18 145
#> 12914: 2021-07-30 2021-10-19 117
#> 12915: 2021-07-31 2021-10-20 132
These data are already in a format that can be used with
epinowcast
, as it contains
reference_date
): the date of
the observation, in this example the date of a positive testreport_date
): the date of report
for a given set of observations by reference dateconfirm
): the total number of
hospitalisations by reference date and report date. Note that this is
cumulative by report date.The package also provides a range of tools to convert data from line list, incidence, or other common formats into the required format (see Data converters). For example we could convert the data to a individual case linelist,
germany_covid19_hosp_linelist <- germany_covid19_hosp |>
enw_add_incidence() |>
enw_incidence_to_linelist()
germany_covid19_hosp_linelist
#> id reference_date location age_group report_date delay
#> <int> <IDat> <fctr> <fctr> <IDat> <int>
#> 1: 1 2021-04-06 DE 00+ 2021-04-06 0
#> 2: 2 2021-04-06 DE 00+ 2021-04-06 0
#> 3: 3 2021-04-06 DE 00+ 2021-04-06 0
#> 4: 4 2021-04-06 DE 00+ 2021-04-06 0
#> 5: 5 2021-04-06 DE 00+ 2021-04-06 0
#> ---
#> 11486202: 11486202 2021-10-20 DE-TH 15-34 2021-10-20 115
#> 11486203: 11486203 2021-10-20 DE-TH 60-79 2021-10-20 117
#> 11486204: 11486204 2021-10-20 DE-TH 60-79 2021-10-20 117
#> 11486205: 11486205 2021-10-20 DE-TH 60-79 2021-10-20 117
#> 11486206: 11486206 2021-10-20 DE-TH 60-79 2021-10-20 117
This linelist could then be itself converted into the format
epinowcast
requires using the
enw_linelist_to_incidence()
function,
incidence_from_linelist <- enw_linelist_to_incidence(
germany_covid19_hosp_linelist,
reference_date = "reference_date",
report_date = "report_date",
by = c("age_group", "location"),
max_delay = 30
)
incidence_from_linelist
#> Key: <age_group, location>
#> age_group location report_date reference_date new_confirm confirm
#> <fctr> <fctr> <IDat> <IDat> <int> <int>
#> 1: 00-04 DE-BY 2021-04-06 <NA> 0 0
#> 2: 00-04 DE-BY 2021-04-07 <NA> 0 0
#> 3: 00-04 DE-BY 2021-04-08 <NA> 0 0
#> 4: 00-04 DE-BY 2021-04-09 <NA> 0 0
#> 5: 00-04 DE-BY 2021-04-10 <NA> 0 0
#> ---
#> 2049593: 80+ DE-TH 2021-10-19 2021-10-18 0 0
#> 2049594: 80+ DE-TH 2021-10-20 2021-10-18 0 0
#> 2049595: 80+ DE-TH 2021-10-19 2021-10-19 1 1
#> 2049596: 80+ DE-TH 2021-10-20 2021-10-19 0 1
#> 2049597: 80+ DE-TH 2021-10-20 2021-10-20 0 0
#> delay
#> <int>
#> 1: 0
#> 2: 1
#> 3: 2
#> 4: 3
#> 5: 4
#> ---
#> 2049593: 1
#> 2049594: 2
#> 2049595: 0
#> 2049596: 1
#> 2049597: 0
Here we have specified the reference_date
and
report_date
columns, the by
columns (which are
used to group the data), and the max_delay
which is the
maximum delay between the reference date and the report date (note that
the observed delay is longer than our specified delay and so is used
instead). The enw_linelist_to_incidence()
function will
then calculate the incidence by reference date and report date for each
group.
For this case study we will only consider data from the 1st of March
2020 onwards. As a first step, we filter the data to only include
observations from the 1st of May 2021 until the 1st of August 2021. We
do this using the enw_filter_report_dates()
and
enw_filter_reference_dates()
functions,
complete_germany_hosp <- germany_hosp |>
enw_filter_report_dates(latest_date = "2021-08-01") |>
enw_filter_reference_dates(earliest_date = "2021-05-01") |>
enw_complete_dates(missing_reference = FALSE) |>
enw_add_incidence()
Here we have also used enw_complete_dates()
to make sure
that we have complete data for all dates between the earliest and latest
dates. We have also used enw_add_incidence()
to add the
incidence column to the data.
Next we split the data into two parts, the data that was available at
the time and that will be used to fit the model
(rt_nat_germany
) and the data that was available
retrospectively about the same time period will be used to evaluate the
model (retro_nat_germany
). We again do this using the
enw_filter_report_dates()
and
enw_filter_reference_dates()
functions as follows:
rt_nat_germany
) by
filtering the data to only include reported observations from the 1st of
May 2021 until the 1st of July 2021;rt_germany <- complete_germany_hosp |>
enw_filter_report_dates(latest_date = "2021-07-01")
rt_germany
#> report_date reference_date confirm new_confirm delay
#> <IDat> <IDat> <int> <int> <int>
#> 1: 2021-05-01 2021-05-01 234 234 0
#> 2: 2021-05-02 2021-05-01 352 118 1
#> 3: 2021-05-03 2021-05-01 391 39 2
#> 4: 2021-05-04 2021-05-01 464 73 3
#> 5: 2021-05-05 2021-05-01 507 43 4
#> ---
#> 1949: 2021-06-30 2021-06-29 31 11 1
#> 1950: 2021-07-01 2021-06-29 35 4 2
#> 1951: 2021-06-30 2021-06-30 20 20 0
#> 1952: 2021-07-01 2021-06-30 37 17 1
#> 1953: 2021-07-01 2021-07-01 20 20 0
retro_germany
) by
filtering the data to only include observations with reference dates
(i.e. they may have been reported anytime up to the 1st of August 2021)
from the 1st of May 2021 until the 1st of July 2021;retro_germany <- complete_germany_hosp |>
enw_filter_reference_dates(latest_date = "2021-07-01")
retro_germany
#> report_date reference_date confirm new_confirm delay
#> <IDat> <IDat> <int> <int> <int>
#> 1: 2021-05-01 2021-05-01 234 234 0
#> 2: 2021-05-02 2021-05-01 352 118 1
#> 3: 2021-05-03 2021-05-01 391 39 2
#> 4: 2021-05-04 2021-05-01 464 73 3
#> 5: 2021-05-05 2021-05-01 507 43 4
#> ---
#> 3871: 2021-07-28 2021-07-01 57 0 27
#> 3872: 2021-07-29 2021-07-01 57 0 28
#> 3873: 2021-07-30 2021-07-01 57 0 29
#> 3874: 2021-07-31 2021-07-01 57 0 30
#> 3875: 2021-08-01 2021-07-01 57 0 31
Finally we can create a dataset that contains the latest data
available for each reference date. We do this using the
enw_latest_data()
function,
latest_germany_hosp <- retro_germany |>
enw_latest_data()
head(latest_germany_hosp, n = 10)
#> reference_date report_date confirm new_confirm delay
#> <IDat> <IDat> <int> <int> <int>
#> 1: 2021-05-01 2021-08-01 1007 0 92
#> 2: 2021-05-02 2021-08-01 781 0 91
#> 3: 2021-05-03 2021-08-01 467 0 90
#> 4: 2021-05-04 2021-08-01 823 0 89
#> 5: 2021-05-05 2021-08-01 1028 0 88
#> 6: 2021-05-06 2021-08-01 1016 0 87
#> 7: 2021-05-07 2021-08-01 892 0 86
#> 8: 2021-05-08 2021-08-01 822 0 85
#> 9: 2021-05-09 2021-08-01 561 0 84
#> 10: 2021-05-10 2021-08-01 364 0 83
Before we define, or fit, a model we should visualise the data to get
an idea of the trends in the data and its reporting structure. There is
currently no function in epinowcast
to visualise the data,
but we can use the ggplot2
package to do this manually as
follows,
gh_vis_cohorts <- copy(retro_germany)[
,
report_date := fcase(
report_date <= as.Date("2021-05-15"), as.Date("2021-05-15"),
report_date <= as.Date("2021-06-01"), as.Date("2021-06-01"),
report_date <= as.Date("2021-06-15"), as.Date("2021-06-15"),
report_date <= as.Date("2021-07-01"), as.Date("2021-07-01"),
report_date <= as.Date("2021-07-15"), as.Date("2021-07-15"),
report_date <= as.Date("2021-08-01"), as.Date("2021-08-01")
) |>
factor(levels = rev(c(
"2021-05-15", "2021-06-01", "2021-06-15", "2021-07-01",
"2021-07-15", "2021-08-01"
)))
]
gh_vis_cohorts_by_reference <- gh_vis_cohorts[,
.(confirm = sum(new_confirm)),
by = .(reference_date)
]
gh_vis_cohorts_by_ref_rep <- gh_vis_cohorts[,
.(confirm = sum(new_confirm)),
by = .(reference_date, report_date)
]
gh_vis_cohorts_by_ref_rep |>
ggplot() +
aes(
x = reference_date, y = confirm, fill = report_date, group = report_date
) +
geom_col(position = "stack", alpha = 1, col = "grey") +
geom_vline(
aes(xintercept = as.Date(report_date)),
linetype = 2, alpha = 0.9
) +
scale_y_continuous(labels = \(x)(scales::comma(x, accuracy = 1))) +
scale_fill_brewer(
palette = "Blues", aesthetics = c("color", "fill")
) +
theme_bw() +
labs(
x = "Date of positive test",
y = "Hospitalised cases by date of positive test",
fill = "Report date"
) +
guides(fill = guide_legend(reverse = TRUE)) +
theme(legend.position = "bottom")
The epinowcast
package provides a flexible framework for
modelling the incidence of infections and subsequent measures such as
hospitalisations that are themselves reported with some delay. The
framework is based on a generative model that describes the hypothesised
process by which the data are generated. The generative model is then
used to define a likelihood function that describes the probability of
observing the data given the model parameters. The likelihood function
is then combined with prior information used to fit the model parameters
to the data. In the following sections we describe the generative model
for the incidence of infections and hospitalisations. If you would like
to know more about the general framework see the model
vignette.
The first part of the generative process we consider is the expected number of hospitalisations by date positive test. This is the number of hospitalisations that we would expect to observe if we had perfect reporting of hospitalisations and there was no unmodelled variance. This part of the model is also decomposed into three parts: the expected number of infections, the delay from infection to positive test, and the probability of hospitalisation given an infection. These three parts are combined to give the expected number of hospitalisations by date positive test. We describe each of these parts in turn.
We model expected infections as a function of the effective (instantaneous) reproduction number and previous infections weighted by the generation time using the renewal equation[1,2]. This corresponds to the commonly used Susceptible-Exposed-Infected-Recovered (SEIR) model when the appropriate generation time is specified[3].
We model the instantaneous reproduction number (Rt) on the log scale as a weekly random walk as follows
where r0 is the intercept, and $\sum_{i = 0}^{t \% 7} \epsilon_i$ is a weekly random walk.
We choose this model because it is flexible, can be used to model a range of different scenarios, and is parsimonious. For example, if we set σϵ = 0 then the model reduces to a constant reproduction number. If we set σϵ > 0 then the model allows for changes in the reproduction number over time. Other sensible choices for the model include a random walk with a longer time scale (e.g. monthly) or shorter time scale (e.g. daily), a random walk with a drift term, or a random walk with a drift term and a seasonal component.
We model the expected number of infections/latent notifications
(λl) using
a renewal process[1,2]. This model is a generalisation
of the default epinowcast
model and implies that current
infections/notifications are dependent on past infections/notifications
based on a kernel. This kernel is defined by the generation time
distribution, the probability distribution of the time between infection
and subsequent infectiousness, and the previously defined model for the
effective reproduction number. The model is defined as follows,
We assume the generation time (g) is a gamma distribution with mean 4 days and a standard deviation of 3 days. To initialise the model we assume that the first P latent notifications are log-normally distributed with mean μg, tl and standard deviation σg, tl. This is equivalent to assuming that the first P latent notifications are independent and identically distributed. The mean of the log-normal distribution for each group is the log of the latest reported case count for the first reference date for that group scaled by the sum of the latent reporting delay. The standard deviation is assumed to be 1.
In many settings, such as in this case study, there may be additional
reporting delays on top of those that are directly observed in the data,
and therefore “nowcastable”, a common example is the delay from exposure
to symptom onset. For these settings we support modelling “latent”
reporting delays as a convolution of the underlying expected counts with
the potential for these delays to vary over time and by group. This
implementation is similar to that implemented in EpiNow2
and epidemia
as well as other similar models[4–7]. In addition to this, we
support modelling ascertainment through the use of improper probability
mass functions (i.e. by not enforcing a sum to 1 constraint) and
inferring ascertainment where possible (for example day of the week
reporting patterns).
Here we assume an infection to hospitalisation ratio of 2% modified by a random effect for the day of the week to account for within week reporting periodicity. We also assume a lognormal reporting delay with a mean of 5 days and a standard deviation of 2 days, representing the delay from infection to positive test. We assume that the reporting delay is the same for all days of the week. This model is defined as follows,
Where νi is a random effect for each day of the week which represents reporting periodicity.
epinowcast::enw_expectation()
We first need to define the weekly random walk model for the effective reproduction number. We can do this using the formula interface as follows,
Next we define the generation time distribution. As already discussed, we assume a gamma distribution with mean 4 days and a standard deviation of 3 days. These summary parameters first need to be transformed to the shape and scale parameters used for the gamma distribution. We then need to account for daily censoring and finally normalise the probability mass function (PMF) to account for right truncation.
# first transform mean and sd to shape and scale
gamma_mean <- 4
gamma_sd <- 3
gamma_shape <- gamma_mean^2 / gamma_sd^2
gamma_scale <- gamma_sd^2 / gamma_mean
# then discretise the distribution
gt_pmf <- simulate_double_censored_pmf(
max = 15, shape = gamma_shape, scale = gamma_scale, fun_dist = rgamma
) |>
# and normalise it to account for right truncation
(\(x) x / sum(x))()
plot(gt_pmf)
Here and for other PMFs we set the maximum delay to be 15 days as this gives good coverage of the majority of the distribution and limits the computational burden of fitting the model.
Now we define the latent reporting delay in a similar way remembering we specified a lognormal distribution with a mean of 5 days and a standard deviation of 2 days. We again also need to account for daily censoring and normalise the PMF to account for right truncation.
lgn_mean <- 5
lgn_sd <- 2
meanlog <- log(lgn_mean^2 / sqrt(lgn_sd^2 + lgn_mean^2))
sdlog <- sqrt(log(1 + lgn_sd^2 / lgn_mean^2))
# then discretise the distribution
reporting_pmf <- simulate_double_censored_pmf(
max = 15, meanlog = meanlog, sdlog = sdlog, fun_dist = rlnorm
) |>
# normalise it to account for right truncation
(\(x) x / sum(x))() |>
# and scale it by the infection to hospitalisation ration (2%)
(\(x) x * 0.02)()
plot(reporting_pmf)
The last part of the model to define is the remaining part of the ascertainment model for expected hospitalisations. As we have already defined the base ascertainment ratio (i.e. the 2% infection to hospitalisation ratio) we only need to define the day of the week random effect. We can do this using the formula interface as follows,
Finally, we can combine these four parts of the model to define the
expected hospitalisations by date of positive test. We do this using the
epinowcast::enw_expectation()
function as follows,
Given case counts both by date of reference and by date of report, we can estimate the reporting delay distribution directly and jointly with the underlying process model, rather than relying on external estimates from other sources (though we may want to account for external information in our priors).
We consider the reporting delay to follow a LogNormal(μd, σd) distribution, with parameters
This distribution is discretised into daily probabilities pt, d
and adjusted for the maximum delay, see
vignette("distributions")
for details.
Expected notifications by date of positive test (t) and reporting date can now be found by multiplying expected final notifications by date of positive test for each t with the probability of reporting for each day of delay (pt, d). We assume a Poisson observation model and produce a nowcast of final observed hospitalisations at each reference date by summing posterior estimates for unobserved notification and observed notifications for that reference date.
Before fitting the model we have just defined we need to preprocess
the data in order for it to be in the correct format to work with
epinowcast
and to produce metadata that describes aspects
of the data we use in the model (for example the maximum delay, the
number of groups, and the number of observations in each group). We do
this using the enw_preprocess_data()
function as follows
for both the real-time and retrospective datasets. Note that we set
max_delay = 30
to constrain the modelled maximum delay to
30 days. This is a pragmatic choice to ensure that the model can be fit
in a reasonable amount of time, but we could also set this to be longer
if we wanted to and if the data suggested a longer delay was
possible.
rt_germany_pobs <- enw_preprocess_data(rt_germany, max_delay = 30)
rt_germany_pobs
#> obs new_confirm latest
#> <list> <list> <list>
#> 1: <data.table[1425x8]> <data.table[1425x9]> <data.table[62x8]>
#> missing_reference reporting_triangle metareference
#> <list> <list> <list>
#> 1: <data.table[0x4]> <data.table[62x32]> <data.table[62x8]>
#> metareport metadelay max_delay time snapshots by
#> <list> <list> <num> <int> <int> <list>
#> 1: <data.table[91x10]> <data.table[30x5]> 30 62 62
#> groups max_date timestep
#> <int> <IDat> <char>
#> 1: 1 2021-07-01 day
and do the same for the retrospective data
retro_germany_pobs <- enw_preprocess_data(retro_germany, max_delay = 30)
retro_germany_pobs
#> obs new_confirm latest
#> <list> <list> <list>
#> 1: <data.table[1860x8]> <data.table[1860x9]> <data.table[62x8]>
#> missing_reference reporting_triangle metareference
#> <list> <list> <list>
#> 1: <data.table[0x4]> <data.table[62x32]> <data.table[62x8]>
#> metareport metadelay max_delay time snapshots
#> <list> <list> <num> <int> <int>
#> 1: <data.table[120x10]> <data.table[30x5]> 30 62 62
#> by groups max_date timestep
#> <list> <int> <IDat> <char>
#> 1: 1 2021-07-30 day
epinowcast
modelBefore we are ready to fit the model we need to first specify some
fitting options for using cmdstanr
(more information on
these options can be found in the cmdstanr
documentation).
We use 2 chains with 2 threads per chain (so using 4 CPU cores in total)
500 warmup samples and 1000 iterations per chain. As this is a complex
model we set adapt_delta
to 0.98 (the default is 0.8) so
that the Hamiltonian Monte Carlo sampler can explore the posterior
distribution more efficiently. For the same reason, we have also set the
max_treedepth
to 15 (the default is 10). Finally, we set
save_warmup
to FALSE
to save space (we don’t
need the warmup samples for this analysis), and pp
to
TRUE
so that we can use posterior predictive checks to
assess the model fit.
fit_module <- partial(enw_fit_opts,
chains = 2,
parallel_chains = 2,
threads_per_chain = 2,
iter_warmup = 500,
iter_sampling = 1000,
adapt_delta = 0.98,
max_treedepth = 15,
save_warmup = FALSE,
pp = TRUE,
show_messages = FALSE, # set this to TRUE to show fitting messages
refresh = 0 # remove this to show fitting progress
)
If you have more than 4 CPU cores available you can increase the
number of threads per chain to make use of these additional cores. For
example if you have 8 CPU cores available you could set
threads_per_chain = 4
to use all 8 cores (as there are 2
chains).
We also need to compile the model using cmdstanr
before
fitting it (we only have to do this once after installing the package).
In order for this line to work you should have followed the installation
guide in the README. This
will take around a minute to run when called for the first time.
Now we are ready to bring together all the modules we have specified, combine them with the model we have just compiled, and fit our synthetic data. We first fit to the data that would have been available in real-time,
germany_nowcast <- epinowcast(
data = rt_germany_pobs,
expectation = expectation_module(data = rt_germany_pobs),
reference = reference_module(data = rt_germany_pobs),
obs = obs_module(data = rt_germany_pobs),
fit = fit_module(),
model = epinowcast_model
)
#> Error in expectation_module(data = rt_germany_pobs): object 'rt_formula' not found
Fitting this model with the default options will take around 5 minutes to run on a laptop with 4 CPU cores. If you have more CPU cores available you can reduce the time it takes to fit the model by increasing the number of threads per chain (see the fitting options section above).
and then on the retrospectively observed data.
retro_germany_nowcast <- epinowcast(
data = retro_germany_pobs,
expectation = expectation_module(data = retro_germany_pobs),
reference = reference_module(data = retro_germany_pobs),
obs = obs_module(data = retro_germany_pobs),
fit = fit_module(),
model = epinowcast_model
)
#> Error in expectation_module(data = retro_germany_pobs): object 'rt_formula' not found
You may see a number of warning messages from cmdstanr
when fitting the model. These generally occur when the model is being
initialised and are due to this process not currently being optimised in
epinowcast
. The model should still fit correctly unless
these warnings continue into the fitting process (outside of the warmup
phase). This is something we are actively working on improving in future
versions of epinowcast
.
We first plot the nowcast based on real-time data. Effectively here we are trying to use our model to estimate the hospitalisations that will ultimately be reported based on the data that is currently available. We can compare this to the most recent observations to see how well the model is doing.
Internally, this plotting function (see
?plot.epinowcast()
) sums observed data and posterior
estimates for unobserved data to produce a nowcast of the number of
hospitalisations that will ultimately be reported.
In this instance, the model has done a good job of estimating the number of hospitalisations that will ultimately be reported based on the data that is currently available. This is not always the case, and the model can sometimes over or under estimate the number of hospitalisations that will ultimately be reported.
An important thing to note here is that this model is clearly not perfect with more complete data being less well estimated by the model. This likely indicates some misspecification in the reporting delay model which could be fixed by making it more complex or even by making it entirely non-parametric (vs being based on a LogNormal).
We can also plot the nowcast based on the retrospective data. As this plot uses observed data where available this is effectively plotting observed data after a 30 day delay (as this was the maximum we specified) compared to data that will ultimately be observed. This can be used to get a feel for how well specified our maximum delay is. Here we see that a small number of hospitalisations will be reported after 30 days. This will impact the performance of even an ideal model as it will not be able to predict these hospitalisations and so should be accounted for in the model evaluation.
To better understand the fit of the model to data we can instead plot
the posterior predictions for the observed data. This is effectively the
same as the plot above but instead of plotting the posterior predictions
for the unobserved data we plot the posterior predictions for the
observed data. This can be used to get a feel for how well the model is
able to predict the observed data. Rather than using the build in plot
function (by changing type = "posterior_prediction"
) we
instead use enw_plot_pp_quantiles()
so we can control the
number of references dates we plot (as otherwise the plot will be quite
overwhelming!).
plot_select_pp_dates <- function(nowcast, dates) {
nowcast |>
summary(type = "posterior_prediction") |>
(\(x) x[reference_date %in% dates])() |>
enw_plot_pp_quantiles() +
facet_wrap(vars(reference_date), scales = "free")
}
plot_select_pp_dates(
retro_germany_nowcast,
as.Date(c("2021-05-01", "2021-05-14", "2021-06-01", "2021-07-01"))
)
It is clear from this plot that for these example dates the model is in the main able to predict the observed data well. However, there are also some aspects of the data that the model is less well able to predict, in particular the apparent periodicity in the reporting delay. This is likely due to the model not being able to account for the day of the week reporting periodicity. This could be fixed by adding a day of the week random effect to the reporting delay model which we demonstrate in the package README.
A key output of the model is the posterior distribution of the effective reproduction number. We can plot this for both the real-time and retrospective data to see how the estimates change as more data becomes available. Ideally, we would hope that the real-time estimates would overlap with the retrospective estimates. This would indicate that our model is able to accurately estimate hospitalisations that will be ultimately reported based on those that have already been reported.
get_rt_posterior <- function(nowcast, expectation = expectation_module) {
rt <- enw_posterior(nowcast$fit[[1]], variables = "r")
cols <- c("mean", "median", "q5", "q20", "q80", "q95")
rt[, (cols) := lapply(.SD, exp), .SDcols = cols]
rt <- cbind(
expectation(data = nowcast)$data_raw$r[, .(date)], rt
)
return(rt)
}
rt <- rbind(
get_rt_posterior(germany_nowcast)[, Data := "Real-time"],
get_rt_posterior(retro_germany_nowcast)[, Data := "Retrospective"]
)
ggplot(rt) +
aes(x = date, col = Data, fill = Data) +
geom_line(aes(y = median), linewidth = 1, alpha = 0.6) +
geom_line(aes(y = mean), linetype = 2) +
geom_ribbon(aes(ymin = q5, ymax = q95), alpha = 0.2, linewidth = 0.2) +
geom_ribbon(aes(ymin = q20, ymax = q80, col = NULL), alpha = 0.2) +
geom_hline(yintercept = 1, linetype = 2) +
scale_y_continuous(trans = "log") +
scale_fill_brewer(palette = "Dark2", aesthetics = c("color", "fill")) +
theme_bw() +
labs(
x = "Date of infection",
y = "Effective reproduction number"
) +
theme(legend.position = "bottom")
Here we see good agreement between the real-time and retrospective estimates of the effective reproduction number for dates earlier in time as we would expect given that this data should be near complete in both models. Closer to the date of estimation we see that the real-time model does a good job estimating the trend shown by the retrospective model but has a tendency to overpredict. This could be caused by changes in the reporting delay over time, by the model not being able to account for the day of the week reporting periodicity, or by the model not being able to account for the change in the underlying process over time.
We can also plot the posterior distribution of the delay from testing positive to hospitalisation. This is the delay that is used to estimate the number of hospitalisations that will ultimately be reported based on the number of positive tests that have been reported. We can plot this for both the real-time and retrospective data to see how the estimates change as more data becomes available. Ideally, we would hope that the real-time estimates would overlap with the retrospective estimates. This would indicate that our model is able to accurately estimate hospitalisations that will be ultimately reported based on those that have already been reported.
extract_epinowcast_cdf <- function(nowcast) {
draws <- nowcast |>
(\(x)
x$fit[[1]]$draws(variables = c("refp_mean", "refp_sd"), format = "df")
)() |>
as.data.table()
draws[
,
cdf := purrr::map2(
`refp_mean[1]`, `refp_sd[1]`,
~ data.table(
delay = 1:30, cdf = plnorm(1:30, .x, .y) / plnorm(30, .x, .y)
)
)
]
draws <- draws[, rbindlist(cdf)]
draws <- draws[,
.(
mean = mean(cdf),
lower_90 = quantile(cdf, probs = 0.05),
upper_90 = quantile(cdf, probs = 0.95)
),
by = "delay"
]
}
nowcast_cdf <- list(
"Real-time" = germany_nowcast,
"Retrospective" = retro_germany_nowcast
) |>
map(extract_epinowcast_cdf) |>
rbindlist(idcol = "Data")
ggplot(nowcast_cdf) +
aes(x = delay, y = mean, col = Data, fill = Data) +
geom_line(size = 1.1, alpha = 0.7) +
geom_ribbon(
aes(ymin = lower_90, ymax = upper_90),
alpha = 0.25
) +
theme_bw() +
theme(legend.position = "bottom") +
scale_fill_brewer(palette = "Dark2", aesthetics = c("color", "fill")) +
guides(
fill = guide_legend(nrow = 1),
col = guide_legend(nrow = 1)
) +
labs(
x = "Delay between positive test and hospitalisation",
y = "Cumulative density function of the reporting distribution"
)
The estimates are a fairly good match indicating that the delay is being well estimated by the real-time model compared to the retrospective model and that the delay is unlikely to be changing rapidly over time in this example (as otherwise we would expect the estimated delays to diverge over time).
We can also plot the posterior predictions for the number of expected hospitalisations. This is the number of hospitalisations that would be reported if there was no observation error. We can compare this to the number of hospitalisations that are ultimately reported to see how well the model is doing.
Retrospectively, we would expect the model to fit very well to this data with any differences being due to the observation error. If this is not the case it indicates we may need to revisit our expected hospitalisation model specification to make it closer to the generative process of the data we have.
In real-time, we would expect the model to fit less well to this data as it is trying to estimate the number of hospitalisations that will ultimately be reported based on the data that is currently available (i.e. it also has to account for the reporting delay which is now partially unobserved).
get_expected_infections <- function(nowcast, expectation = expectation_module) {
exp_cases <- enw_posterior(
nowcast$fit[[1]],
variables = "exp_lobs"
)
cols <- c("mean", "median", "q5", "q20", "q80", "q95")
exp_cases[, (cols) := lapply(.SD, exp), .SDcols = cols]
exp_cases <- cbind(
expectation(data = nowcast)$data_raw$observation,
exp_cases
)
return(exp_cases)
}
exp_cases <- rbind(
get_expected_infections(germany_nowcast)[, Data := "Real-time"],
get_expected_infections(retro_germany_nowcast)[, Data := "Retrospective"]
)
exp_cases <- enw_latest_data(germany_hosp)[, date := reference_date][
exp_cases,
on = "date"
]
ggplot(exp_cases) +
aes(x = date, fill = Data, col = Data) +
geom_point(aes(y = confirm), col = "Black") +
geom_line(aes(y = median), linewidth = 1, alpha = 0.6) +
geom_line(aes(y = mean), linetype = 2) +
geom_ribbon(aes(ymin = q5, ymax = q95), alpha = 0.2, linewidth = 0.2) +
geom_ribbon(aes(ymin = q20, ymax = q80, col = NULL), alpha = 0.2) +
theme_bw() +
labs(
x = "Date of positive test",
y = "Expected hospitalisations"
) +
scale_fill_brewer(palette = "Dark2", aesthetics = c("color", "fill")) +
theme(legend.position = "bottom")
As expected the retrospective estimates fit very well to the data. The real-time estimates are slightly less good, but still capture the recent change in trend in the data which would not be possible without a nowcast (as estimates would be biased towards underpredicting due to the delay in reporting leading to right truncation).
In this case study we have shown how to use epinowcast
to estimate the effective reproduction number and the expected number of
latent and reported cases from right truncated data.
We have also shown how to use the package to perform retrospective and real-time nowcasts which can be a useful way of understanding the real-time performance of the model.
epinowcast
.epinowcast
.epinowcast
developed by members of the
epinowcast
community. It is a flexible toolset for
real-time analysis of infectious diseases. It is less complex than
epinowcast
with a focus on robust default settings but is
also less flexible. Over time its functionality will be incorporated
into epinowcast
.EpiNow2
, in a similar way to epinowcast
,
but is potentially more difficult to use. It also generally has less
functionality for dealing with delays than epinowcast
.
However, as it is an extension of rstanarm
it comes with a
number of useful features and a familiar interface for users of
rstanarm