Model definition and implementation

Introduction

The epinowcast package aims to be a modular toolbox for real-time infectious disease surveillance both in outbreak and routine contexts. As such we provide a modular modelling framework that is optimised for a range of common surveillance tasks whilst maintaining flexibility and supporting user extension.

We provide a flexible semi-parametric model for the underlying generative process similar to that implemented in other real-time infectious disease modelling packages[1,2]. This optionally includes a renewal process[3,4] and latent reporting process[1,5,6]. Combined with the appropriate generation time distribution this approach has been shown to correspond to a Susceptible-Exposed-Infected-Recovered (SEIR) model[7] with the addition of reporting delays. However, our default model contains minimal mechanism in order to more flexibly fit highly informative data.

Our nowcasting approach is an extension of that proposed by Günther et al.[8] which was itself an extension of the model proposed by Höhle and Heiden[9] and implemented in the surveillance R package[10]. Compared to the model proposed in Günther et al.[8], epinowcast adds support for jointly nowcasting multiple related datasets, a flexible formula interface allowing for the specification of a large range of models, and an optional parametric assumption for the underlying reporting delay.

We also support flexible joint modelling of missing data by assuming that the reporting delay is consistent between reported and unreported observations following the methodology of Lison et al.[11].

Our modelling framework is implemented in the stan probabilistic programming language via cmdstanr[12,13] with a focus on computational efficiency and robustness.

In the following sections we outline our modelling methodology, current feature set stratified by module, and highlight implementation details. For each module we present both our default implementation as well as the more generic framework we support. Note that extensions to this methodology are warmly welcomed with our community site being a good first point of contact.

Decomposition into expected final notifications and report delay components

We are concerned with outcomes that occur at a time of reference (e.g., date of symptom onset or test for a disease) that are reported only with a delay, at the time of report (e.g., the date onsets are entered into a central database and so become available for analysis). We assume that these times are measured in discrete time steps, usually of a day (in which case the times are dates).

We follow the approach of Höhle and Heiden[9] and consider the distribution of notifications (ng, t, d) by date of reference (t) and reporting delay (d) conditional on the final observed count Ng, t for each dataset (g) such that,

where D represents the maximum delay between date of reference and time of report which in theory could be infinite but in practice we set to a finite value in order to make the model identifiable and computationally feasible. For each t and g these notifications are assumed to be drawn from a multinomial distribution with Ng, t trials and a probability vector (pg, t, d) of length D. The aim of nowcasting is to predict the final observed counts Ng, t given information available up to time t. We do this by estimating the components of this probability vector jointly with the expected number of final notifications (λg, t = 𝔼[Ng, t]) in dataset g at time t.

An alternative approach would be to consider each ng, t, d independently at which point the model can be defined as a regression that can be fit using standard software with the appropriate observation model and adjustment for reporting delay (i.e., it becomes a Poisson or Negative Binomial regression). An implementation of this approach is available in Bastos et al.[14]. A downside of this simplified approach is that reporting is not conditionally dependent which may make specifying models for complex reporting distributions difficult.

Expected final notifications

Default model

Here we follow the approach of Günther et al.[8] and specify the model for expected final notifications as a first order random walk. This simple model is highly flexible and so a good fit for nowcasting problems where the data is highly informative.

where Ng0, the first time point for expected observations in dataset d, is assumed to have been completely observed.

Generalised model

In settings where data is sparse or where users want to understand the underlying generative process our flexible default model is likely not a good choice. In these settings our generic model offers a range of options that are context specific. Our generic model is currently based on a renewal process[3,4] with additional latent reporting delays[1,5,6]. As previously noted[7], this corresponds to the commonly used Susceptible-Exposed-Infected-Recovered (SEIR) model when appropriate generation time is specified[7]

Instantaneous reproduction number/growth rate

We model the instantaneous reproduction number (Rt) on the log scale (though support for other link functions is planned). When the generation time is fixed to be a day this can be interpreted as the instantaneous growth rate (rt) defined as the difference in the log of the expected number of final notifications between time t and t − 1.

where r0 is the optional intercept, Xr is the design matrix for fixed effects (βf, r), and Zr is the design matrix for random effects (βr, r. Within this specification the default model can be specified as a random effect on the day with no intercept. Alternative specifications that may be of interest include a weekly random walk (specified as ~ 1 + rw(week)), a piecewise linear model (specified as ~ 1 + day:week), and a group specific random effect (specified as ~ 1 + (1 | .group)).

Latent infections/notifications

We model the expected number of infections/latent notifications (λl) using a renewal process[3,4]. This model is a generalisation of the default model and can be used to model the expected number of latent notifications in a setting where the generation time is not fixed to be a day. It implies that current infections/notifications are dependent on past infections/notifications based on a kernel (usually interpreted as the generation time or serial interval). An instantaneous daily growth rate model can be recovered by setting the generation time to be fixed at 1 day. The model is defined as follows,

Where Gg(p, t − p) is the probability of an infection p days after infection t − p days ago for group g, and P is the maximum generation time. To initialise the model we assume that the first P latent notifications are log-normally distributed with mean μg, tl and standard deviation σg, tl. This is equivalent to assuming that the first P latent notifications are independent and identically distributed. The mean of the log-normal distribution for each group is the log of the latest reported case count for the first reference date for that group scaled by the sum of the latent reporting delay. The standard deviation is assumed to be 1. Both of these assumptions can be altered by the user.

Latent reporting delay and ascertainment

In some settings there may be additional reporting delays on top of those that are directly observed in the data, and therefore “nowcastable”, a common example is the delay from exposure to symptom onset. For these settings we support modelling “latent” reporting delays as a convolution of the underlying expected counts with the potential for these delays to vary over time and by group. This implementation is similar to that implemented in EpiNow2 and epidemia as well as other similar models[1,5,6,11]. In addition to this we support modelling ascertainment through the use of improper probability mass functions (i.e., by not enforcing a sum to 1 constraint) and inferring ascertainment where possible (for example day of the week reporting patterns).

Where νg, t is the inferred ascertainment and is modelled flexibly using an optional intercept (ν0), a design matrix (Xν) for fixed effects (βf, ν), and a design matrix (Zν) for random effects (βr, ν.

Delay distribution

Given case counts both by date of reference and by date of report, we can estimate the reporting delay distribution directly and jointly with the underlying process model, rather than relying on external estimates from other sources (though we may want to account for external information in our priors). In the following section, we describe our default parametric delay distribution model and its extension into a generic, highly flexible delay model based on discrete time-to-event modelling.

Default model

In our default model, we consider the reporting delay to follow a LogNormal(μd, σd) distribution, with parameters

This distribution is discretised into daily probabilities pg, t, d (for a case in group g with reference date t to be reported with delay d) and adjusted for the maximum delay, see vignette("distributions") for details.

Generalised model

We generalise this model in order to support a range of delay distributions as well as effects of the reporting process on the delay. Following the approach of Günther et al.[8] and others, we parameterise the delay probability (pg, t, d) via a discrete-time hazard model, i.e.,

where

hg, t, d = P(delay = d|delay ≥ d, Wg, t, d), is the so-called reporting hazard. For a case in group g with reference date t, the hazard hg, t, d states the probability of being reported with delay d, given that the case is not reported earlier. The hazard depends on a design matrix Wg, t, d, which encodes a baseline delay distribution and covariates that affect the reporting delay. We extend the model of Günther et al. by decomposing the hazard into three components,

  • Parametric baseline hazard γg, t, d: hazard derived from a parametric delay distribution, with parameters depending on covariates on the date of reference
  • Non-parametric reference date effect δg, t, d: effect on the hazard that depends on covariates on the date of reference
  • Non-parametric report date effect ϵg, t, d: effect on the hazard that depends on covariates on the date of report

Each component adds to the overall hazard through a regression with a logit link

where the maximum delay hazard is hg, t, D = 1, in order to enforce the assumption that all observations are reported within the specified maximum delay. In the following, we describe the parameterisation of the different components.

Parametric baseline hazard

The parametric baseline hazard γg, t, d for a case in group g with reference date t is modelled according to a certain discretised parametric probability distribution with parameters μg, t and υg, t. Currently, epinowcast supports four different parametric distributions: (i) log-normal (default), (ii) exponential, (iii) gamma, and (iv) log-logistic. The distributions are discretised and adjusted for an assumed maximum delay, see vignette("distributions") for details. The delay probabilities pg, t, d obtained from the discretised delay distribution are converted into hazards on the logit scale using

In the default case of the log-normal distribution, the parameters μg, t and υg, t represent the log mean and log standard deviation. Each parameter is defined using a (log-)linear model. The model consists of an intercept and a number of arbitrary, shared covariates, indexed by reference date. The covariates are multiplied by fixed (βf, i) and random (βr, i) coefficients (note that these can include auto-regressive terms), i.e.,

Non-parametric reference date effect δg, t, d and report date effect ϵg, t, d

In addition to parametric reporting effects there may also be non-parametric effects referenced by both reference and report dates. These are represented by the non-distributional logit hazard components for the date of reference and report, defined using an intercept (δ0) and arbitrary, shared covariates with fixed (βf, i) and random (βr, i) coefficients (note these can include auto-regressive terms).

All fixed (βf, i) and random (βr, i) coefficients have standard normal priors by default with standard half-normal priors for pooled standard deviations. For further implementation details see enw_reference() for delays linked to the date of reference, enw_report() for delays linked to the date of report.

Observation model and nowcast

Expected notifications by date of reference (t) and reporting delay can now be found by multiplying expected final notifications for each t with the probability of reporting for each day of delay (pg, t, d). We assume a negative binomial observation model, by default, with a joint overdispersion parameter (with a standard half normal prior on 1 over square root of the overdispersion[15]) and produce a nowcast of final observed notifications at each reference date by summing posterior estimates for unobserved notification and observed notifications for that reference date.

Where αg, t is the proportion of cases by reference date that will not report their reference date. By default this is not modelled and is set to zero , see the accounting for reported cases with a missing reference date section for further defaults. Other observation models such as the Poisson distribution are also supported. See the documentation enw_obs() for details.

In order to make best use of observed data when nowcasting we use observations where available and where they have not been reported for a given report and reference date we use the posterior prediction from the observation model above. This means that as nowcast dates become increasingly truncated they depend more on modelled estimates whereas when they are more complete the majority of the final count is known. Depending on your use case the posterior predictions alone may also be of interest.

Accounting for reported cases with a missing reference date

In real-world settings observations may be reported without a linked reference date. A common example of this is cases by date of symptom onset where report date is often known but onset date may not be. To account for this we support modelling this missing process by assuming that cases with a missing reference date have the same reporting delay distribution as cases with a known reference date and that processes that drive the probability of having a missing reference date (αg, t) are linked to the unknown date of reference rather than the date of report based on Lison et al.[11]. We model this probability flexibly on a logit scale as follows,

Where α0 represents the intercept, βf, α fixed effects, and βr, α random effects. To link with observations by date of report with a missing reference date (Mg, t) we convolve expected notifications with the probability of having a missing reference date and the probability of reporting on a given day as follows,

As for cases with known reference dates other observation models are supported. For further implementation details see enw_missing().

Implementation

The model is implemented in the probabilistic programming language stan and we use cmdstanr to interact with the model[12,13]. Optional within chain parallelisation is available across times of reference to reduce runtimes. Sparse design matrices have been used for all covariates to limit the number of probability mass functions that need to be calculated. epinowcast incorporates additional functionality written in R[16] to enable plotting nowcasts and posterior predictions, summarising nowcasts, and scoring them using scoringutils[17]. A flexible formula interface is provided to enable easier implementation of complex user specified models without interacting with the underlying code base. All functionality is modular allowing users to extend and alter the underlying model whilst continuing to use the package framework.

References

1. Abbott, S., Hellewell, J., Sherratt, K., Gostic, K., Hickson, J., Badr, H. S., DeWitt, M., Thompson, R., EpiForecasts, & Funk, S. (2020). EpiNow2: Estimate real-time case counts and time-varying epidemiological parameters. https://doi.org/10.5281/zenodo.3957489
2. Scott, J. A., Gandy, A., Mishra, S., Unwin, J., Flaxman, S., & Bhatt, S. (2020). Epidemia: Modeling of epidemics using hierarchical bayesian models. https://imperialcollegelondon.github.io/epidemia/
3. Fraser, C. (2007). Estimating individual and household reproduction numbers in an emerging epidemic. PLoS One, 2(8), e758. https://doi.org/10.1371/journal.pone.0000758
4. Cori, A., Ferguson, N. M., Fraser, C., & Cauchemez, S. (2013). A new framework and software to estimate time-varying reproduction numbers during epidemics. Am. J. Epidemiol., 178(9), 1505–1512. https://doi.org/10.1093/aje/kwt133
5. Abbott, S., Hellewell, J., Thompson, R. N., Sherratt, K., Gibbs, H. P., Bosse, N. I., Munday, J. D., Meakin, S., Doughty, E. L., Chun, J. Y., Chan, Y.-W. D., Finger, F., Campbell, P., Endo, A., Pearson, C. A. B., Gimma, A., Russell, T., Flasche, S., Kucharski, A. J., … CMMID COVID modelling group. (2020). Estimating the time-varying reproduction number of SARS-CoV-2 using national and subnational case counts. Wellcome Open Res., 5, 112. https://doi.org/10.12688/wellcomeopenres.16006.2
6. Bhatt, S., Ferguson, N., Flaxman, S., Gandy, A., Mishra, S., & Scott, J. A. (2020). Semi-Mechanistic bayesian modeling of COVID-19 with renewal processes. https://arxiv.org/abs/2012.00394
7. Champredon, D., Dushoff, J., & Earn, D. J. D. (2018). Equivalence of the Erlang-Distributed SEIR epidemic model and the renewal equation. SIAM J. Appl. Math., 78(6), 3258–3278. https://doi.org/10.1137/18M1186411
8. Günther, F., Bender, A., Katz, K., Küchenhoff, H., & Höhle, M. (2021). Nowcasting the COVID-19 pandemic in Bavaria. Biometrical Journal, 63(3), 490–502. https://doi.org/10.1002/bimj.202000112
9. Höhle, M., & Heiden, M. an der. (2014). Bayesian nowcasting during the STEC O104:H4 outbreak in Germany, 2011. Biometrics, 70(4), 993–1002. https://doi.org/10.1111/biom.12194
10. Meyer, S., Held, L., & Höhle, M. (2017). Spatio-temporal analysis of epidemic phenomena using the R package surveillance. Journal of Statistical Software, 77(11), 1–55. https://doi.org/10.18637/jss.v077.i11
11. Lison, A. (n.d.). Nowcast-transmission. Github.
12. Team, S. D. (2021). Stan modeling language users guide and reference manual, 2.28.1.
13. Gabry, J., & Češnovar, R. (2021). Cmdstanr: R interface to ’CmdStan’.
14. Bastos, L. S., Economou, T., Gomes, M. F. C., Villela, D. A. M., Coelho, F. C., Cruz, O. G., Stoner, O., Bailey, T., & Codeço, C. T. (2019). A modelling approach for correcting reporting delays in disease surveillance data. Statistics in Medicine, 38(22), 4363–4377. https://doi.org/10.1002/sim.8303
15. Team, S. D. (2020). Prior choice recommendations.
16. R Core Team. (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
17. Bosse, N. (2020). Scoringutils: A collection of proper scoring rules and metrics to assess predictions. https://github.com/epiforecasts/scoringutils