---
title: Introduction to rbi
author: Sebastian Funk
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to rbi}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
NOT_CRAN <- interactive() || identical(tolower(Sys.getenv("NOT_CRAN")), "true") # nolint
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette gives an introduction to using **rbi**. For the best viewing experience, use the [version on the rbi website](https://sbfnk.github.io/rbi/articles/introduction.html).

[rbi](https://github.com/sbfnk/rbi) is an `R` interface to [LibBi](https://libbi.org), a library for Bayesian inference with state-space models using high-performance computer hardware.

The package has been tested on macOS and Linux. It requires a working installation of **LibBi**. On macOS, this is most easily done using the `brew` command: install [Homebrew](https://brew.sh), then issue the following command (using a command shell, i.e. Terminal or similar):

```{sh, eval = FALSE}
brew install libbi
```

On Linux, follow the [instructions](https://github.com/lawmurray/LibBi/blob/master/INSTALL_LINUX.md) provided with LibBi.

If you have any trouble installing **LibBi** you can get help on the [LibBi Users](https://groups.google.com/forum/#!forum/libbi-users) mailing list.

The path to the `libbi` script can be passed as an argument to **rbi**; otherwise the package tries to find it automatically using the `which` Linux/Unix command.

If you just want to process the output from **LibBi**, then you do not need to have **LibBi** installed.

# Installation

The easiest way to install the latest stable version of **rbi** is via CRAN.

```{r, eval = FALSE}
install.packages("rbi")
```

Alternatively, the current development version can be installed using the `remotes` package:

```{r, eval = FALSE}
remotes::install_github("sbfnk/rbi")
```

# Loading the package

Use

```{r, eval = FALSE}
library("rbi")
```

```{r, echo = FALSE}
suppressPackageStartupMessages(library("rbi"))
```

to load the package.

# Getting started

The main computational engine and model grammar behind **rbi** is provided by **LibBi**. The [LibBi manual](https://libbi.org/docs/LibBi-Manual.pdf) is a good place to start for finding out everything there is to know about **LibBi** models and inference methods.

The **rbi** package mainly provides two classes: `bi_model` and `libbi`. The `bi_model` class is used to load, view and manipulate **LibBi** model files. The `libbi` class is used to run LibBi and perform inference.

The package also provides two methods for interacting with the [NetCDF](https://www.unidata.ucar.edu/software/netcdf/) files used by **LibBi**, `bi_read` and `bi_write`. Lastly, it provides a `get_traces` function to analyse Markov chain Monte Carlo (MCMC) traces using the [coda](https://cran.r-project.org/package=coda) package.

# The `bi_model` class

As an example, we consider a simplified version of the SIR model discussed in [Del Moral et al. (2014)](https://arxiv.org/abs/1405.4081). This is included with the **rbi** package and can be loaded with

```{r}
model_file <- system.file(package = "rbi", "SIR.bi")
sir_model <- bi_model(model_file) # load model
```

Other ways of implementing a (deterministic or stochastic) SIR model can be found in the [collection of SIR models for LibBi](idd_models.html), where you will also find how to load them into a `bi_model` object such as `sir_model`. Feel free to run the commands below with different versions of the model.
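A model does not have to be read from a file: a `bi_model` can also be constructed from a character vector of model code via the `lines` argument of `bi_model`. The fragment below is a minimal illustrative sketch (a made-up one-state decay model, not one of the SIR models discussed here):

```{r, eval = FALSE}
# a minimal sketch: build a bi_model from a character vector of
# LibBi model code instead of a file; the model is purely
# illustrative (exponential decay of a single state variable)
decay_model <- bi_model(lines = c(
  "model decay {",
  "  param p_mu",
  "  state N",
  "  sub parameter {",
  "    p_mu ~ uniform(0, 1)",
  "  }",
  "  sub initial {",
  "    N <- 100",
  "  }",
  "  sub transition {",
  "    N <- N * exp(-p_mu)",
  "  }",
  "}"
))
```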
The `sir_model` object now contains the model, which can be displayed with

```{r}
sir_model
```

A part of the model can be shown with, for example,

```{r}
sir_model[35:38]
```

or, for example,

```{r}
get_block(sir_model, "parameter")
```

To get a list of certain variables, you can use the `var_names` function. For example, to get a list of states, you can use

```{r}
var_names(sir_model, type = "state")
```

There are also various methods for manipulating a model, such as `remove_lines`, `insert_lines` and `replace_all`. The `fix` method fixes a variable to one value. This can be useful, for example, to run the deterministic equivalent of a stochastic model for testing purposes:

```{r}
det_sir_model <- fix(sir_model, n_transmission = 0, n_recovery = 0)
```

To get documentation for any of these methods, use the links in the documentation for `bi_model`.

# Generating a dataset

First, let's create a data set from the SIR model.

```{r, eval = NOT_CRAN}
set.seed(1001912)
sir_data <- generate_dataset(sir_model, end_time = 16 * 7, noutputs = 16)
```

This simulates the model a single time from time 0 until time 16\*7 (say, 16 weeks with a daily time step), producing 16 outputs (one a week). Note that we have specified a random seed to make this document reproducible. If you omit the `set.seed` command or set it to a different number, the results will be different even when run with the same set of commands. Also note that *LibBi* compiles the model code only the first time it is run. If you run the command above a second time, it should run much faster.

The `generate_dataset` function returns a `libbi` object:

```{r, eval = NOT_CRAN}
sir_data
```

The generated dataset can be viewed and/or stored in a variable using `bi_read`:

```{r, eval = NOT_CRAN}
dataset <- bi_read(sir_data)
```

The `bi_read` function takes the name of a NetCDF file or a `libbi` object (in which case it locates the output file) and stores the contents in a list of data frames or vectors, depending on the dimensionality of the contents. Note that, if no `working_folder` is specified, the model and output files will be stored in a temporary folder.

```{r, eval = NOT_CRAN}
names(dataset)
dataset$p_R0
dataset$Incidence
```

We can visualise the generated incidence data with

```{r, eval = NOT_CRAN}
plot(dataset$Incidence$time, dataset$Incidence$value)
lines(dataset$Incidence$time, dataset$Incidence$value)
```

# The `libbi` class

The `libbi` class manages the interaction with **LibBi**, such as sampling from the prior or posterior distribution. For example, the `sir_data` object above is of type `libbi`:

```{r, eval = NOT_CRAN}
class(sir_data)
```

The `generate_dataset` function used above is one particular way of creating a `libbi` object, intended only for generating test data from a model. The standard way of creating a `libbi` object for Bayesian inference is using the `libbi` command

```{r}
bi <- libbi(sir_model)
```

This initialises a `libbi` object with the model created earlier and assigns it to the variable `bi`.

```{r}
class(bi)
```

Let's sample from the prior of the SIR model:

```{r, eval = NOT_CRAN}
bi_prior <- sample(
  bi, target = "prior", nsamples = 1000, end_time = 16 * 7, noutputs = 16
)
```

This step calls **LibBi** to sample from the prior distribution of the previously specified model, generating 1,000 samples and each time running the model for 16 * 7 = 112 time steps and writing 16 outputs (i.e., every 7 time steps). **LibBi** parses the model, creates C++ code, compiles it and runs the model.
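At this point we can already take a quick look at what has been sampled. As a minimal sketch (using the `summary` method, which is discussed in more detail below), a numerical summary of the parameters drawn from the prior can be obtained with:

```{r, eval = NOT_CRAN}
# a quick numerical summary of the parameters sampled from the prior;
# the summary method is discussed in more detail below
summary(bi_prior)
```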
If the model is run again, it should do so much more quickly because it will use the already compiled C++ code to run the model:

```{r, eval = FALSE}
bi_prior <- sample(bi_prior)
```

The `sample` command returns an updated `libbi` object which, in this case, we just assign again to the `bi_prior` object. Any call of `sample` preserves options passed to the previous call of `sample` and `libbi`, unless they are overwritten by arguments passed to `sample` (e.g., passing a new `nsamples` argument).

Let's have a closer look at the `bi_prior` object:

```{r, eval = NOT_CRAN}
bi_prior
```

To see even more detail, try

```{r, eval = NOT_CRAN}
str(bi_prior)
```

We can see that the object contains 14 fields, including the model, the path to the `libbi` script, and the command used to run LibBi (`bi_prior$command`). The `options` field contains all the options that **LibBi** was called with, including the ones we passed to `sample`:

```{r, eval = NOT_CRAN}
bi_prior$options
```

The other fields contain various bits of information about the object, including the output file name:

```{r, eval = NOT_CRAN}
bi_prior$output_file_name
```

We can get the results of the sampling run using `bi_read`

```{r, eval = NOT_CRAN}
prior <- bi_read(bi_prior$output_file_name)
```

or with the shorthand

```{r, eval = NOT_CRAN}
prior <- bi_read(bi_prior)
```

which looks at the `output_file_name` field to read in the data. Let's look at the returned object:

```{r, eval = NOT_CRAN}
str(prior)
```

This is a list of 9 objects: 8 representing each of the (noise/state) variables and parameters in the file, and one number, `clock`, representing the time spent running the model in microseconds. We can see that the time-varying variables are represented as data frames with three columns: `np` (enumerating individual simulations), `time` and `value`. Parameters don't vary in time and just have `np` and `value` columns.

# Fitting a model to data using PMCMC

Let's perform inference using particle marginal Metropolis-Hastings (PMMH). Since the options from the previous calls are preserved, the following command will run 1,000 MCMC iterations, each using 32 particles (i.e., 32 * 1,000 = 32,000 model simulations in total), and therefore may take a little while to run (if you want to see the samples progress, use `verbose=TRUE` in the `sample` call).

```{r, eval = NOT_CRAN}
bi <- sample(bi_prior, target = "posterior", nparticles = 32, obs = sir_data)
```

This samples from the posterior distribution. Remember that options are preserved from previous runs (because we passed `bi_prior` as the first argument), so we don't need to specify `nsamples`, `end_time` and `noutputs` again, unless we want to change them. The `nparticles` option specifies the number of particles.

You can also pass a list of data frames (each element of the list corresponding to one observed variable) as the `obs` argument, for example

```{r, eval = NOT_CRAN}
df <- data.frame(
  time = c(0, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, 98, 105, 112),
  value = c(1, 6, 2, 26, 99, 57, 78, 57, 15, 9, 4, 1, 1, 1, 0, 2, 0)
)
bi_df <- sample(
  bi_prior, target = "posterior", nparticles = 32, obs = list(Incidence = df)
)
```

Init, input and observation files (see the [LibBi manual](https://libbi.org/docs/LibBi-Manual.pdf) for details) can be specified using the `init`, `input` and `obs` options, respectively.
They can each be specified either as the name of a NetCDF file containing the data, as a `libbi` object (in which case its output file will be used), or directly as an appropriate `R` object containing the data (e.g., a character vector of length one, or a list of data frames or numeric vectors). In the first `sample` command above, `obs` was specified as a `libbi` object (`sir_data`), so that the `Incidence` variable of the `sir_data` object was taken as observations; in the second, it was specified as a list containing a data frame.

The time dimension (or column, if a data frame) in the passed `init`, `input` and/or `obs` files can be specified using the `time_dim` option. If this is not given, it will be assumed to be `time`, if such a dimension exists, or, if not, any numeric column not called `value` (or the contents of the `value_column` option). If this does not produce a unique column name, an error will be thrown. All other dimensions/columns in the passed options will be interpreted as additional dimensions in the data, and stored in the `dims` field of the `libbi` object.

Any other options (apart from `log_file_name`, see the [Debugging](#debugging) section) will be passed on to the command `libbi` -- for a complete list, see the [LibBi manual](https://libbi.org/docs/LibBi-Manual.pdf). Hyphens can be replaced by underscores so as not to confuse R (see `end_time`). Any arguments starting with `enable`/`disable` can be specified as boolean (e.g., `assert=TRUE`). Any `dry-` options can be specified with a `"dry"` argument, e.g., `parse="dry"`.

# Analysing an MCMC run

Let's get the results of the preceding `sample` command:

```{r, eval = NOT_CRAN}
bi_contents(bi)
posterior <- bi_read(bi)
str(posterior)
```

We can see that this has two more objects than previously when we specified `target="prior"`: `loglikelihood` (the estimated log-likelihood of the parameters at each MCMC step) and `logprior` (the estimated log-prior density of the parameters at each MCMC step).

To get a summary of the parameters sampled, use

```{r, eval = NOT_CRAN}
summary(bi)
```

A summary of sampled trajectories can be obtained using

```{r, eval = NOT_CRAN, results = "hide"}
summary(bi, type = "state")
```

Any particular posterior sample can be viewed with `extract_sample` (with indices running from 0 to `nsamples - 1`):

```{r, eval = NOT_CRAN, results = "hide"}
extract_sample(bi, 314)
```

To analyse MCMC outputs, we can use the **coda** package and the `get_traces` function of **rbi**. Note that, to get exactly the same traces, you would have to set the seed as above.

```{r, eval = NOT_CRAN}
library("coda")
traces <- mcmc(get_traces(bi))
```

We can, for example, visualise parameter traces and densities with

```{r, eval = NOT_CRAN, fig.width = 8, fig.height = 8}
plot(traces)
```

Compare the marginal posterior distributions to the "correct" parameters used to generate the data set:

```{r, eval = NOT_CRAN}
bi_read(sir_data, type = "param")
```

For more details on using **coda** to further analyse the chains, see the website of the [coda package](https://cran.r-project.org/package=coda). For more plotting functionality, the [ggmcmc package](https://cran.r-project.org/package=ggmcmc) is also worth considering.
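As one such example of further **coda** analysis, effective sample sizes and numerical summaries of the chains can be obtained from the `traces` object created above (a minimal sketch; both functions are part of **coda**):

```{r, eval = NOT_CRAN}
# effective sample size of each parameter trace
effectiveSize(traces)
# numerical summaries (means and quantiles) of the sampled parameters
summary(traces)
```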
# Predictions

We can use the `predict` function to re-simulate the fitted model using the estimated parameters, that is, to generate samples from $p(x_t|\theta)$ where the $\theta$ are distributed according to the marginal posterior distribution $p(\theta|y^*_t)$ (here, $\theta$ are fixed parameters, $x_t$ are state trajectories and $y^*_t$ observed data points, as in the [LibBi manual](https://libbi.org/docs/LibBi-Manual.pdf)). This can be useful, for example, for comparing typical model trajectories to the data, or for running the model beyond the last data point.

```{r, eval = NOT_CRAN}
pred_bi <- predict(
  bi, start_time = 0, end_time = 20 * 7, output_every = 7,
  with = c("transform-obs-to-state")
)
```

where `with = c("transform-obs-to-state")` tells LibBi to treat observations as a state variable, that is, to randomly generate observations, i.e. samples from $p(y_t|\theta)$ where, again, $\theta$ are distributed according to the posterior distribution $p(\theta|y^*_t)$ (see the `with-transform-obs-to-state` option in the [LibBi manual](https://libbi.org/docs/LibBi-Manual.pdf)).

# Sample observations

To sample observations from sampled posterior state trajectories, that is, samples from $p(y_t|x_t)$ where the $x_t$ are distributed according to the posterior distribution $p(x_t|y^*_t)$, you can use

```{r, eval = NOT_CRAN}
obs_bi <- sample_obs(bi)
```

Compare this to the data:

```{r, eval = NOT_CRAN}
summary(obs_bi, type = "obs")
dataset$Incidence
```

# Filtering

The other so-called clients of LibBi (besides `sample`) are supported through commands of the same name: `filter`, `optimise` and `rewrite`. For example, to run a particle filter on the last posterior sample generated above, you can use:

```{r, eval = NOT_CRAN}
bi_filtered <- filter(bi)
```

# Plotting

Output from LibBi runs can be visualised using standard R plotting routines or plotting packages such as `ggplot2`. The `summary` function can help with this. For example, to plot observations randomly generated from the posterior distribution of the parameters and compare them to the data, we can use

```{r, eval = NOT_CRAN}
ps <- summary(pred_bi, type = "obs")

library("ggplot2")
ggplot(ps, aes(x = time)) +
  geom_line(aes(y = Median)) +
  geom_ribbon(aes(ymin = `1st Qu.`, ymax = `3rd Qu.`), alpha = 0.5) +
  geom_point(aes(y = value), dataset$Incidence, color = "darkred") +
  ylab("cases")
```

where we have plotted the median fit as a black line, the interquartile range as a grey ribbon, and the data points as dark red dots.

Compare this to observations randomly generated from the posterior distribution of trajectories:

```{r, eval = NOT_CRAN}
os <- summary(obs_bi, type = "obs")
ggplot(os, aes(x = time)) +
  geom_line(aes(y = Median)) +
  geom_ribbon(aes(ymin = `1st Qu.`, ymax = `3rd Qu.`), alpha = 0.5) +
  geom_point(aes(y = value), dataset$Incidence, color = "darkred") +
  ylab("cases")
```

# Saving and loading `libbi` objects

**rbi** provides its own versions of the `saveRDS` and `readRDS` functions, called `save_libbi` and `read_libbi`. These make sure that all information (including any options, input, init and observation files) is stored in the object.

```{r, eval = NOT_CRAN}
save_libbi(bi, "bi.rds")
bi <- read_libbi("bi.rds")
bi
```

# Creating `libbi` objects from previous runs

To recreate a `libbi` object from a previous R session, use `attach_data`.
For example, one could use the following code to get the acceptance rate for a *LibBi* run with a given output and model file:

```{r, eval = NOT_CRAN}
pz_run_output <- bi_read(system.file(package = "rbi", "example_output.nc"))
pz_model_file <- system.file(package = "rbi", "PZ.bi")
pz_posterior <- attach_data(libbi(pz_model_file), "output", pz_run_output)
traces <- mcmc(get_traces(pz_posterior))
a <- 1 - rejectionRate(traces)
a
```

# Debugging

For a general check of model syntax, the `rewrite` command is useful:

```{r, eval = FALSE}
rewrite(sir_model)
```

This generates the internal representation of the model for LibBi. It doesn't matter so much what this looks like, but it will throw an error if there is a problem.

If `libbi` throws an error, it is best to investigate by setting `debug = TRUE` and pointing `working_folder` to a folder that one can then use for debugging. Output of the `libbi` call can be saved in a file using the `log_file_name` option (by default a temporary file).

# Related packages

[rbi.helpers](https://github.com/sbfnk/rbi.helpers) contains higher-level methods to interact with **LibBi**, including methods for plotting the results of libbi runs and for adapting the proposal distribution and number of particles. For more information, see the [rbi.helpers vignette](http://sbfnk.github.io/rbi.helpers/articles/introduction.html).

# References

* Murray, L.M. (2013). [Bayesian state-space modelling on high-performance hardware using LibBi](https://arxiv.org/abs/1306.3277).