Families and batches of parameters in Bayesian inference: how to treat them using ggmcmc

Author: Xavier Fernández-i-Marín
February 22, 2019 - 10 minutes
Tutorial of how to deal with families of parameters in Bayesian inference using ggmcmc
ggmcmc R Bayesian

When working with Bayesian inference using sampling procedures (JAGS, Stan, Bugs) it is usually the case that the parameters come in different families (the “thetas”, the “betas”, the “alphas”), specially when dealing with hierarchical / multilevel models.

Sometimes parameters come in a single dimension (like beta[1], beta[2], beta[3], …), and sometimes they come in multiple dimensions (beta[1,1], beta[1,2], …). The reasons may be diverse, but it is quite common that one dimension refers to, for instance, an intercept, and the second dimension referst to different conditions of the experiment or the observation. This entry shows how to flexibly manage different situations in which families of parameters are involved.

Let’s start with the data provided by the ggmcmc package.

library(ggmcmc)
data(linear)

The original data is a coda object called s, that stores several parameters, some of them belonging to the same family, in this case, the beta family.

S <- ggs(s)
levels(S$Parameter)
## [1] "beta[1]" "beta[2]" "sigma"

Single dimensional parameters

In case only the beta family of parameters is needed, this can be easily accomplished with the argument family, that accepts an R regular expression. In its simplest form, passing “beta” will keep all parameters containing it, which includes not only “beta”, but also possible “sigma.beta”, or other combinations.

S.beta <- ggs(s, family = "beta")
S.beta
## # A tibble: 4,000 x 4
##    Iteration Chain Parameter value
##        <int> <int> <fct>     <dbl>
##  1         1     1 beta[1]    3.04
##  2         2     1 beta[1]    2.85
##  3         3     1 beta[1]    2.85
##  4         4     1 beta[1]    2.58
##  5         5     1 beta[1]    2.46
##  6         6     1 beta[1]    3.05
##  7         7     1 beta[1]    2.67
##  8         8     1 beta[1]    2.64
##  9         9     1 beta[1]    2.76
## 10        10     1 beta[1]    2.74
## # … with 3,990 more rows
levels(S.beta$Parameter)
## [1] "beta[1]" "beta[2]"

The S.beta object only contains the “beta” parameters, and also only contains the bare minimum variables for a ggs() object to work, namely: Iteration, Chain, Parameter and value.

Now the S.beta object can be treated like any other ggs() object:

ggs_traceplot(S.beta)
Default representation of a traceplot showing the samples of a family of parameters.

Figure 1: Default representation of a traceplot showing the samples of a family of parameters.

Single dimensional parameters with labels

There is the possibility to complicate a bit more the representation when combining the family parameter with the par_labels one, that allows to incorporate meaningful labels to parameters. In this case, for instance, we will define the “beta” parameters as having different labels, but sharing a single dimension. We create a data frame that maps/links the original parameter names (beta[1] and beta[2] to different properties (like being in a single dimension called “First dimension”, but being different in the value of their Variable name.

L.betas <- expand.grid(b = 1:2) %>%
  mutate(Parameter = paste("beta[", b, "]", sep = ""),
         Variable = factor(b, labels = c("One", "Two")),
         Dimension = "First dimension",
         Label = paste(Variable, Dimension, sep = " : "))
L.betas
##   b Parameter Variable       Dimension                 Label
## 1 1   beta[1]      One First dimension One : First dimension
## 2 2   beta[2]      Two First dimension Two : First dimension

Now we also incorporate this dataframe into the ggs() object, using the par_labels argument:

S.beta <- ggs(s, family = "beta", par_labels = L.betas)
S.beta
## # A tibble: 4,000 x 8
##    Iteration Chain Parameter value ParameterOrigin…     b Variable
##        <int> <int> <fct>     <dbl> <fct>            <int> <fct>   
##  1         1     1 One : Fi…  3.04 beta[1]              1 One     
##  2         2     1 One : Fi…  2.85 beta[1]              1 One     
##  3         3     1 One : Fi…  2.85 beta[1]              1 One     
##  4         4     1 One : Fi…  2.58 beta[1]              1 One     
##  5         5     1 One : Fi…  2.46 beta[1]              1 One     
##  6         6     1 One : Fi…  3.05 beta[1]              1 One     
##  7         7     1 One : Fi…  2.67 beta[1]              1 One     
##  8         8     1 One : Fi…  2.64 beta[1]              1 One     
##  9         9     1 One : Fi…  2.76 beta[1]              1 One     
## 10        10     1 One : Fi…  2.74 beta[1]              1 One     
## # … with 3,990 more rows, and 1 more variable: Dimension <chr>

In this case, the object now does not only contain the bare minimum variables, but also the other ones passed through the mapping of labels and features of the “beta” parameters, namely the “b”, the “Variable”, and the “Dimension”. The original parameter name is kept under “ParameterOriginal” and the new parameter name used in the labels is the one defined by “Label” in the L.betas object.

Now we can treat this object with more flexibility, as it contains features that define the “beta” parameters.

ggs_traceplot(S.beta) + 
  facet_grid(~ Variable)

Multiple dimensions

In order to work with multiple dimensional parameters, we first have to fake it using the samples at hand. The following steps are never needed when the data is already structured in two (or more) dimensions.

# generate a bidimensional beta matrix out of repeating the first
# chain of the original object and the second chain with minus/plus 
# differences.
s.dup <- list(cbind(s[[1]] - 2, s[[1]] + 2), cbind(s[[2]] - 2, s[[2]] + 2))
dimnames(s.dup[[1]])[[2]] <- dimnames(s.dup[[2]])[[2]] <- 
  c("beta[1,1]", "beta[1,2]", "sigma[1]", "beta[2,1]", "beta[2,2]", "sigma[2]")
attributes(s.dup[[1]])$mcpar <- attributes(s.dup[[2]])$mcpar <- 
  attributes(s[[1]])$mcpar
attributes(s.dup[[1]])$class <- attributes(s.dup[[2]])$class <- 
  attributes(s[[1]])$class
class(s.dup) <- "mcmc.list"

The s.dup object now contains a two chains for 6 parameters (4 of them are from the “beta” family).

S.dup <- ggs(s.dup)
str(S.dup)
## Classes 'tbl_df', 'tbl' and 'data.frame':    12000 obs. of  4 variables:
##  $ Iteration: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Chain    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Parameter: Factor w/ 6 levels "beta[1,1]","beta[1,2]",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ value    : num  1.043 0.853 0.85 0.575 0.463 ...
##  - attr(*, "nChains")= int 2
##  - attr(*, "nParameters")= int 6
##  - attr(*, "nIterations")= int 1000
##  - attr(*, "nBurnin")= num 1000
##  - attr(*, "nThin")= num 1
##  - attr(*, "description")= chr "s.dup"

Then, create again the object with the labels for a two-dimensional family of parameters.

L.betas <- expand.grid(v = 1:2, d = 1:2) %>%
  mutate(Parameter = paste("beta[", v, ",", d, "]", sep = ""),
         Variable = factor(v, labels = c("One", "Two")),
         Dimension = factor(d, labels = c("First", "Second")),
         Label = paste(Variable, Dimension, sep = " : "))
L.betas
##   v d Parameter Variable Dimension        Label
## 1 1 1 beta[1,1]      One     First  One : First
## 2 2 1 beta[2,1]      Two     First  Two : First
## 3 1 2 beta[1,2]      One    Second One : Second
## 4 2 2 beta[2,2]      Two    Second Two : Second
S.beta <- ggs(s.dup, family = "beta", par_labels = L.betas)

And now just play with the display of the two variables (“Variable” and “Dimension”) than contain characteristics of the “beta” parameters.

ggs_traceplot(S.beta) + 
  facet_grid(Dimension ~ Variable)

ggs_density(S.beta, hpd = TRUE) +
  facet_wrap(Variable ~ Dimension)

The examples covered here are only for families of parameters with one or two dimensions, but extending it to more dimensions is trivial with the modification of the dataset with the parameter labels. For instance, adding another dimension (let’s call it “Year”) would be as follows:

L.betas.time <- expand.grid(v = 1:2, d = 1:2, y = 1:9) %>%
  mutate(Parameter = paste("beta[", v, ",", d, ",", y, "]", sep = ""),
         Variable = factor(v, labels = c("One", "Two")),
         Dimension = factor(d, labels = c("First", "Second")),
         Year = as.numeric(as.character(factor(y, labels = 2001:2009))),
         Label = paste(Variable, Dimension, sep = " : "))

Example: electoral forecast in Bayern 2018

In a post on forecasting elections I presented the electoral forecasts for different parties through time. The \(\xi\) (xi) parameters account for the electoral support of every party (\(p\)) at a specific day (\(t\), for “time”), and therefore have two dimensions (\(\xi_{p,t}\)).

In order to generate a data frame with the meaningful labels for \(\xi\) we use the following:

# "names.parties" is a character vector containing the names of the parties
# in the correct order.
# "series.start" is a date when the forecast begins.
L.xis <- expand.grid(p = 1:PA, t = 1:T) %>%
  mutate(Parameter = paste("xi[", p, ",", t, "]", sep = "")) %>%
  mutate(Party = factor(p, label = names.parties)) %>%
  mutate(Date = series.start + t - 1) %>%
  mutate(Label = paste(Party, Date, sep = " : "))
xis <- ggs(s, family = "xi", par_labels = L.xis)
str(xis)
## Classes 'tbl_df', 'tbl' and 'data.frame':    5564800 obs. of  9 variables:
##  $ Iteration        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Chain            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Parameter        : Factor w/ 13912 levels "AfD : 2014-01-09",..: 1740 1740 1740 1740 1740 1740 1740 1740 1740 1740 ...
##  $ value            : num  49.8 48.7 49.5 49.6 48.5 ...
##  $ ParameterOriginal: Factor w/ 13912 levels "xi[1,1]","xi[1,10]",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ p                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ t                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Party            : Factor w/ 8 levels "CSU","SPD","FW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date             : Date, format: "2014-01-09" "2014-01-09" ...
##  - attr(*, "nChains")= int 2
##  - attr(*, "nParameters")= int 13912
##  - attr(*, "nIterations")= int 200
##  - attr(*, "nBurnin")= num 50091
##  - attr(*, "nThin")= num 10
##  - attr(*, "description")= chr "s"

xis is a huge object containing 5,564,800 entries, corresponding to 13,912 parameters, two chains and 200 iterations. The reason is that the forecast is performed for every day between the last election (in 2014) until the election date in late 2018. Once the convergence tests have been performed using the ggs_...() functions, we can simply summarize the posteriors using the ci() function and keeping only the point estimates of interest.

ci.xis <- ci(xis)
ci.xis
## # A tibble: 13,912 x 11
##    Parameter   low   Low median  High  high ParameterOrigin…     p     t
##    <fct>     <dbl> <dbl>  <dbl> <dbl> <dbl> <fct>            <int> <int>
##  1 AfD : 20…  7.94  8.01   8.92  9.47  9.62 xi[8,1]              8     1
##  2 AfD : 20…  7.93  8.01   8.93  9.51  9.56 xi[8,2]              8     2
##  3 AfD : 20…  7.99  8.02   8.96  9.44  9.58 xi[8,3]              8     3
##  4 AfD : 20…  7.96  8.02   8.97  9.48  9.56 xi[8,4]              8     4
##  5 AfD : 20…  7.96  8.03   9.02  9.48  9.59 xi[8,5]              8     5
##  6 AfD : 20…  7.95  7.99   9.03  9.44  9.49 xi[8,6]              8     6
##  7 AfD : 20…  7.93  8.01   9.03  9.43  9.50 xi[8,7]              8     7
##  8 AfD : 20…  7.91  7.96   9.04  9.43  9.50 xi[8,8]              8     8
##  9 AfD : 20…  7.90  7.97   9.04  9.42  9.50 xi[8,9]              8     9
## 10 AfD : 20…  7.94  7.96   9.07  9.44  9.49 xi[8,10]             8    10
## # … with 13,902 more rows, and 2 more variables: Party <fct>, Date <date>

In this case ci.xis contains “only” 13,912 entries, with the median, 90 and 95 percent credible intervals. Because it also incorporates the labels and meta-data about \(xi\), it is possible to use a simple call to ggplot() to generate the figures of interest.

This is a simple point estimate of the temporal evolution of the forecasted value for every party.

ggplot(ci.xis, aes(x = Date, y = median,
                   color = Party, group = Party)) +
  geom_line()

In order to incorporate the uncertainty bands we can simply use the geom_ribbon() element overimposed to the line.

ggplot(ci.xis, aes(x = Date, y = median,
                   color = Party, group = Party)) +
  geom_line() +
  geom_ribbon(aes(ymin = low, ymax = high, 
                  fill = Party), color = NA, alpha = 0.2)

Final remarks

The combination of a tidy dataset in the ggs format with the arguments “family” and “par_labels” provides a simple and unified way to treat samples generated by JAGS / Stan / Bugs using as a basis the functions provided by the ggmcmc package, namely those that start with ggs_...(). Combining them with facet_grid() or facet_wrap(), or aes() and many other options in ggplot2 gives the user a lot of flexibility.

Citation

When usind the ggmcmc don’t forget to cite it using: (Fernández-i-Marín 2016)

Fernández-i-Marín, Xavier. 2016. “Ggmcmc: Analysis of Mcmc Samples and Bayesian Inference.” Journal of Statistical Software 70 (1): 1–20. doi:10.18637/jss.v070.i09.

Comparison of Rhat versions: clarifying formulas for the potential scale reduction factor and its implications

Author: Xavier Fernández-i-Marín
March 6, 2019 - 7 minutes
Description of how to calculate Rhat (Potential Scale Reduction Factor) for Bayesian convergence usind different formulas, and its impact on the length of the chains
ggmcmc R Bayesian

V-Dem by default: Load and process V-Dem democracy scores in R

Author: Xavier Fernández-i-Marín
January 11, 2019 - 6 minutes
Tutorial of how to perform data analysis of the Varieties of Democracies (V-Dem) dataset in R
Governance R Data visualization

Bavarian state elections 2018 - Evaluation of the forecasts

Author: Xavier Fernández-i-Marín
October 16, 2018 - 3 minutes
Comparison of the electoral results and forecasts for Bavarian elections 2018
Electoral forecast Bayesian Data visualization
comments powered by Disqus