Comparison of Rhat versions: clarifying formulas for the potential scale reduction factor and its implications

Author: Xavier Fernández-i-Marín
March 6, 2019 - 7 minutes
Description of how to calculate Rhat (the Potential Scale Reduction Factor) for assessing Bayesian convergence using different formulas, and the impact of chain length on the result
ggmcmc R Bayesian

One of the formal measures of convergence in Bayesian inference is the Potential Scale Reduction Factor (\(\hat{R}\), “Rhat”). It is a weighted average of the between-chain and within-chain variances. The idea is that the closer this weighted average is to 1, the less evidence there is that the chains have not converged.

Multiple versions for the Rhat

The concrete formula for this weighted average, however, has changed over time, and there are multiple specifications of it. A more detailed discussion can be found in this post on stackexchange.com, but the main ideas are the following:

  • The first idea of a potential scale reduction factor appears in (Gelman and Rubin 1992)
  • There is a minor change in the second version (Brooks and Gelman 1998), which replaces the degrees-of-freedom correction df/(df-2) with (df+3)/(df+1). This is the version that the R package coda uses, and it can be selected in ggmcmc by passing the argument version_rhat = "BG98" to the ggs_Rhat() function. Let’s call this the “BG98” version.
  • A newer version appears in the first edition of Bayesian Data Analysis (Gelman et al. 1995). It does not include any weighting by degrees of freedom.
  • The most prevalent version, though, is the one in the second edition of Bayesian Data Analysis (let’s call this the “BDA2” version), although paradoxically it is currently used neither in coda nor in stan. But this may change in the future.
  • In the third edition of Bayesian Data Analysis the formula is equal to “BDA2”, but the original chains are split in half and treated as if there were twice as many chains of half the original length.
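All these versions share the same building blocks. Following the notation of Bayesian Data Analysis, with \(m\) chains of \(n\) iterations each, \(s^2_j\) the within-chain variances and \(\bar{\theta}_{\cdot j}\) the chain means:

\[
W = \frac{1}{m}\sum_{j=1}^{m} s^2_j, \qquad
B = \frac{n}{m-1}\sum_{j=1}^{m}\left(\bar{\theta}_{\cdot j} - \bar{\theta}_{\cdot\cdot}\right)^2
\]

\[
\widehat{\text{var}}^{+} = \frac{n-1}{n}\,W + \frac{1}{n}\,B, \qquad
\hat{R}_{\text{BDA2}} = \sqrt{\frac{\widehat{\text{var}}^{+}}{W}}
\]

The “BG98” version adds a sampling-variance term and the degrees-of-freedom correction mentioned above:

\[
\hat{V} = \widehat{\text{var}}^{+} + \frac{B}{mn}, \qquad
\hat{R}_{\text{BG98}} = \sqrt{\frac{d+3}{d+1} \cdot \frac{\hat{V}}{W}}
\]

where \(d\) is the estimated degrees of freedom of the approximating t distribution.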

Starting from version 1.3, ggmcmc’s function ggs_Rhat() includes an argument (version_rhat) that allows the user to specify which version of the computation is used: “BG98” or “BDA2” (the latter being the only one available until now, and the default from now on).
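To make the weighted average concrete, the “BDA2” computation can be sketched by hand on a matrix of simulated draws (this is an illustration of the formula, not ggmcmc’s internal code; the helper name rhat_bda2 is mine):

```r
# By-hand sketch of the "BDA2" computation (not ggmcmc's internal code):
# m chains of n iterations, stored as the columns of a matrix.
rhat_bda2 <- function(theta) {
  n <- nrow(theta)
  W <- mean(apply(theta, 2, var))        # mean within-chain variance
  B <- n * var(apply(theta, 2, mean))    # between-chain variance
  var.hat <- (n - 1) / n * W + B / n     # weighted average of W and B
  sqrt(var.hat / W)
}

set.seed(1)
theta <- matrix(rnorm(100 * 2), ncol = 2)  # 2 well-mixed chains, 100 iterations
rhat_bda2(theta)                           # should be close to 1
```

The “BG98” version would additionally multiply the variance ratio by \((d+3)/(d+1)\), with \(d\) the estimated degrees of freedom of the approximating t distribution.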

Example using the radon data in ggmcmc

This example aims at clarifying the differences between the two versions (“BG98” and “BDA2”) and their implications for different problems, specifically when comparing longer and shorter chains.

library(ggmcmc)
data(radon)
library(coda)

I am comparing the following:

  • Source: the Rhat as calculated by “coda” or “ggmcmc”.
  • Version: the “BG98” and the “BDA2” formulas.
  • Chain Length: the original “radon” example contains only 100 iterations in 2 chains (“Long”), and a shortened version contains even fewer (20 iterations, “Short”).

The following lines simply arrange the Rhat values for all 175 parameters in the “radon” object (an “mcmc.list”) in a tidy data frame.

### Full output (100 iterations, "Long" length)
s.full <- radon$s.radon
gg.full <- ggs_Rhat(ggs(s.full), version_rhat = "BG98")$data
gg.full.bda2 <- ggs_Rhat(ggs(s.full), version_rhat = "BDA2")$data
coda.full <- gelman.diag(s.full, autoburnin = F, multivariate = F)$psrf

comp.full <- as.data.frame.table(coda.full, 
                                 responseName = "Rhat") %>%
  tbl_df() %>%
  filter(Var2 == "Point est.") %>%
  select(-Var2) %>%
  rename(Parameter = Var1) %>%
  mutate(Source = "coda", Version = "BG98", Length = "Long")
comp.full <- bind_rows(comp.full, mutate(comp.full, Version = "BDA2")) %>%
  bind_rows(mutate(gg.full, Source = "ggmcmc", Version = "BG98", Length = "Long")) %>%
  bind_rows(gg.full.bda2 %>%
            select(-B, -W, -wa) %>%
            mutate(Source = "ggmcmc", Version = "BDA2", Length = "Long"))

### Part of the output (20 iterations, "Short" length)
s.short <- lapply(radon$s.radon, function(x) x[1:20,])
class(s.short) <- "mcmc.list"
gg.short <- ggs_Rhat(ggs(s.short), version_rhat = "BG98")$data
gg.short.bda2 <- ggs_Rhat(ggs(s.short), version_rhat = "BDA2")$data
coda.short <- gelman.diag(s.short, autoburnin = F, multivariate = F)$psrf

comp.short <- as.data.frame.table(coda.short, 
                                 responseName = "Rhat") %>%
  tbl_df() %>%
  filter(Var2 == "Point est.") %>%
  select(-Var2) %>%
  rename(Parameter = Var1) %>%
  mutate(Source = "coda", Version = "BG98", Length = "Short")
comp.short <- bind_rows(comp.short, mutate(comp.short, Version = "BDA2")) %>%
  bind_rows(mutate(gg.short, Source = "ggmcmc", Version = "BG98", Length = "Short")) %>%
  bind_rows(gg.short.bda2 %>%
            select(-B, -W, -wa) %>%
            mutate(Source = "ggmcmc", Version = "BDA2", Length = "Short"))

comp <- bind_rows(comp.full, comp.short)

The comparison object comp is a data frame with 1,400 observations (175 parameters * 2 Lengths * 2 Sources * 2 Versions).

comp
## # A tibble: 1,400 x 5
##    Parameter  Rhat Source Version Length
##    <fct>     <dbl> <chr>  <chr>   <chr> 
##  1 alpha[1]  1.01  coda   BG98    Long  
##  2 alpha[2]  0.997 coda   BG98    Long  
##  3 alpha[3]  1.00  coda   BG98    Long  
##  4 alpha[4]  1.00  coda   BG98    Long  
##  5 alpha[5]  0.996 coda   BG98    Long  
##  6 alpha[6]  1.02  coda   BG98    Long  
##  7 alpha[7]  1.00  coda   BG98    Long  
##  8 alpha[8]  0.999 coda   BG98    Long  
##  9 alpha[9]  0.998 coda   BG98    Long  
## 10 alpha[10] 1.01  coda   BG98    Long  
## # … with 1,390 more rows

In order to focus on a comparison at the “Source” level (coda vs. ggmcmc), I spread the “Source” variable into two columns holding the “Rhat” values.

comp.l <- comp %>%
  spread(Source, Rhat)
comp.l
## # A tibble: 700 x 5
##    Parameter Version Length  coda ggmcmc
##    <fct>     <chr>   <chr>  <dbl>  <dbl>
##  1 alpha[1]  BDA2    Long   1.01   1.00 
##  2 alpha[1]  BDA2    Short  0.980  0.978
##  3 alpha[1]  BG98    Long   1.01   1.01 
##  4 alpha[1]  BG98    Short  0.980  0.980
##  5 alpha[2]  BDA2    Long   0.997  0.995
##  6 alpha[2]  BDA2    Short  0.989  0.975
##  7 alpha[2]  BG98    Long   0.997  0.997
##  8 alpha[2]  BG98    Short  0.989  0.989
##  9 alpha[3]  BDA2    Long   1.00   0.999
## 10 alpha[3]  BDA2    Short  0.989  0.975
## # … with 690 more rows
ggplot(comp.l, aes(x = coda, y = ggmcmc)) +
  geom_point() +
  facet_grid(Version ~ Length)

Figure 1: Comparison of Rhat for 175 parameters, by Chain length and version of the calculation.

Of course, the coda package uses only the “BG98” version, and therefore in the second row the “coda” and “ggmcmc” values map perfectly onto each other. But when ggmcmc is used and the two versions of the \(\hat{R}\) calculation are compared directly, it can be seen that the “BDA2” version tends to produce values closer to 1, especially when the chain is longer.

In fact, the “rule of thumb” is to flag parameters with \(\hat{R} > 1.1\). According to this criterion, in this example there are several such cases in the “BG98” version, a few in the “BDA2” version using a really, really short chain (only 20 iterations), and none in the long chain (top-left panel).

I would say that the “BDA2” version is more liberal. If you want to apply a more restrictive / conservative convergence test, that is, to gather stronger evidence against non-convergence (remember that Bayesians can never be sure of convergence, but only collect evidence for the lack of non-convergence), the “BG98” version is preferable.

Another way to see it is to compare the differences between the “BDA2” and “BG98” versions (calculated as the ratio of “BG98” over “BDA2”).

bda2.diff <- comp %>%
  filter(Source == "ggmcmc") %>%
  spread(Version, Rhat) %>%
  mutate(p.diff = BG98 / BDA2)

ggplot(bda2.diff, aes(x = p.diff, fill = Length, color = Length)) + 
  geom_density(alpha = 0.4)

Figure 2: Comparison of the percentage of difference between the two versions of the Rhat according to the length of the chain (Long, 100 iterations, and Short, 20 iterations), for the radon dataset.

Shorter chains, not surprisingly, show larger differences between the two versions of the estimator, reaching 20 percent in this example. But for the “Long” chains (relatively long, as in this case the length is only 100 iterations), most of the deviations are below 3 percent, and rarely above 5 percent.

Final remarks

For a Bayesian data analysis involving more than, say, 100 iterations, there is going to be virtually no difference in the Potential Scale Reduction Factor (\(\hat{R}\)) whether it is calculated using the “BDA2” or the “BG98” version. But for extreme cases with a low number of iterations (and chains), the more conservative measure is “BG98”. In any case, the function ggs_Rhat() in ggmcmc allows you to pick which version of the \(\hat{R}\) you want.

Citation

When using ggmcmc, don’t forget to cite it: (Fernández-i-Marín 2016)

Brooks, Stephen P, and Andrew Gelman. 1998. “General Methods for Monitoring Convergence of Iterative Simulations.” Journal of Computational and Graphical Statistics 7 (4): 434–55.

Fernández-i-Marín, Xavier. 2016. “ggmcmc: Analysis of MCMC Samples and Bayesian Inference.” Journal of Statistical Software 70 (1): 1–20. https://doi.org/10.18637/jss.v070.i09.

Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 1995. Bayesian Data Analysis. Boca Raton, FL: Chapman & Hall/CRC.

Gelman, Andrew, and Donald B. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science, 457–72.
