Comparison of Rhat versions: clarifying formulas for the potential scale reduction factor and its implications
Author: Xavier Fernández-i-Marín
March 6, 2019 - 7 minutes
Description of how to calculate the Rhat (Potential Scale Reduction Factor) for Bayesian convergence using different formulas, and the implications of chain length.
One of the formal measures of convergence in Bayesian inference is the Potential Scale Reduction Factor (\(\hat{R}\), “Rhat”). It is a weighted average of the between-chain and within-chain variances. The idea is that the closer this weighted average is to 1, the less evidence there is that the chains have not converged.
Multiple versions of the Rhat
The formula for this weighted average, however, has changed over time, and there are multiple specifications of it. A more detailed discussion can be found in this post on stackexchange.com, but the main idea is the following:
- The first idea of a potential scale reduction factor appears in Gelman and Rubin (1992).
- There is a minor change in the second version, in Brooks and Gelman (1998), which consists of changing the degrees-of-freedom part from df/(df-2) to (df+3)/(df+1) (see the sketch after this list). This is the version that the R package coda uses, and it can be used in ggmcmc by passing the argument version_rhat = "BG98" to the ggs_Rhat() function. Let’s call this the “BG98” version.
- A newer version appears in the first edition of Bayesian Data Analysis (Gelman et al. 1995). It does not include any sort of weighting by degrees of freedom.
- The most prevalent version, though, is the one in the second edition of Bayesian Data Analysis (let’s call this the “BDA2” version), although paradoxically it is not the one currently used in coda, nor in Stan. But this may change in the future.
- In Bayesian Data Analysis, third edition, the formula is equal to “BDA2”, but the original chains are split in half and treated as if there were twice as many chains of half the original length.
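To fix notation, here is my rendering of the “BDA2” version, based on the description above (a sketch, not a quotation from the book). With \(m\) chains of \(n\) iterations each, let \(W\) be the mean of the within-chain variances \(s_j^2\) and \(B\) the between-chain variance:
\[
B = \frac{n}{m-1} \sum_{j=1}^{m} \left( \bar{\theta}_{\cdot j} - \bar{\theta}_{\cdot \cdot} \right)^2, \qquad
W = \frac{1}{m} \sum_{j=1}^{m} s_j^2
\]
\[
\widehat{\mathrm{var}}^{+} = \frac{n-1}{n} W + \frac{1}{n} B, \qquad
\hat{R} = \sqrt{\frac{\widehat{\mathrm{var}}^{+}}{W}}
\]
The “BG98” version additionally applies the (df+3)/(df+1) degrees-of-freedom correction mentioned above. A minimal R sketch of the “BDA2” core, for a single parameter stored as a matrix with one column per chain (an illustration only, not the exact code used by ggmcmc or coda):
# theta: n x m matrix of draws (rows = iterations, columns = chains)
rhat_bda2 <- function(theta) {
  n <- nrow(theta)
  W <- mean(apply(theta, 2, var))       # mean within-chain variance
  B <- n * var(colMeans(theta))         # between-chain variance
  var_plus <- (n - 1) / n * W + B / n   # weighted average of W and B
  sqrt(var_plus / W)
}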
ggmcmc’s function ggs_Rhat() contains (starting from version 1.3) an argument that allows the user to specify which version of the computation is used, the “BG98” or the “BDA2” (until now the only one available, and the default from now on).
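For instance (a minimal sketch, where S is assumed to be a ggs object created from an mcmc.list, as in the example below; ggs_Rhat() returns a ggplot object, and the values it plots can be recovered from its $data slot):
ggs_Rhat(S, version_rhat = "BG98")  # with the degrees-of-freedom correction
ggs_Rhat(S, version_rhat = "BDA2")  # plain weighted average (the default)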
Example using the radon data in ggmcmc
This example aims at clarifying the differences between the two versions (“BG98” and “BDA2”) and their implications in different situations, specifically when comparing longer and shorter chains.
library(ggmcmc)
data(radon)
library(coda)
I am comparing the following:
- Source: the Rhat as calculated by “coda” or “ggmcmc”.
- Version: the “BG98” and the “BDA2”.
- Chain Length: the original “radon” example contains only 100 iterations in 2 chains (“Long”), and the shorter version even fewer (20, “Short”).
The following lines simply arrange the Rhat values for all the parameters (175) in the “radon” object (an “mcmc.list”) into a tidy data frame.
### Full output (100 iterations, "Long" length)
s.full <- radon$s.radon
gg.full <- ggs_Rhat(ggs(s.full), version_rhat = "BG98")$data
gg.full.bda2 <- ggs_Rhat(ggs(s.full), version_rhat = "BDA2")$data
coda.full <- gelman.diag(s.full, autoburnin = FALSE, multivariate = FALSE)$psrf
comp.full <- as.data.frame.table(coda.full, responseName = "Rhat") %>%
  tbl_df() %>%
  filter(Var2 == "Point est.") %>%
  select(-Var2) %>%
  rename(Parameter = Var1) %>%
  mutate(Source = "coda", Version = "BG98", Length = "Long")
comp.full <- bind_rows(comp.full, mutate(comp.full, Version = "BDA2")) %>%
  bind_rows(mutate(gg.full, Source = "ggmcmc", Version = "BG98", Length = "Long")) %>%
  bind_rows(gg.full.bda2 %>%
              select(-B, -W, -wa) %>%
              mutate(Source = "ggmcmc", Version = "BDA2", Length = "Long"))
### Part of the output (20 iterations, "Short" length)
s.short <- lapply(radon$s.radon, function(x) x[1:20, ])
class(s.short) <- "mcmc.list"
gg.short <- ggs_Rhat(ggs(s.short), version_rhat = "BG98")$data
gg.short.bda2 <- ggs_Rhat(ggs(s.short), version_rhat = "BDA2")$data
coda.short <- gelman.diag(s.short, autoburnin = FALSE, multivariate = FALSE)$psrf
comp.short <- as.data.frame.table(coda.short, responseName = "Rhat") %>%
  tbl_df() %>%
  filter(Var2 == "Point est.") %>%
  select(-Var2) %>%
  rename(Parameter = Var1) %>%
  mutate(Source = "coda", Version = "BG98", Length = "Short")
comp.short <- bind_rows(comp.short, mutate(comp.short, Version = "BDA2")) %>%
  bind_rows(mutate(gg.short, Source = "ggmcmc", Version = "BG98", Length = "Short")) %>%
  bind_rows(gg.short.bda2 %>%
              select(-B, -W, -wa) %>%
              mutate(Source = "ggmcmc", Version = "BDA2", Length = "Short"))
comp <- bind_rows(comp.full, comp.short)
The comparison object comp is a data frame with 1,400 observations (175 parameters * 2 Lengths * 2 Sources * 2 Versions).
comp
## # A tibble: 1,400 x 5
## Parameter Rhat Source Version Length
## <fct> <dbl> <chr> <chr> <chr>
## 1 alpha[1] 1.01 coda BG98 Long
## 2 alpha[2] 0.997 coda BG98 Long
## 3 alpha[3] 1.00 coda BG98 Long
## 4 alpha[4] 1.00 coda BG98 Long
## 5 alpha[5] 0.996 coda BG98 Long
## 6 alpha[6] 1.02 coda BG98 Long
## 7 alpha[7] 1.00 coda BG98 Long
## 8 alpha[8] 0.999 coda BG98 Long
## 9 alpha[9] 0.998 coda BG98 Long
## 10 alpha[10] 1.01 coda BG98 Long
## # … with 1,390 more rows
In order to focus on a comparison at the “Source” level (coda vs. ggmcmc), I spread the “Source” variable into two columns (“coda” and “ggmcmc”) containing the “Rhat” values.
comp.l <- comp %>%
  spread(Source, Rhat)
comp.l
## # A tibble: 700 x 5
## Parameter Version Length coda ggmcmc
## <fct> <chr> <chr> <dbl> <dbl>
## 1 alpha[1] BDA2 Long 1.01 1.00
## 2 alpha[1] BDA2 Short 0.980 0.978
## 3 alpha[1] BG98 Long 1.01 1.01
## 4 alpha[1] BG98 Short 0.980 0.980
## 5 alpha[2] BDA2 Long 0.997 0.995
## 6 alpha[2] BDA2 Short 0.989 0.975
## 7 alpha[2] BG98 Long 0.997 0.997
## 8 alpha[2] BG98 Short 0.989 0.989
## 9 alpha[3] BDA2 Long 1.00 0.999
## 10 alpha[3] BDA2 Short 0.989 0.975
## # … with 690 more rows
ggplot(comp.l, aes(x = coda, y = ggmcmc)) +
  geom_point() +
  facet_grid(Version ~ Length)
Of course, the coda package only implements the “BG98” version, and therefore in the second row of the plot (the “BG98” version) the “coda” and “ggmcmc” values map onto each other perfectly. But when ggmcmc is used and the two versions of the \(\hat{R}\) are directly compared, it can be seen that the “BDA2” version tends to produce values closer to 1, especially when the chain is longer.
In fact, the “rule of thumb” is to look at parameters with \(\hat{R} > 1.1\). According to this, in this example there are several such cases in the “BG98” version, a few in the “BDA2” version when using a really, really short chain (only 20 iterations), and none in the long chain (top left panel).
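One way to tabulate how many parameters the rule of thumb flags in each condition is the following sketch, using the comp object built above (output omitted):
comp %>%
  filter(Source == "ggmcmc", Rhat > 1.1) %>%
  count(Version, Length)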
I would say that the “BDA2” version is more liberal. So if you want a more restrictive / conservative convergence test (remember that Bayesians can never be sure of convergence, but only collect evidence for the lack of non-convergence), the “BG98” version is preferred.
Another way to see it is to compare the differences between the “BDA2” and the “BG98” versions (calculated as the ratio of “BG98” over “BDA2”).
bda2.diff <- comp %>%
  filter(Source == "ggmcmc") %>%
  spread(Version, Rhat) %>%
  mutate(p.diff = BG98 / BDA2)
ggplot(bda2.diff, aes(x = p.diff, fill = Length, color = Length)) +
  geom_density(alpha = 0.4)
Shorter chains, not surprisingly, show larger differences between the two versions of the estimator, reaching up to 20 percent in this example. But for the “Long” chains (relatively long, since in this case the length is only 100 iterations), most of the deviations are below 3 percent, and very rarely above 5 percent.
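These figures can be checked numerically by summarizing the ratio by chain length (a sketch using the bda2.diff object above; output omitted):
bda2.diff %>%
  group_by(Length) %>%
  summarize(Median = median(p.diff), q95 = quantile(p.diff, 0.95))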
Final remarks
For a Bayesian data analysis involving more than, say, 100 iterations, there is going to be virtually no difference in the Potential Scale Reduction Factor or \(\hat{R}\) as calculated using the “BDA2” or the “BG98” version. But for extreme cases with a low number of iterations (and chains), the more conservative measure is the “BG98”.
In any case, the function ggs_Rhat() in ggmcmc allows you to pick which version of the \(\hat{R}\) you want.
Citation
When using the ggmcmc package, don’t forget to cite it (Fernández-i-Marín 2016).
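The full entry can also be obtained from within R itself:
citation("ggmcmc")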
Brooks, Stephen P., and Andrew Gelman. 1998. “General Methods for Monitoring Convergence of Iterative Simulations.” Journal of Computational and Graphical Statistics 7 (4): 434–55.
Fernández-i-Marín, Xavier. 2016. “ggmcmc: Analysis of MCMC Samples and Bayesian Inference.” Journal of Statistical Software 70 (9): 1–20. https://doi.org/10.18637/jss.v070.i09.
Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 1995. Bayesian Data Analysis. Boca Raton, FL: Chapman & Hall/CRC.
Gelman, Andrew, and Donald B. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7 (4): 457–72.