Processing Stan samples with ggmcmc
Author: Xavier Fernández-i-Marín
October 16, 2012 - 4 minutes
Bayesian mcmc R Stan ggmcmc

Stan is a relative newcomer among the Bayesian tools aimed at providing samples from posteriors. But despite its novelty, it offers many interesting features for Bayesians, which will probably be discussed here in the future. Amongst them, having a sampler written in a really fast programming language (C++) makes many of us dream of shorter running times for our models.
In order to make Stan and ggmcmc work together there are two possibilities, depending on how Stan is being used: either inside R with rstan, or as a standalone program.
Create a fake dataset
In R we can easily create a fake dataset to play with.
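A minimal simulation could look like this (the variable names are illustrative):

    N <- 500
    X <- rnorm(N)                 # a single covariate
    y <- 3 + 5 * X + rnorm(N)     # intercept 3, slope 5, noise sd 1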
This creates 500 data points from a simple linear model with intercept = 3, slope = 5 and standard deviation = 1 (which is the default for rnorm()).
Data can be easily passed to Stan via a list:
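For instance (the names must match the model's data block below):

    D <- list(N = N, X = X, y = y)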
The model declaration is stored simply as an R character object:
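A sketch of such a declaration, consistent with the description that follows (the parameter names intercept, slope and sigma are assumptions):

    model <- "
    data {
      int<lower=0> N;
      real X[N];
      real y[N];
    }
    parameters {
      real intercept;
      real slope;
      real<lower=0> sigma;
    }
    model {
      for (n in 1:N)
        y[n] ~ normal(intercept + slope * X[n], sigma);
    }
    "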
There are three parts in this declaration. The first one (data) declares the data that will be used; the types of the variables must be specified. In this case N is an integer with a minimum value of 0, since it will be used to loop across the observations, while X and y are declared as any real number.
The second part (parameters) declares the parameters to be sampled, with their types and the constraints on their support.
The third part (model) is really the core of the model specification. It declares the relationships between data and parameters, and prior specifications of the parameters can also be included here. In this case y is distributed normally, with mean equal to an arithmetic relationship between the intercept, the slope and the data, and standard deviation equal to sigma. Notice that, contrary to JAGS and *Bugs, there is no need to specify the dispersion parameter in terms of its precision (the inverse of the variance).
Running Stan inside R with rstan
Now, with the data and the model ready, we can run Stan inside R:
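For instance, using the objects sketched above:

    library(rstan)
    fit <- stan(model_code = model, data = D)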
By default, the stan function call runs 4 chains of length 2000, with a warm-up interval of half that length (hence, 1000). Notice that Stan refers to a "warm-up" period instead of the "burn-in" period used in JAGS and *Bugs.
In order to access the samples we can use the extract method, discarding the warm-up period:
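A sketch; setting permuted = FALSE keeps the chains separate:

    # iterations x chains x parameters, with the warm-up excluded
    s <- extract(fit, permuted = FALSE, inc_warmup = FALSE)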
s is now an array with dimensions [1:n_iter, 1:n_chains, 1:n_parameters], without warm-up.
Then s can be converted into a typical coda object:
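One way to do it (a sketch; each chain becomes an mcmc object, collected into an mcmc.list):

    library(coda)
    s.mcmc <- mcmc.list(lapply(1:dim(s)[2],
                               function(chain) mcmc(s[, chain, ])))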
Finally, the mcmc list can be easily transformed into an object that ggmcmc understands using a call to ggs():
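    library(ggmcmc)
    S <- ggs(s.mcmc)   # ggs() accepts an mcmc.list directly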
Now the ggmcmc functions can be called using S as the argument: ggs_histogram, ggs_traceplot, ggs_density, ggs_compare_partial, ggs_running, ggs_autocorrelation, ggs_crosscorrelation and ggs_caterpillar.
Or generating a single pdf file with all the output:
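A single call to ggmcmc() produces it (the file argument can be used to change the output file name):

    ggmcmc(S)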
A traceplot (or any of the other plots that ggmcmc provides) of the samples can be easily obtained with:
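    ggs_traceplot(S)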
Running Stan alone
In order to run Stan without using the rstan package, the model and the data must be written to disk:
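Something like this (the file names are illustrative; base R's dump() writes the data in the R dump format that Stan reads):

    cat(model, file = "model.stan")
    dump(c("N", "X", "y"), file = "model.data.R")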
Then the usual steps required to run Stan follow, namely the conversion of the Stan code to C++ code, the compilation of that code, and the execution of the generated binary against the data. Please refer to the manual for the execution on your specific platform. On my Gentoo GNU/Linux system, with Stan installed in /usr/local/stan/, I simply run the following in the command line to generate two chains:
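A sketch of those steps for a Stan 1.x install (the make target and the binary's flags are stated from memory and should be checked against the manual for your version):

    cd /usr/local/stan
    make /path/to/model    # stanc translation plus C++ compilation
    cd /path/to
    ./model --data=model.data.R --samples=samples1.csv --chain_id=1
    ./model --data=model.data.R --samples=samples2.csv --chain_id=2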
The chains can be imported easily as a table and transformed into an mcmc object with the following lines:
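A sketch, assuming the two CSV files produced above (comment.char = "#" skips the commented header that Stan writes):

    library(coda)
    chain1 <- read.table("samples1.csv", header = TRUE, sep = ",", comment.char = "#")
    chain2 <- read.table("samples2.csv", header = TRUE, sep = ",", comment.char = "#")
    s.mcmc <- mcmc.list(mcmc(chain1[, -c(1:3)]), mcmc(chain2[, -c(1:3)]))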
The -c(1:3) avoids loading Stan's internal variables.
Finally, the s.mcmc object can also be converted into a ggmcmc object:
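As before:

    S <- ggs(s.mcmc)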
Is Stan fast?
My experience with Stan with respect to execution times is mixed, with some examples giving valid samples in a short period of time and others taking much longer than JAGS. I will do more extensive benchmarks and post the results here.