Processing Stan samples with ggmcmc

Stan is a relative newcomer in the scenario of Bayesian tools aimed at provide samples from posteriors. But despite its novelty, it offers many interesting features for Bayesians. Amongst the features, which will probably be discussed here in the future, the possibility of having a sampler in a really fast programming language (C++) makes many of us dream of shorter times to run our models.

In order to make Stan and ggmcmc work together there are two possibilities, depending on how Stan is being used, either inside R with rstan or as a standalone program.

Create a fake dataset

In R we can easily create a fake dataset to play with.

N <- 500
X <- runif(N)
y <- 3 + 5 * X + rnorm(N)

This creates 500 data points from a plain simple linear model with intercept=3,slope=5 and standard deviation=1 (which is the default for rnorm()).

Data can be easily passed to Stan via a list:

D <- list(N=N, X=X, y=y)

The model declaration is stored simply as an R character object

md <- '
data {
  int<lower=0> N;
  real X[N];
  real y[N];
}
parameters {
  real theta[2];
  real<lower=0> sigma;
}
model {
  for (n in 1:N) {
    y[n] ~ normal(theta[1] + theta[2] * X[n], sigma);
  }
}
'

There are three parts in this declaration. The first one (data) declares the data that will be used. The type of variables can (must) be specified. In this case N is an integer with a minimum value of 0, since it will loop across observations. X and y will be declared as any real number.

The second part (parameters) declares the prior specifications of the parameters to be sampled.

The third part (model) is really the core of the model specification. It declares the relationships between data and parameters. Prior specifications of the parameters can also be included here. In this case y is distributed normally with mean equal to an arithmetic relationship between the intercept, the slope and the data, and standard deviation equals to sigma. Notice that, contrary to JAGS and *Bugs, there is no more need to specify the dispersion parameter in terms of its precision (inverse of the variance).

Running Stan inside R with rstan

Now, with the data and the model ready we can run stan inside R

library(rstan)
m <- stan(model_code=md, data=D)

By default, the stan function call runs 4 chains of length 2000, with a warm-up interval of half the length (hence, 1000). Notice that Stan refers to “warm-up” instead of “burn-in” period, used in JAGS and *Bugs.

In order to access the samples we can use the extract method, discarding the warm-up period.

s <- extract(m, inc_warmup=FALSE) 

s is now an array with dimensions [1:n_iter, 1:n_chains, 1:n_parameters], without warm-up.

Then s can be converted into a typical coda object by

s <- mcmc.list(mcmc(s[,1,-4]), mcmc(s[,2,-4]), mcmc(s[,3,-4]), mcmc(s[,4,-4]))
# Or a cleaner way thanks to Sebastian Weber:
s <- do.call(mcmc.list, alply(s[,,-4], 2, mcmc))
# Or even a more general way, thanks to Forrest Stevens
s <- do.call(mcmc.list, alply(s[,,-(length(s[1,1,]))]], 2, mcmc))

Finally the mcmc list can be easily transformed in an object that ggmcmc understands using a call to ggs().

S <- ggs(s)

Now the ggmcmc functions can be called using S as the argument: ggs_histogram, ggs_traceplot, ggs_density, ggs_compare_partial, ggs_running, ggs_autocorrelation, ggs_crosscorrelation, ggs_caterpillar

Or generating a single pdf file with all the output:

ggmcmc(S)

A traceplot (or any of the other plots that ggmcmc provides) of the samples can be easily obtained with

ggs_traceplot(S)

traceplot-ggmcmc.png

Running Stan alone

In order to run Stan without using the rstan function the model and the data must be written to disk

write(md, file="model.stan")
dump(c("N", "X", "y"), file="data.Rdump")

Then the usual steps required to run Stan follow, namely the conversion of stan code to C++ code, the compilation of the code, and the execution of the generated binary against the data. Please refer to the manual for the execution in your specific platform. In my Gentoo GNU/Linux system with Stan installed in /usr/local/stan/ I simply run the following in the command line

stanc --name=model --o=model.cpp model.stan
g++ -march=native -O3 -L/usr/local/stan/bin \
  -I /usr/local/stan/src \
  -I /usr/local/stan/lib/boost_1.51.0/ \
  -I /usr/local/stan/lib/eigen_3.1.1 \
  model.cpp -o model -lstan

To generate two chains:

./model --data=data.Rdump --iter=50000 --thin=1 --refresh=10000 \
  --samples=s_1.csv --chain_id=1
./model --data=data.Rdump --iter=50000 --thin=1 --refresh=10000 \
  --samples=s_2.csv --chain_id=2

The chains can be imported easily as a table and transformed into an mcmc object with the following lines:

library(coda)
s1 <- read.table("s_1.csv", sep=",", header=TRUE)
s2 <- read.table("s_2.csv", sep=",", header=TRUE)
s.mcmc <- mcmc.list(mcmc(s1[,-c(1:3)]), mcmc(s2[,-c(1:3)]))

The -c(1:3) avoids loading Stan internal variables.

Finally, the s.mcmc object can also be converted into a ggmcmc object:

library(ggmcmc)
S <- ggs(s.mcmc)

Is Stan fast?

My experience with Stan with respect to execution times is mixed, with some examples giving valid samples in short period of time and others taking much more than JAGS. I will do more extensive benchmarks and post the results here.

This entry was posted in Bayes and tagged , , , . Bookmark the permalink.

3 Responses to "Processing Stan samples with ggmcmc"

Leave a Reply

Your email address will not be published. Required fields are marked *


*