I have tried several times to use the read.spss.format() function from the memisc package, maintained by Martin Elff, but this function has problems with the CIS files.
My approach is to use another layer of software: PSPP. Yes, it is another complication, but PSPP works fine from the command line, which makes things easier.
In addition to that, I also tend to recode files into UTF-8 from the very beginning, so I work from the original files transformed from ISO-8859-1 to UTF-8. The tool is recode.
So this is what I do in order to introduce CIS datasets into R.
I will use survey 2384, the post-electoral survey for the 2000 general elections.
recode latin1..utf8 ES2384
pspp ES2384
Then the old ES2384 text file has been converted into a SAV file, so within R it can be read using read.spss():

library(foreign)
d <- read.spss("ES2384", to.data.frame=TRUE)
So the elections have produced a new Catalan government with fewer seats for the previously winning party (CiU), but substantially more seats for the two left-wing parties that also support a referendum on Catalan sovereignty (ERC and ICV). The difference in the interpretation of the results between the Spanish and the international press is astonishing, with the Catalan press more aligned with the international one. See "Divisive Election in Spain's Catalonia Gives Win to Separatist Parties" at the New York Times, or this editorial from the Financial Times, "Barcelona's draw", which emphasizes that two thirds of the parliament supports a referendum.
The following figure shows the results of the elections in vote share for each party (black dot and discontinuous black line), along with the prediction of the pooling the polls model and its uncertainty, as well as the point predictions and errors associated with each of the individual polls.
It is worth mentioning that, except for CiU, all parties have at least one poll that predicts their actual result within the error margin.
The model is also capable of capturing the tendency of the polls.
As for how well the pooling-the-polls model behaves, I have made some comparisons using two different measures of the precision of the polls:
MND: Mean of the normalized differences (better known as the Mean Absolute Percentage Error). The idea is that the real result of a party in the election is compared to the predicted value in the poll, and a percentage difference is computed. Then a mean is taken over all the parties in the poll. The result can therefore be interpreted as the mean percentage difference between the results and the predictions: a value of MND=0.2 means that the predictions of the poll diverge, on average, by 20 percent from the results. The MND is appropriate when some polls do not have predictions for all parties, since the missing values do not bias the results. However, the MND gives the same weight to four differences of 10% as to a single difference of 40% combined with three exact predictions.
SQD: Sum of the squares of the differences. Here the idea is to penalize larger differences in the predictions: the difference between the prediction and the actual value is squared, all those squared differences are added up, and the square root is taken. The SQD is less appropriate when there are parties without a prediction, because it implicitly assumes that their difference is zero. But it is a more fine-grained tool to measure large divergences between the predicted value and the result, because it penalizes big differences in the distribution of support.
The function in R to compute those values is:
mnd <- function(pred, res) mean(abs(1-(pred/res)), na.rm=TRUE)
sqd <- function(pred, res) sqrt(sum(((res-pred)^2), na.rm=TRUE))
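As a quick illustration with invented numbers (predicted and actual vote shares for three hypothetical parties):

```r
mnd <- function(pred, res) mean(abs(1-(pred/res)), na.rm=TRUE)
sqd <- function(pred, res) sqrt(sum(((res-pred)^2), na.rm=TRUE))

# Invented example: a poll gets one party right and misses two
pred <- c(0.35, 0.25, 0.20)   # predicted shares
res  <- c(0.40, 0.25, 0.25)   # actual results
mnd(pred, res)   # about 0.108: predictions diverge by ~11% on average
sqd(pred, res)   # about 0.071: the two misses of 0.05 dominate
```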
The following figure shows the values of the mean of the normalized differences for all polls with a fieldwork in the last two months before the elections, as well as the relationships between the MND and the sample size of the poll and its date of fieldwork.
The figure shows that, by this measure, the model does not do a very good job, or at least only a slightly better one than a simple average of the polls. This is because no penalization is applied to larger differences.
When the sum of the squares of the differences is used, the results change substantially. The model is, by far, the best prediction. This is due to the penalization of larger differences. So the model seems to be the best overall compromise across all parties.
It is also worth mentioning that the two largest surveys (by the CIS and the CEO, the official survey bodies of Spain and Catalonia, respectively) perform very differently. The CIS does a very good job (as is traditional), while the CEO performs really badly. It is worth noting that this is only the third time that the CEO has produced an estimation of the final vote share, so the "cooking" of the raw results still has to improve substantially at the CEO.
It is also important to notice that as the date of the fieldwork approaches the election date, the surveys tend to be more accurate. So, once again, it does not seem very reasonable to have a law that forbids publishing survey results during the week before election day.
The performance of the translation of vote shares into seats has proved to be quite good. So my fears and doubts about it have diminished, but not faded away completely.
The combination of the MND and the SQD suggests that the model performs quite well for big parties and less well for small parties, and also that the model is an evident improvement over a simple plain mean.
Let me emphasize this point again: pooling the polls does not do wizardry with the predictions. It is only a way to take a sophisticated mean of the polls, and all its virtues rely on the quality of the sources: the polls themselves. If no poll had indicated evidence of less support for the winning party, the sophistication of the model could not have compensated for it.
The electoral forecast, updated to the last moment, predicts the following percentages of vote share:
It differs from the last prediction in two ways. First, it has more uncertainty, because there have been no polls in the last week. And second, it shows the effect of the tendency of the campaign, computed using days -20 to -10 before voting day.
But many things can go wrong. Although the model did quite well two years ago for the same elections and a year ago for the Spanish elections (see one of the first posts of the blog), I have some doubts about its performance for today.
First, the volatility of support can be higher this time, due to the repositioning of parties in the political space and the fact that the agenda has centered almost entirely on the referendum on becoming a new European state, rather than on the budget constraints that have been addressed by cutting social services.
Second, the percentage of citizens who did not know whom to vote for was up to a third a couple of weeks before the election. Polling houses have had a hard time guessing the distribution of this uncertainty, and again, it does not seem very likely that learning from the past will help much this time.
Finally, what scares me most is the translation of vote shares into seats. Let me emphasize again that this translation is based on two strong assumptions: equal participation and equal distribution of the vote by parties across districts. The evidence right now, coming from postal voting so far, indicates that participation will increase, which may have an important effect in some districts with regard to the entry of small parties. And as for the distribution of the vote by parties, I am not absolutely sure that such a strong restriction can work, but it is what we have right now.
I have to admit, however, that the translation into seats gives a total of 134 (out of 135 seats) when the highest-probability number of seats for each party is taken, which is a relief, given the way the translation has been done.
But I’m sure that many, many more things can be different and ruin the electoral forecasting. I will talk about it tomorrow.
Although the accuracy of the forecast in previous elections was quite high, especially for the parties with more than 10% of the support, there are still two areas in which the model can fail terribly. First, small parties: there is a lot of noise in the estimation of their support. And second, the translation of the support received into final seats in the chamber. The algorithm has not been tested enough, and it relies on two assumptions that can make it fail miserably (namely, that turnout stays the same and that the distribution of support by districts also stays the same). So I am scared about it.
All in all, the most likely scenario from the forecast is a maintenance of the support for the currently governing party, CiU (36.5% against 38.43% in 2010), with a similar number of seats. The real unknown, however, is who will lead the opposition. The silver medal is quite disputed, with PSC, PP and ERC fighting tightly for it (although coming from very different initial positions), the green party (ICV) slightly improving its results as well, and the two small parties with very different forecasts (SI disappearing from the chamber and Cs doubling its representation).
That scenario would imply that the right keeps the majority. The topic for the next two years, however, will likely be the referendum for secession, in which case the parties that support such referendum have a very strong majority.
R allows drawing maps in a quick and easy way, also integrating the visualization of data with the process of data management and model interpretation.
rworldmap is the package that allows both joining country data with the shapefiles of each country and plotting the desired variables.
Here is a simple example using a small invented dataset:

library(rworldmap)
d <- data.frame(
  country=c("Greece", "France", "Germany", "Italy", "Norway"),
  value=c(-2, -1, 0, 1, 2))
The data must be converted into an object that rworldmap understands:
n <- joinCountryData2Map(d, joinCode="NAME", nameJoinColumn="country")
where joinCode specifies how countries are matched (by name, by their ISO code, or in other ways), and nameJoinColumn is the name of the column that contains the country names in the d object.
The call to joinCountryData2Map gives some information about the success or failure of the matching procedure, and we can then correct the names of the countries that fail to match. Sometimes this can be tedious, so it is always convenient to work with country names that are as standardized as possible.
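For instance, if the matching report complained about a country name (the "UK" below is just a hypothetical example of a mismatch), we could standardize it before joining again:

```r
# Hypothetical data with a country name that would fail to match
d2 <- data.frame(country=c("UK", "France"), value=c(1, 2),
                 stringsAsFactors=FALSE)
# Rename to the standard English name and re-run the join
d2$country[d2$country == "UK"] <- "United Kingdom"
# n <- joinCountryData2Map(d2, joinCode="NAME", nameJoinColumn="country")
```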
Then a world map with the default colours (a heat scale) can easily be produced by the following code, where the argument nameColumnToPlot is the name of the variable that we want to plot:
mapCountryData(n, nameColumnToPlot="value", mapTitle="World")
We can then select areas of the world and use different colour palettes:

mapCountryData(n, nameColumnToPlot="value", mapTitle="Eurasia",
               mapRegion="Eurasia", colourPalette="terrain")
Or set the limits by using latitude and longitude, assign colours to the ocean and to missing values, and use a black and white palette.
mapCountryData(n, nameColumnToPlot="value", mapTitle="Europe, manual",
               xlim=c(-10, 40), ylim=c(35, 70),
               colourPalette="black2White", addLegend=TRUE,
               oceanCol="lightblue", missingCountryCol="yellow")
First, local elections in Belgium have been interpreted by the winning party in the Flemish part in regional terms, aiming for the evaporation of Belgium. Second, Cameron and Salmond have signed an agreement by which the United Kingdom and Scotland agree on the question to be asked in the Scottish referendum in 2014. Finally, the Catalan prime minister, in a calculated electoral move, has also mentioned the question that he would like to ask in a referendum in Catalonia if he wins the next Catalan elections, to be held on November 25th.
Precisely these Catalan elections are the object of prediction using the "pooling the polls" method. As I did for the Spanish elections in 2011, you can consult all the predictions at forecasting Catalan elections 2012 and subscribe to the blog via RSS in order to get the latest updates.
In order to make Stan and ggmcmc work together there are two possibilities, depending on how Stan is being used: either inside R with rstan, or as a standalone program.
In R we can easily create a fake dataset to play with.
N <- 500
X <- runif(N)
y <- 3 + 5 * X + rnorm(N)
This creates 500 data points from a plain simple linear model with intercept = 3, slope = 5, and standard deviation = 1 (the default for rnorm()).
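Before passing the data to Stan, a quick frequentist check with lm() should recover the parameters approximately (the seed here is only for reproducibility; exact estimates vary with the simulation):

```r
# Simulate the same linear model and check the parameters with OLS
set.seed(42)
N <- 500
X <- runif(N)
y <- 3 + 5 * X + rnorm(N)
coef(lm(y ~ X))   # intercept and slope should be close to 3 and 5
```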
Data can be easily passed to Stan via a list:
D <- list(N=N, X=X, y=y)
The model declaration is stored simply as an R character object:

md <- '
data {
  int<lower=0> N;
  real X[N];
  real y[N];
}
parameters {
  real theta[2];
  real<lower=0> sigma;
}
model {
  for (n in 1:N) {
    y[n] ~ normal(theta[1] + theta[2] * X[n], sigma);
  }
}
'
There are three parts in this declaration. The first one (data) declares the data that will be used. The type of the variables can (in fact, must) be specified. In this case N is an integer with a minimum value of 0, since it will be used to loop across observations, while X and y are declared as real numbers.
The second part (parameters) declares the parameters to be sampled, along with any constraints on them (here, sigma is restricted to be non-negative).
The third part (model) is really the core of the model specification: it declares the relationships between data and parameters. Prior specifications of the parameters can also be included here. In this case y is distributed normally, with mean equal to an arithmetic combination of the intercept, the slope and the data, and standard deviation equal to sigma. Notice that, contrary to JAGS and *Bugs, there is no need to specify the dispersion parameter in terms of its precision (the inverse of the variance).
Now, with the data and the model ready, we can run Stan inside R:

library(rstan)
m <- stan(model_code=md, data=D)
By default, the stan function call runs 4 chains of length 2000, with a warm-up interval of half the length (hence, 1000 iterations). Notice that Stan refers to a "warm-up" instead of the "burn-in" period used in JAGS and *Bugs.
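As a minimal sketch of how these defaults map onto explicit arguments of stan() (the sampling call itself is commented out, since it assumes the md and D objects above and a working Stan installation):

```r
# The default sampling settings of stan(), made explicit
stan_args <- list(chains = 4,     # number of chains
                  iter   = 2000,  # total iterations per chain
                  warmup = 1000)  # warmup defaults to floor(iter/2)
# One way to pass them along with the model and the data:
# m <- do.call(stan, c(list(model_code=md, data=D), stan_args))
```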
In order to access the samples we can use the extract method, discarding the warm-up period.
s <- extract(m, permuted=FALSE, inc_warmup=FALSE)
s is now an array with dimensions [1:n_iter, 1:n_chains, 1:n_parameters], without the warm-up samples.
Then s can be converted into a typical coda object by:

library(coda)
library(plyr)   # provides alply()
s <- mcmc.list(mcmc(s[,1,-4]), mcmc(s[,2,-4]),
               mcmc(s[,3,-4]), mcmc(s[,4,-4]))
# Or a cleaner way, thanks to Sebastian Weber:
s <- do.call(mcmc.list, alply(s[,,-4], 2, mcmc))
# Or an even more general way, thanks to Forrest Stevens:
s <- do.call(mcmc.list, alply(s[,,-(length(s[1,1,]))], 2, mcmc))
Finally, the mcmc list can easily be transformed into an object that ggmcmc understands using a call to ggs():

S <- ggs(s)
Now the ggmcmc functions can be called using S as their argument: ggs_histogram(), ggs_traceplot(), ggs_density(), ggs_compare_partial(), ggs_running(), ggs_autocorrelation(), ggs_crosscorrelation(), ggs_caterpillar().
Or generating a single pdf file with all the output:
ggmcmc(S)
A traceplot (or any of the other plots that ggmcmc provides) of the samples can be easily obtained with
ggs_traceplot(S)
In order to run Stan as a standalone program, without rstan, the model and the data must be written to disk:
write(md, file="model.stan")
dump(c("N", "X", "y"), file="data.Rdump")
Then the usual steps required to run Stan follow: the conversion of the Stan code into C++ code, the compilation of that code, and the execution of the generated binary against the data. Please refer to the manual for the execution on your specific platform. On my Gentoo GNU/Linux system, with Stan installed in /usr/local/stan/, I simply run the following in the command line:
stanc --name=model --o=model.cpp model.stan
g++ -march=native -O3 -L/usr/local/stan/bin \
  -I /usr/local/stan/src \
  -I /usr/local/stan/lib/boost_1.51.0/ \
  -I /usr/local/stan/lib/eigen_3.1.1 \
  model.cpp -o model -lstan
To generate two chains:
./model --data=data.Rdump --iter=50000 --thin=1 --refresh=10000 \
  --samples=s_1.csv --chain_id=1
./model --data=data.Rdump --iter=50000 --thin=1 --refresh=10000 \
  --samples=s_2.csv --chain_id=2
The chains can be imported easily as a table and transformed into an mcmc object with the following lines:
library(coda)
s1 <- read.table("s_1.csv", sep=",", header=TRUE)
s2 <- read.table("s_2.csv", sep=",", header=TRUE)
s.mcmc <- mcmc.list(mcmc(s1[,-c(1:3)]), mcmc(s2[,-c(1:3)]))
The -c(1:3) avoids loading Stan's internal variables.
Finally, the s.mcmc object can also be converted into a ggmcmc object:
library(ggmcmc) S <- ggs(s.mcmc)
My experience with Stan with respect to execution times is mixed, with some examples giving valid samples in a short period of time and others taking much longer than JAGS. I will do more extensive benchmarks and post the results here.
The classical approximation, which dates back to a seminal 1970s article by Michael Wallace and David Singer titled "International Governmental Organization in the Global System, 1815-1964", is to consider an organization to be an International Governmental Organization if it has the following three characteristics:
This excludes treaties between states and legal agreements, and all the , but it also excludes “objects” such as the G20, which is usually regarded as a leading institution in the international arena.
The following figure shows the number of IGOs that were formed by many members (high values on the x-axis) or few (low values on the x-axis), at 4 different points in time: 1870, 1930, 1970 and 2005. The data is based on the "full members" of IGOs according to the latest version (2.3) of the International Governmental Organizations database, maintained by Timothy Nordstrom (University of Mississippi), Jon Pevehouse (University of Wisconsin) and Megan Shannon (University of Mississippi), hosted at the Correlates of War Program and described in an article in Conflict Management and Peace Science.
A few points are important in this figure:
Not necessarily. But it may be that the seeds of global governance mechanisms are already in the international arena. Countries have generated a myriad of institutions to try to solve sometimes very specific problems, and we know that the tendency is to have more and more countries joining those arrangements. So it is not very strange to think of those solutions as a rudimentary set of tools ready to be used to arrive at deeper agreements between societies.
The following code imports the IGO COW dataset, does some basic processing and arranges the data in a way that is easy to manage. The code that generates the 2005 snapshot is also provided. The process to generate the video is the same as in a previous post.
require(ggplot2)
require(reshape2)   # for melt()
require(gridExtra)  # for grid.arrange()

# Read data
d <- read.table("IGO_stateunit_v2.3.csv", head=TRUE, sep=",")
igo <- d[,-c(1, 3)]

# Generate status factors
migo <- melt(igo, id=c("country", "year"))
names(migo)[3:4] <- c("igo", "status")
table(migo$status)
migo$status <- factor(migo$status,
  levels=c(0, 1, 2, 3, -9, -1),
  labels=c("no membership", "full membership", "associate membership",
           "observer", "NA", "IGO not in existence"))
migo <- cbind(migo, fm=ifelse(migo$status=="full membership", 1, 0))

# Full membership
migo.fm <- tapply(migo$fm, list(migo$year, migo$igo), sum)
migo.fm <- melt(migo.fm)
names(migo.fm) <- c("year", "igo", "fm")
migo.fm <- subset(migo.fm, fm>0)  # only IGOs with at least some members

# Do the plot
t <- subset(migo.fm, year=="1880")
p1 <- ggplot(t, aes(x=fm)) + xlim(0, 200) + ylim(0, 70) +
  geom_histogram(binwidth=5) +
  xlab("Number of full membership countries") +
  ylab("Number of IGO") + opts(title="1880")
t <- subset(migo.fm, year=="1930")
p2 <- ggplot(t, aes(x=fm)) + xlim(0, 200) + ylim(0, 70) +
  geom_histogram(binwidth=5) +
  xlab("Number of full membership countries") +
  ylab("Number of IGO") + opts(title="1930")
t <- subset(migo.fm, year=="1970")
p3 <- ggplot(t, aes(x=fm)) + xlim(0, 200) + ylim(0, 70) +
  geom_histogram(binwidth=5) +
  xlab("Number of full membership countries") +
  ylab("Number of IGO") + opts(title="1970")
t <- subset(migo.fm, year=="2005")
p4 <- ggplot(t, aes(x=fm)) + xlim(0, 200) + ylim(0, 70) +
  geom_histogram(binwidth=5) +
  xlab("Number of full membership countries") +
  ylab("Number of IGO") + opts(title="2005")
grid.arrange(p1, p2, p3, p4, ncol=2, nrow=2)
The program is licensed under a Free Software license (GPLv3), which means that you have the right to use, study and modify it. The only restriction is that if you distribute modified versions you have to share them under the same license and make the source code available.
The program is written in Python, so a working Python installation is required. Python is also Free Software. In addition to being free as in freedom, it is also free as in "free beer", so you can download it at no cost. It runs on many different platforms and there are pre-packaged versions for the most used operating systems at Get Python. In GNU/Linux I recommend using the Python version that comes packaged with your distribution.
In order to get a spreadsheet with the results, the script also uses a non-standard python library called xlwt. Python libraries (modules) can be easily installed. Consult the guide Installing Python Modules for further assistance.
You have to copy the "process_nici.py" script at the top of the directory hierarchy that you want to analyze. Then you run the program from the command line with one argument, the name of the files to be processed (in this example, quan1.txt):
$ python process_nici.py quan1.txt
The program will print the status of the operations and it may return an error message if it detects something wrong during the process.
The program returns two files: out.csv and compounds.xls.
d <- read.table("out.csv", sep=";", header=TRUE)
The concentrations below the calibration limit are assigned the value -1, and the non-detected concentrations are assigned the value -2. This is useful for later processing of the signals.
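These flags can then be recoded before the analysis; a minimal sketch, where conc is a hypothetical column name standing in for one of the compounds in out.csv:

```r
# Stand-in for a concentration column imported from out.csv
dd <- data.frame(conc=c(5.2, -1, 0.8, -2))
# Keep the detection status as a factor, then blank out the flag values
dd$status <- factor(ifelse(dd$conc == -1, "below calibration",
                    ifelse(dd$conc == -2, "not detected", "quantified")))
dd$conc[dd$conc < 0] <- NA
dd
```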