Five days to start the Barcelona Workshop on Global Governance

After several months of co-organizing, it will finally become a reality next week. The Barcelona Workshop on Global Governance will take place, with a program that includes leading academic lectures (Barbara Koremenos, George Tsebelis and Jonathan Koppell), practitioners, and a number of promising papers.

January 14, at IBEI
BCN_Workshop_Global_Governance_Program-0.png

January 15, at ESADE
BCN_Workshop_Global_Governance_Program-1.png

Posted in Governance | Tagged , | Leave a comment

Reading data from CIS into R

The Spanish Center for Sociological Research (CIS, Centro de Investigaciones Sociológicas) provides its data to researchers in SAS and SPSS formats. I will not enter today into the discussion about the need to provide publicly funded data in open formats, but will only cover the technical workaround that R users need in order to import the datasets easily.

I have tried several times to use the read.spss.format() function from the package memisc, maintained by Martin Elff, but this function has problems with the CIS files.

My approach is to use another layer of software: PSPP. Yes, it is one more complication, but PSPP works fine from the command line, which makes things easier.

In addition to that, I also tend to recode files into UTF-8 from the very beginning, so I work from the original files transformed from ISO-8859-1 to UTF-8. The tool for this is recode.

So this is what I do in order to introduce CIS datasets into R.

I will use survey 2384, the post-electoral survey for the 2000 general elections.

Recode the original file

recode latin1..utf8 ES2384

Run PSPP to generate an SPSS SAV file

pspp ES2384

Load the SAV file from R

PSPP turns the old ES2384 text file into a SAV file, which can then be read from R using read.spss().

library(foreign)
d <- read.spss("ES2384", to.data.frame=TRUE)

Posted in Tips | Tagged , , | Leave a comment

Catalan elections 2012: evaluating electoral forecasting

I have been tempted to use “The magnitude of the tragedy” as a title for this post, quoting Quim Monzó, a well-known Catalan writer. Mostly because the general perception is that nobody predicted these results.

So the elections have produced a new Catalan government with fewer seats for the previously winning party (CiU), but substantially more seats for the two left-wing parties that also support a referendum on Catalan sovereignty (ERC and ICV). The difference in the interpretation of the results between the Spanish and the international press is astonishing, with the Catalan press more aligned with the international interpretation. See Divisive Election in Spain’s Catalonia Gives Win to Separatist Parties at the New York Times, or the editorial from the Financial Times, Barcelona’s draw, which emphasizes that two thirds of the parliament supports a referendum.

Results

The following figure shows the results of the elections in vote share for each party (black dot and discontinuous black line), along with the prediction of the pooling the polls model and its uncertainty, as well as the point predictions and errors associated with each of the individual polls.

gr-partits-separats-periode_rellevant-resultats.png

It is worth mentioning that, except for CiU, every party has at least one poll that predicts its actual result within the error margin. The model also captures the tendency of the polls.

An evaluation of the behaviour of the “pooling the polls” model

As for how well the pooling-the-polls model behaves, I have made some comparisons using two different measures of poll precision:

  • MND: Mean of the normalized differences (better known as the Mean Absolute Percentage Error). The real result of a party in the election is compared with the value predicted by the poll, and a percentage difference is computed; the mean is then taken over all the parties in the poll. The result can therefore be interpreted as the mean percentage difference between results and predictions: a value of MND=0.2 means that the predictions of the poll diverge, on average, by 20 percent from the results. The MND is appropriate when some polls do not carry predictions for all parties, since the missing values do not bias the result. However, the MND gives equal weight to four differences of 10% and to a single difference of 40% combined with three exact predictions.

  • SQD: Sum of the squares of the differences. Here the idea is to penalize larger prediction errors: the difference between the prediction and the actual value is squared, the squared differences are summed over the poll, and the square root is taken. The SQD is less appropriate when some parties have no prediction, because it then assumes the difference is zero. But it is a more fine-grained tool for measuring big divergences between the predicted values and the results, because it penalizes large differences in the distribution of support.

The functions in R to compute those values are:

mnd <- function(pred, res) mean(abs(1-(pred/res)), na.rm=TRUE)
sqd <- function(pred, res) sqrt(sum(((res-pred)^2), na.rm=TRUE))
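
As an illustration of the difference between the two measures, here is a small worked example (a Python sketch with invented numbers, not data from any real poll): two hypothetical polls with the same MND can have very different SQD, because SQD punishes one large miss more than several small ones.

```python
def mnd(pred, res):
    # Mean of the normalized (absolute percentage) differences
    return sum(abs(1 - p / r) for p, r in zip(pred, res)) / len(res)

def sqd(pred, res):
    # Square root of the sum of squared differences
    return sum((r - p) ** 2 for r, p in zip(res, pred)) ** 0.5

res = [40.0, 20.0, 20.0, 20.0]   # invented "true" vote shares

even = [36.0, 18.0, 18.0, 18.0]  # every prediction off by 10%
lump = [40.0, 20.0, 20.0, 12.0]  # three exact, one off by 40%

print(mnd(even, res), mnd(lump, res))  # both approximately 0.10
print(sqd(even, res), sqd(lump, res))  # SQD differs: ~5.29 vs 8.0
```

Both hypothetical polls get MND=0.1, but the poll with one large miss is clearly worse under SQD.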

The following figure shows the values of the mean of the normalized differences for all polls with a fieldwork in the last two months before the elections, as well as the relationships between the MND and the sample size of the poll and its date of fieldwork.

fig-mnd.png

The figure shows that the model does not do a very good job here, or at least is only slightly better than a simple average of the polls. This is because the MND applies no penalty to larger differences.

When the sum of the squares of the differences is used, the results change substantially. The model is, by far, the best prediction, due to the penalization of larger differences. The model thus seems to be the best overall compromise across parties.

fig-sqd.png

It is also worth mentioning that the two largest surveys (by the CIS and the CEO, the official survey bodies of Spain and Catalonia, respectively) perform very differently. The CIS does a very good job (as is traditional), while the CEO performs really badly. It is worth noting that this is only the third time the CEO has produced an estimation of the final vote share, so the “cooking” of the raw results still has to improve substantially at the CEO.
It is also important to notice that surveys tend to be more accurate as the fieldwork date approaches election day. So, once again, the law that forbids publishing survey results during the week before election day does not seem very reasonable.

Translation from vote share to seats

The translation of vote share into seats has performed quite well, so my fears and doubts about it have diminished, although they have not faded away completely.

To sum up

The combination of the MND and the SQD suggests that the model performs quite well for big parties and less well for small parties, and that it is an evident improvement over a plain mean.

Let me emphasize this point again: pooling the polls does not do wizardry with the predictions. It is only a way to take a sophisticated mean of the polls. But obviously all its virtues rely on the quality of its sources: the polls. If no poll indicated evidence of lower support for the winning party, the sophistication of the model cannot compensate for that.
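
The “sophisticated mean” idea can be illustrated with a minimal sketch (in Python). This is not the actual pooling-the-polls model, just the simplest possible pooling: an inverse-variance weighted average of hypothetical polls, in which larger polls get proportionally more weight.

```python
def pool(estimates, sample_sizes):
    # Inverse-variance weighting for a proportion: var ~ p*(1-p)/n,
    # so a poll's weight grows with its sample size.
    weights = [1 / (p * (1 - p) / n) for p, n in zip(estimates, sample_sizes)]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, estimates)) / total

# Three hypothetical polls for one party (vote shares in [0, 1])
shares = [0.36, 0.40, 0.38]
sizes = [800, 1200, 2500]
print(round(pool(shares, sizes), 4))
```

The pooled estimate lands between the individual polls, pulled towards the largest ones; the real model additionally tracks the trend of the campaign over time.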

Posted in Uncategorized | Tagged , , | 4 Comments

Electoral forecasting for the 2012 Catalan elections: what can go wrong

So finally election day has come. Let me stress again the historical importance of today’s elections, which can be noticed in the interest of the international press (The New York Times, The Guardian, Le Monde, The Wall Street Journal, The Times, to name but a few).

The electoral forecast, updated to the last moment, predicts the following vote shares:

gr-prediccio_avui-caterpillar.png

It differs from the last prediction in two ways. First, it has more uncertainty, because there have been no polls in the last week. And second, it shows the effect of the campaign tendency, computed using days -20 to -10 before voting day.

But many things can go wrong. Although the model did quite well two years ago for the same elections and a year ago for the Spanish elections (see one of the first posts of the blog), I have some doubts about its performance today.

First, the volatility of support may be higher this time, due to the repositioning of parties in the political space and the fact that the agenda has centered more on the referendum for becoming a new European state than on the budget constraints that have been addressed by cutting social services.

Second, the percentage of citizens who did not know whom to vote for was up to a third a couple of weeks before the election. Polling houses have had a hard time guessing the distribution of this uncertainty, and again it does not seem very likely that learning from the past will help much this time.

Finally, what scares me most is the translation of vote shares into seats. Let me emphasize again that the prediction of seats is based on two strong assumptions: equal participation and an equal distribution of each party’s vote across districts. The evidence so far, coming from postal voting, indicates that participation will increase, which may have an important effect in some districts with regard to the entry of small parties. And as for the distribution of the vote by parties, I am not absolutely sure that such a strong restriction can hold, but it is what we have right now.
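
The post does not spell out the allocation algorithm itself; for reference, Catalan elections assign each district’s seats with the d’Hondt rule, which a few lines of Python can sketch (party names and vote counts are invented):

```python
def dhondt(votes, seats):
    # d'Hondt: repeatedly award the next seat to the party with the
    # largest quotient votes / (seats_won + 1).
    won = {party: 0 for party in votes}
    for _ in range(seats):
        best = max(votes, key=lambda p: votes[p] / (won[p] + 1))
        won[best] += 1
    return won

# A toy 5-seat district
print(dhondt({"A": 100000, "B": 80000, "C": 30000}, 5))
# -> {'A': 3, 'B': 2, 'C': 0}
```

Because the rule is applied district by district, small errors in the assumed territorial distribution of the vote can shift the last seat in several districts at once, which is exactly where the two assumptions above bite.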

gr-prediccio_avui-escons.png

I have to admit, however, that the translation into seats gives a total of 134 seats (out of 135) when the most probable number of seats per party is taken, which is a relief given how the translation has been done.

But I’m sure that many, many more things can be different and ruin the electoral forecasting. I will talk about it tomorrow.

Posted in Elections | Tagged , , | Leave a comment

Final electoral forecast for the 2012 Catalan elections

After yesterday’s last five polls we finally have the final forecast for the Catalan elections. Again, it is worth stressing the historical importance of these elections at the Catalan, Spanish and international levels.

Although the accuracy of the forecast in previous elections was quite high, especially for parties with more than 10% of the support, there are still two areas in which the model can fail terribly. First, small parties: there is a lot of noise in the estimation of their support. And second, the translation of the support received into final seats in the chamber. The algorithm has not been tested enough, and it relies on two assumptions that can make it fail miserably (namely, that turnout stays the same and that the distribution of support across districts also stays the same). So I am scared about it.

gr-prediccio_avui-caterpillar.png

All in all, the most likely scenario from the forecast is that support for the currently governing party, CiU, holds (36.5% against 38.43% in 2010), with a similar number of seats. The real unknown, however, is who will lead the opposition. The silver medal is quite disputed, with PSC, PP and ERC fighting tightly for it (although coming from very different starting positions). The green party (ICV) slightly improves its results as well, and the two small parties have very different forecasts (SI disappearing from the chamber and Cs doubling its representation).

This scenario would imply that the right keeps its majority. The main topic for the next two years, however, will likely be the referendum on secession, for which the parties that support such a referendum hold a very strong majority.

See the forecasting for the Catalan elections 2012.

Posted in Elections | Tagged , , | 7 Comments

Using R to draw maps on country data

When doing research in comparative politics, international relations or world governance, it is often convenient to have an idea of the spatial distribution of certain country characteristics. Being able to draw maps is therefore a useful technique that any researcher or practitioner should have in their toolbox.

R allows drawing maps in a quick and easy way, integrating also the visualization of data with the process of data management and model interpretation.

rworldmap is the library that both joins country data to each country’s shapefile and plots the desired variables.

Here is a simple example using a small invented dataset:

library(rworldmap)
d <- data.frame(
  country=c("Greece", "France", "Germany", "Italy", "Norway"),
  value=c(-2, -1, 0, 1, 2))

The data must be converted into an object that rworldmap understands:

n <- joinCountryData2Map(d, joinCode="NAME", nameJoinColumn="country")

where joinCode specifies whether to match countries by name, by their ISO code or by other identifiers; nameJoinColumn is the name of the column in the ‘d’ object that contains the country names.

The call to joinCountryData2Map reports the success or failure of the matching procedure, and we can then correct the country names that fail to match. Sometimes this can be tedious, so it is always convenient to work with country names that are as standardized as possible.

Then a world map with the default colours (heat scale) can easily be produced by the following code, where the argument nameColumnToPlot is the name of the variable that we want to plot:

mapCountryData(n, nameColumnToPlot="value", mapTitle="World")

map-world.png

We can then select areas of the world and use different colours:

mapCountryData(n, nameColumnToPlot="value", mapTitle="Eurasia",
  mapRegion="Eurasia", colourPalette="terrain")

map-eurasia.png

Or set the limits by using latitude and longitude, assign colours to the ocean and to missing values, and use a black and white palette.

mapCountryData(n, nameColumnToPlot="value", mapTitle="Europe, manual",
  xlim=c(-10, 40), ylim=c(35, 70),
  colourPalette="black2White", 
  addLegend=TRUE,
  oceanCol="lightblue", missingCountryCol="yellow")

map-europe.png

Posted in Spatial Models | 2 Comments

The EU internal enlargement: elections in Flanders, the Scottish referendum and polls for the elections in Catalonia

This weekend, three different regions in Europe have coincided in bringing up the discussion of a hypothetical EU internal enlargement by means of the most elementary tool of democracy: voting.

First, local elections in Belgium have been interpreted in regional terms by the winning party in Flanders, which aims for the evaporation of Belgium. Second, Cameron and Salmond have signed an agreement by which the United Kingdom and Scotland settle the question to be asked in the 2014 Scottish referendum. Finally, the Catalan prime minister, in a calculated electoral move, has also announced the question he would like to ask in a referendum in Catalonia if he wins the next Catalan elections, to be held on November 25th.

Precisely these Catalan elections are the object of prediction using the “pooling the polls” method. As I did for the Spanish elections in 2011, you can consult all the predictions at forecasting Catalan elections 2012 and subscribe to the blog via RSS to get the latest updates.

Posted in Elections | Leave a comment

Processing Stan samples with ggmcmc

Stan is a relative newcomer among the Bayesian tools aimed at providing samples from posteriors. But despite its novelty, it offers many interesting features for Bayesians. Among them, which will probably be discussed here in the future, the fact that the sampler is written in a really fast programming language (C++) makes many of us dream of shorter times to run our models.

In order to make Stan and ggmcmc work together there are two possibilities, depending on how Stan is being used, either inside R with rstan or as a standalone program.

Create a fake dataset

In R we can easily create a fake dataset to play with.

N <- 500
X <- runif(N)
y <- 3 + 5 * X + rnorm(N)

This creates 500 data points from a plain simple linear model with intercept=3, slope=5 and standard deviation=1 (the default for rnorm()).
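
As a sanity check on fake data like this (a sketch in Python rather than R, under the same generative assumptions), a simple least-squares fit should recover values close to the true intercept and slope before any Bayesian machinery is involved:

```python
import random

random.seed(42)
N = 500
X = [random.random() for _ in range(N)]
y = [3 + 5 * x + random.gauss(0, 1) for x in X]

# Closed-form simple linear regression
mx = sum(X) / N
my = sum(y) / N
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(X, y)) / \
        sum((xi - mx) ** 2 for xi in X)
intercept = my - slope * mx
print(intercept, slope)  # close to 3 and 5
```

If the posterior means from Stan end up far from these values, something is wrong with the model declaration rather than with the data.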

Data can be easily passed to Stan via a list:

D <- list(N=N, X=X, y=y)

The model declaration is stored simply as an R character object

md <- '
data {
  int<lower=0> N;
  real X[N];
  real y[N];
}
parameters {
  real theta[2];
  real<lower=0> sigma;
}
model {
  for (n in 1:N) {
    y[n] ~ normal(theta[1] + theta[2] * X[n], sigma);
  }
}
'

There are three parts in this declaration. The first (data) declares the data that will be used; the type of each variable can (and must) be specified. In this case N is an integer with a minimum value of 0, since it is used to loop over the observations. X and y are declared as arrays of real numbers.

The second part (parameters) declares the parameters to be sampled, together with any constraints on their range (here, sigma must be non-negative).

The third part (model) is the real core of the model specification: it declares the relationships between data and parameters. Prior specifications for the parameters can also be included here. In this case y is distributed normally, with mean equal to an arithmetic combination of the intercept, the slope and the data, and standard deviation equal to sigma. Notice that, contrary to JAGS and *Bugs, there is no need to specify the dispersion parameter in terms of its precision (the inverse of the variance).
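
For readers coming from JAGS/*Bugs, the relation between the two parameterizations is simply tau = 1/sigma^2; a two-line sketch:

```python
import math

sigma = 2.0           # standard deviation, as Stan expects
tau = 1 / sigma ** 2  # precision (inverse variance), as JAGS/BUGS expect
print(tau)            # 0.25
print(1 / math.sqrt(tau))  # back to sigma: 2.0
```

So a JAGS prior written on tau has to be transformed before being reused in Stan, and vice versa.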

Running Stan inside R with rstan

Now, with the data and the model ready, we can run Stan inside R:

library(rstan)
m <- stan(model_code=md, data=D)

By default, the stan function call runs 4 chains of length 2000, with a warm-up interval of half the length (hence, 1000). Notice that Stan refers to “warm-up” instead of “burn-in” period, used in JAGS and *Bugs.

In order to access the samples we can use the extract method, discarding the warm-up period.

s <- extract(m, inc_warmup=FALSE) 

s is now an array with dimensions [1:n_iter, 1:n_chains, 1:n_parameters], without warm-up.

Then s can be converted into a typical coda object by

s <- mcmc.list(mcmc(s[,1,-4]), mcmc(s[,2,-4]), mcmc(s[,3,-4]), mcmc(s[,4,-4]))
# Or a cleaner way thanks to Sebastian Weber (requires the plyr package):
s <- do.call(mcmc.list, alply(s[,,-4], 2, mcmc))
# Or an even more general way, thanks to Forrest Stevens:
s <- do.call(mcmc.list, alply(s[,,-(length(s[1,1,]))], 2, mcmc))

Finally, the mcmc list can easily be transformed into an object that ggmcmc understands using a call to ggs().

S <- ggs(s)

Now the ggmcmc functions can be called using S as the argument: ggs_histogram, ggs_traceplot, ggs_density, ggs_compare_partial, ggs_running, ggs_autocorrelation, ggs_crosscorrelation, ggs_caterpillar

Or generating a single pdf file with all the output:

ggmcmc(S)

A traceplot (or any of the other plots that ggmcmc provides) of the samples can be easily obtained with

ggs_traceplot(S)

traceplot-ggmcmc.png

Running Stan alone

In order to run Stan without rstan, the model and the data must be written to disk:

write(md, file="model.stan")
dump(c("N", "X", "y"), file="data.Rdump")

Then the usual steps required to run Stan follow, namely the conversion of the Stan code to C++, the compilation of that code, and the execution of the generated binary against the data. Please refer to the manual for the details on your specific platform. On my Gentoo GNU/Linux system, with Stan installed in /usr/local/stan/, I simply run the following on the command line:

stanc --name=model --o=model.cpp model.stan
g++ -march=native -O3 -L/usr/local/stan/bin \
  -I /usr/local/stan/src \
  -I /usr/local/stan/lib/boost_1.51.0/ \
  -I /usr/local/stan/lib/eigen_3.1.1 \
  model.cpp -o model -lstan

To generate two chains:

./model --data=data.Rdump --iter=50000 --thin=1 --refresh=10000 \
  --samples=s_1.csv --chain_id=1
./model --data=data.Rdump --iter=50000 --thin=1 --refresh=10000 \
  --samples=s_2.csv --chain_id=2

The chains can be imported easily as a table and transformed into an mcmc object with the following lines:

library(coda)
s1 <- read.table("s_1.csv", sep=",", header=TRUE)
s2 <- read.table("s_2.csv", sep=",", header=TRUE)
s.mcmc <- mcmc.list(mcmc(s1[,-c(1:3)]), mcmc(s2[,-c(1:3)]))

The -c(1:3) avoids loading Stan internal variables.

Finally, the s.mcmc object can also be converted into a ggmcmc object:

library(ggmcmc)
S <- ggs(s.mcmc)

Is Stan fast?

My experience with Stan regarding execution times is mixed: some examples give valid samples in a short period of time, while others take much longer than JAGS. I will run more extensive benchmarks and post the results here.

Posted in Bayes | Tagged , , , | 3 Comments

The seeds of Global Governance

International Governmental Organizations (IGOs) can be thought of as the seeds of a new mechanism of Global Governance. They represent deals between countries that try to solve diverse issues. Some are formed by a few countries, others by most of them. Some IGOs are general, others focus on very specific topics. Some only accept regional members, while others are open to the whole international community. But given all those differences, each of them serves a purpose individually. And if we consider them together, we can think of them as the seeds of a kind of new “international government”.

How to measure an IGO

The classical approach, which dates back to a seminal 1970s article by Michael Wallace and David Singer titled “International Governmental Organization in the Global System, 1815-1964”, is to consider an organization an International Governmental Organization if it has the following three characteristics:

  • membership: at least three members from the system of states
  • life: it must hold a plenary session at least every ten years
  • formality: it must have a permanent secretariat and headquarters

This excludes treaties and legal agreements between states, but it also excludes “objects” such as the G20, which is usually regarded as a leading institution in the international arena.

The evolution of IGO

The following figure shows the number of IGOs formed by many members (high values on the x-axis) or by few (low values on the x-axis), at four different points in time: 1870, 1930, 1970 and 2005. The data is based on the “full members” of IGOs according to the latest version (2.3) of the International Governmental Organizations database, maintained by Timothy Nordstrom (University of Mississippi), Jon Pevehouse (University of Wisconsin) and Megan Shannon (University of Mississippi), hosted at the Correlates of War Program and described in an article in Conflict Management and Peace Science.

A few points from this figure are important:

  • Without doubt, the number of IGOs in the international arena has increased, as shown by the growing area painted black.
  • It is also worth mentioning that the process has not been linear: it has exploded in recent years.
  • The plot also shows a somewhat bimodal distribution in 2005, suggesting that IGOs are formed either by fewer than 50 members or by more than 150; there are not many IGOs with an intermediate number of members.
  • The fact that most IGOs have fewer than 20 members also suggests that, although the seeds of global governance can be found in IGOs, the majority of them are so small (with respect to their representativeness in terms of current members of the international system) that it is hard to imagine them having a “global” impact beyond their reduced number of full members.
  • Another way to see the process is the following video, which shows the year-by-year (5-year steps until 1960) evolution of the number of full members of IGOs. In addition to the previous points, it is also interesting to visualize the impact on IGO membership of two different processes of country formation: decolonization in the 1960s, and the fall of the USSR and the creation of new countries in the 1990s (the hump around 150 moves towards 190).

Do more IGOs mean that we live in a better world?

Not necessarily. But it may well be that the seeds of global governance mechanisms are already present in the international arena. Countries have generated a myriad of institutions to try to solve sometimes very specific problems, and we know the tendency is for more and more countries to join those arrangements. So it is not very strange to think of those solutions as a rudimentary set of tools, ready to be used to arrive at deeper agreements between societies.

Code to generate the plots

The following code imports the IGO COW dataset, does some basic processing and arranges the data in a way that is easy to manage. The code that generates the snapshots is also provided. The process to generate the video is the same as in a previous post.

require(ggplot2)
require(reshape)     # for melt()
require(gridExtra)   # for grid.arrange()

# Read data
d <- read.table("IGO_stateunit_v2.3.csv", head=TRUE, sep=",")
igo <- d[,-c(1, 3)]

# Generate status factors
migo <- melt(igo, id=c("country", "year"))
names(migo)[3:4] <- c("igo", "status")
table(migo$status)
migo$status <- factor(migo$status,
  levels=c(0, 1, 2, 3, -9, -1),
  labels=c("no membership", "full membership", "associate membership",
    "observer", "NA", "IGO not in existence"))
migo <- cbind(migo, fm=ifelse(migo$status=="full membership", 1, 0))

# Full membership
migo.fm <- tapply(migo$fm, list(migo$year, migo$igo), sum)
migo.fm <- melt(migo.fm)
names(migo.fm) <- c("year", "igo", "fm")
migo.fm <- subset(migo.fm, fm>0)   # keep only IGOs with at least some members

# Do the plot
t <- subset(migo.fm, year=="1880")
p1 <- ggplot(t, aes(x=fm)) + xlim(0, 200) + ylim(0, 70) + geom_histogram(binwidth=5) +
  xlab("Number of full membership countries") + ylab("Number of IGO") + opts(title="1880")
t <- subset(migo.fm, year=="1930")
p2 <- ggplot(t, aes(x=fm)) + xlim(0, 200) + ylim(0, 70) + geom_histogram(binwidth=5) +
  xlab("Number of full membership countries") + ylab("Number of IGO") + opts(title="1930")
t <- subset(migo.fm, year=="1970")
p3 <- ggplot(t, aes(x=fm)) + xlim(0, 200) + ylim(0, 70) + geom_histogram(binwidth=5) +
  xlab("Number of full membership countries") + ylab("Number of IGO") + opts(title="1970")
t <- subset(migo.fm, year=="2005")
p4 <- ggplot(t, aes(x=fm)) + xlim(0, 200) + ylim(0, 70) + geom_histogram(binwidth=5) +
  xlab("Number of full membership countries") + ylab("Number of IGO") + opts(title="2005")
grid.arrange(p1, p2, p3, p4, ncol=2, nrow=2)

Posted in Governance | Tagged , , | Leave a comment

Processing the Quantification of Samples from a NICI

The program process_nici processes quantification files, manually integrated from a gas chromatograph coupled to a mass spectrometer operating in negative ion chemical ionization mode (GC-MS-NICI, 6980N Agilent Technologies), into meaningful output that can be exported directly to statistical packages and spreadsheets. This kind of file is generated by the ChemStation software and contains the concentration of specific compounds.

Conditions

The program is licensed under a Free Software license (GPLv3), which means that you have the right to use, study and modify it. The only restriction is that if you make modifications you have to share them under the same terms and make the source code available.

Requirements

The program is written in Python, so a working Python installation is required. Python is also Free Software; in addition to being free as in freedom, it is free as in “free beer”, so you can download it at no cost. It runs on many different platforms and there are pre-packaged versions for the most used operating systems at Get Python. On GNU/Linux I recommend using the Python version packaged with your distribution.

To get a spreadsheet with the results, the script also uses a non-standard Python library called xlwt. Python libraries (modules) can be installed easily; consult the guide Installing Python Modules for further assistance.

How to run the program

Copy the “process_nici.py” script to the top of the directory hierarchy that you want to analyze. Then run the program from the command line with one argument, the name of the files to be processed (in this example, quan1.txt).

$ python process_nici.py quan1.txt

The program will print the status of the operations and it may return an error message if it detects something wrong during the process.

Results

The program returns two files: out.csv and compounds.xls.

  • out.csv is a text file with three columns. The first column is the name of the sample as stated in the “Sample” or “Misc” field, the second column is the compound, and the third its concentration. This semicolon-separated file can be read by any spreadsheet or statistical package. In R, for example, you get a nice data frame by importing it with:

    d <- read.table("out.csv", sep=";", header=TRUE)

  • compounds.xls is a spreadsheet of concentrations where columns represent samples and rows represent compounds.

Concentrations below the calibration range are assigned the value -1, and non-detected concentrations the value -2. This is useful for later processing of the signals.
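
When processing the signals later, those sentinel codes usually have to be separated from the real concentrations. A hypothetical post-processing sketch in Python (the helper name is invented, not part of process_nici):

```python
def split_concentration(value):
    # Map the sentinel codes used by process_nici to labels,
    # keeping real concentrations as floats.
    if value == -1:
        return (None, "below calibration")
    if value == -2:
        return (None, "not detected")
    return (float(value), "ok")

readings = [12.3, -1, 0.7, -2]
print([split_concentration(v) for v in readings])
```

This way downstream statistics can treat -1 and -2 as missing or censored values instead of as genuine (and impossible) negative concentrations.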

Characteristics and Limitations

  • As of version 1.1 the program only reads the concentrations under the “Target Compounds” section; the “Internal Standards” section is not processed.
  • The program compares the “Sample” and “Misc” strings to get a name for the sample. If they do not match, it raises an error; if one of them is empty, the program uses the other as the valid sample name.

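
The rule for resolving the sample name can be sketched in a few lines of Python (a hypothetical re-implementation for illustration, not the program’s actual code):

```python
def resolve_sample_name(sample, misc):
    # Use whichever field is non-empty; if both are filled they must match.
    sample, misc = sample.strip(), misc.strip()
    if sample and misc and sample != misc:
        raise ValueError("'Sample' and 'Misc' fields do not match")
    return sample or misc

print(resolve_sample_name("S012", ""))      # -> S012
print(resolve_sample_name("", "S013"))      # -> S013
print(resolve_sample_name("S014", "S014"))  # -> S014
```
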
Posted in Uncategorized | Tagged | Leave a comment