Agency proliferation: processing the dataset of institutional characteristics of regulatory agencies - RegGov 2018
Author: Xavier Fernández-i-Marín
December 17, 2018 - 10 minutes - Regulatory agencies, Governance, Data visualization
This tutorial shows how to process the dataset of institutional characteristics of regulatory agencies in R. The dataset was introduced in the journal article “Agency proliferation and the globalization of the regulatory state: Introducing a data set on the institutional features of regulatory agencies”, by Jacint Jordana, Xavier Fernández-i-Marín and Andrea C. Bianculli, published in Regulation & Governance (December 2018).
The following R packages are needed to follow the instructions:
library(dplyr)   # For the functions involving pipes ("%>%")
library(ggplot2) # For the figures
library(tidyr)   # For arranging variables
The code shown below makes extensive use of pipes (%>%) for more flexible, simple and readable coding.
Load the dataset
The dataset can be found in long and wide formats. Recall that the data contains scores that are normalized (mean zero and standard deviation one) and correspond to the position of the institution in 2010.
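The exact normalization procedure is not described here. As a minimal sketch of what "mean zero and standard deviation one" means, assuming a plain z-score transformation and using made-up raw values (not taken from the dataset):

```r
# Hypothetical raw scores for five agencies (illustrative values only)
raw <- c(2, 4, 4, 6, 9)

# Standardize: subtract the mean, then divide by the standard deviation
z <- (raw - mean(raw)) / sd(raw)

# After the transformation the scores are centered at zero
# and have unit standard deviation (up to floating-point error)
round(mean(z), 10)
round(sd(z), 10)
```

Under this reading, a score of 0.5 would mean that the institution sits half a standard deviation above the average institution in that dimension.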
The long format has a row for every institution / dimension, with a column containing the value of the score. The first four columns (Institution, Coverage, Cluster and Dimension) are identifiers of the data, whereas the data is stored only in the last column (score).
url.l <- "http://xavier-fim.net/publication/reggov-2018/ra-reggov_2018-institutions-long.csv"
# ra.l: regulatory agencies in the long format
ra.l <- read.table(url.l, header = TRUE, sep = ";", check.names = FALSE) %>%
  tbl_df()
str(ra.l)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3196 obs. of 5 variables:
##  $ Institution: Factor w/ 750 levels "Administration of Occupational Safety and Health ",..: 488 204 98 422 489 665 467 434 668 99 ...
##  $ Coverage   : Factor w/ 793 levels "Afghanistan :: Central Bank",..: 16 17 14 15 18 19 20 22 24 64 ...
##  $ Cluster    : Factor w/ 6 levels "#1 Ideal","#2 Constrained",..: 6 6 1 1 1 5 2 1 5 1 ...
##  $ Dimension  : Factor w/ 4 levels "Managerial autonomy",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ score      : num -0.26 -0.48 0.46 0.77 1.12 0.84 0.86 0.05 0.25 0.7 ...
## # A tibble: 6 x 5
##   Institution             Coverage             Cluster   Dimension    score
##   <fct>                   <fct>                <fct>     <fct>        <dbl>
## 1 National Regulatory Au… Argentina :: Electr… #6 Respo… Managerial … -0.26
## 2 Drug, Food and Medical… Argentina :: Food S… #6 Respo… Managerial … -0.48
## 3 Central Bank of Argent… Argentina :: Centra… #1 Ideal  Managerial …  0.46
## 4 National Commission fo… Argentina :: Compet… #1 Ideal  Managerial …  0.77
## 5 National Regulatory Au… Argentina :: Gas     #1 Ideal  Managerial …  1.12
## 6 Superintendence of Hea… Argentina :: Health… #5 Auton… Managerial …  0.84
The wide format has a row for every institution, and the scores of the four dimensions are stored in different columns. The first three columns identify the observation (Institution, Coverage and Cluster), and the remaining four columns store the data (Managerial autonomy, Political independence, Public accountability, and Regulatory capabilities).
url.w <- "http://xavier-fim.net/publication/reggov-2018/ra-reggov_2018-institutions-wide.csv"
# ra.w: regulatory agencies in the wide format
ra.w <- read.table(url.w, header = TRUE, sep = ";", check.names = FALSE) %>%
  tbl_df()
str(ra.w)
## Classes 'tbl_df', 'tbl' and 'data.frame': 799 obs. of 7 variables:
##  $ Institution            : Factor w/ 750 levels "Administration of Occupational Safety and Health ",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Coverage               : Factor w/ 793 levels "Afghanistan :: Central Bank",..: 313 75 3 4 697 245 42 627 369 376 ...
##  $ Cluster                : Factor w/ 6 levels "#1 Ideal","#2 Constrained",..: 1 2 4 4 3 2 6 5 6 4 ...
##  $ Managerial autonomy    : num 0.18 0.6 -0.63 0.38 -0.16 -0.49 -0.74 0.44 -0.34 -2 ...
##  $ Political independence : num 0.38 0.89 -2.01 -1.82 0.57 0.77 0.31 -0.47 -0.48 -1.33 ...
##  $ Public accountability  : num 0.52 0.38 -1.6 -1.16 -0.7 0.39 -0.54 0.16 0.15 -0.4 ...
##  $ Regulatory capabilities: num 0.03 -0.57 -0.78 0.75 -1.29 -0.93 0.01 0.42 0.03 0.05 ...
## # A tibble: 6 x 7
##   Institution Coverage Cluster `Managerial aut… `Political inde…
##   <fct>       <fct>    <fct>              <dbl>            <dbl>
## 1 "Administr… Iceland… #1 Ide…             0.18            0.38
## 2 Administra… Brazil … #2 Con…             0.6             0.89
## 3 Afghan Ato… Afghani… #4 Dep…            -0.63           -2.01
## 4 Afghanista… Afghani… #4 Dep…             0.38           -1.82
## 5 Agence Nat… Tunisia… #3 Mim…            -0.16            0.570
## 6 Agency for… France … #2 Con…            -0.49            0.77
## # … with 2 more variables: `Public accountability` <dbl>, `Regulatory
## #   capabilities` <dbl>
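The two formats carry the same information: 799 institutions times 4 dimensions gives the 3196 rows of the long file. As a hedged sketch of how one format can be derived from the other, the toy example below uses `pivot_wider()` and `pivot_longer()` from tidyr (available from version 1.0.0). The agency names and the object names `toy.l`, `toy.w` and `toy.back` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy long-format data: two hypothetical agencies, two dimensions
toy.l <- data.frame(
  Institution = c("Agency A", "Agency A", "Agency B", "Agency B"),
  Dimension   = c("Managerial autonomy", "Political independence",
                  "Managerial autonomy", "Political independence"),
  score       = c(0.18, 0.38, -0.63, -2.01),
  stringsAsFactors = FALSE
)

# Long -> wide: one row per institution, one column per dimension
toy.w <- toy.l %>%
  pivot_wider(names_from = Dimension, values_from = score)

# Wide -> long: back to one row per institution / dimension
toy.back <- toy.w %>%
  pivot_longer(-Institution, names_to = "Dimension", values_to = "score")

toy.w
toy.back
```

The long format is usually more convenient for grouped summaries and for ggplot2, while the wide format is handy when dimensions need to be compared against each other column by column.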
Several characteristics of the clusters can be calculated by aggregating the values of the institutions in each of them, such as the mean (mean()) and the standard deviation (sd()).
cl.ch <- ra.l %>% # cl.ch: cluster characteristics
  group_by(Cluster, Dimension) %>%
  summarize(Mean = mean(score), SD = sd(score))
head(cl.ch)
## # A tibble: 6 x 4
## # Groups:   Cluster
##   Cluster        Dimension                Mean    SD
##   <fct>          <fct>                   <dbl> <dbl>
## 1 #1 Ideal       Managerial autonomy     0.614 0.277
## 2 #1 Ideal       Political independence  0.719 0.259
## 3 #1 Ideal       Public accountability   0.581 0.396
## 4 #1 Ideal       Regulatory capabilities 0.343 0.251
## 5 #2 Constrained Managerial autonomy     0.123 0.496
## 6 #2 Constrained Political independence  0.560 0.153
This can be processed and plotted.
ggplot(cl.ch, aes(x = Mean, y = SD, color = Cluster)) +
  geom_point() +
  facet_wrap(~ Dimension)
The figure shows that clusters with higher means in the dimension considered are also generally more homogeneous (as shown by the lower standard deviations), especially in the case of Managerial autonomy and Regulatory capabilities. The pattern is less clear in Political independence and almost nonexistent in Public accountability.
Getting the coverage by country and sector
If the interest lies in having aggregated measures of country and sector characteristics, then the variable Coverage provides the country(ies) and sector(s) that are covered by every regulatory agency.
Observations (institutions) must therefore be expanded to more cases in order to cover all the possible spaces (country/sector).
First we must separate the coverage into countries and sectors. Then we replicate the institutions for each of the countries/sectors they cover.
ra.cv <- ra.l %>% # ra.cv: regulatory agencies coverage
  # Separate the content of coverage in two different new variables
  separate(Coverage, c("countries", "sectors"), " :: ") %>%
  # create new entries for each observation based on the variable
  # sectors, that are generated using the multiple sectors present
  # in the original plural variable (sectors).
  # Rename the resulting variable into singular (Sector)
  separate_rows(sectors, sep = " - ") %>%
  rename(Sector = sectors) %>%
  separate_rows(countries, sep = " - ") %>%
  rename(Country = countries)
From the original object ra.l with the following observations and variables:
## [1] 3196    5
We obtain a new object ra.cv that contains more observations (the institutions copied over and over to fit the number of sectors and countries being covered) and variables (the Country and Sector identifiers).
## [1] 4732    6
## # A tibble: 6 x 6
##   Institution              Country  Sector     Cluster   Dimension    score
##   <fct>                    <chr>    <chr>      <fct>     <fct>        <dbl>
## 1 National Regulatory Aut… Argenti… Electrici… #6 Respo… Managerial … -0.26
## 2 Drug, Food and Medical … Argenti… Food Safe… #6 Respo… Managerial … -0.48
## 3 Drug, Food and Medical … Argenti… Pharmaceu… #6 Respo… Managerial … -0.48
## 4 Central Bank of Argenti… Argenti… Central B… #1 Ideal  Managerial …  0.46
## 5 Central Bank of Argenti… Argenti… Financial… #1 Ideal  Managerial …  0.46
## 6 National Commission for… Argenti… Competiti… #1 Ideal  Managerial …  0.77
This way, we have moved the dataset from being based on institutions to being based on spaces covered by regulation, which makes it suitable for any kind of aggregation (country- or sector-based) that the researcher has in mind. Be aware, however, that the original dataset was conceived and designed for institutions, not spaces, and therefore any transformation must be adapted to the researcher’s theoretical arguments and empirical needs.
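To see how the expansion from institutions to spaces works on a small scale, here is a toy sketch that applies the same separate() / separate_rows() steps to two made-up agencies. The names "Agency A", "Country X", "Sector 1" and so on are hypothetical placeholders, not entries of the dataset.

```r
library(dplyr)
library(tidyr)

# Two hypothetical agencies: the first covers two countries and two
# sectors, the second covers a single country and a single sector
toy <- data.frame(
  Institution = c("Agency A", "Agency B"),
  Coverage    = c("Country X - Country Y :: Sector 1 - Sector 2",
                  "Country Z :: Sector 3"),
  stringsAsFactors = FALSE
)

toy.cv <- toy %>%
  # Split "countries :: sectors" into two columns
  separate(Coverage, c("countries", "sectors"), sep = " :: ") %>%
  # One row per sector, then one row per country
  separate_rows(sectors, sep = " - ") %>%
  rename(Sector = sectors) %>%
  separate_rows(countries, sep = " - ") %>%
  rename(Country = countries)

toy.cv
```

Agency A ends up with four rows (2 countries x 2 sectors) and Agency B with one, so the two institutions become five spaces.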
We can now ask questions such as: which sectors have the highest medians in Regulatory capabilities?
ra.cv %>%
  filter(Dimension == "Regulatory capabilities") %>%
  group_by(Sector) %>%
  summarize(`Sector median` = median(score)) %>%
  arrange(desc(`Sector median`))
| Sector | Sector median |
| Securities and Exchange | 0.085 |
| Nuclear Safety and Radiological Protection | -0.360 |
Or we can generalize the study of medians to include all dimensions:
sm <- ra.cv %>% # sm: sector medians
  group_by(Dimension, Sector) %>%
  summarize(`Sector median` = median(score))
ggplot(sm, aes(x = `Sector median`,
               y = reorder(Sector, `Sector median`),
               color = Dimension)) +
  ylab("Sector") +
  geom_point()
Country aggregations are a bit more problematic than sector ones, given that the number of institutions in each country is obviously much more limited than the number of institutions in each sector. Therefore, we must proceed with care.
For instance, to get the number of institutions in the 15 countries with the fewest institutions:
ra.cv %>%
  # Get the unique pairs of country * institution
  # so that we obtain institutions, not spaces
  select(Country, Institution) %>%
  unique() %>%
  # by country, simply count (using the n() function) the number of agencies
  group_by(Country) %>%
  summarize(`Number of regulatory agencies` = n()) %>%
  # order by ascending number of agencies
  # and report back only the first 15 cases
  arrange(`Number of regulatory agencies`) %>%
  slice(1:15)
| Country | Number of regulatory agencies |
| Korea, Democratic People’s Republic of | 1 |
| Lao People’s Democratic Republic | 1 |
| Syrian Arab Republic | 2 |
| Congo, the Democratic Republic of the | 3 |
Remember that the number of agencies does not necessarily have to match the number of spaces (sectors) covered. So in this case we do almost exactly the same as before, but instead of counting agencies, we count the spaces covered by such agencies.
ra.cv %>%
  # Get the unique pairs of country * sector
  # so that we obtain spaces, not institutions
  select(Country, Sector) %>%
  unique() %>%
  group_by(Country) %>%
  summarize(`Number of sectors covered` = n()) %>%
  arrange(`Number of sectors covered`) %>%
  slice(1:15)
| Country | Number of sectors covered |
| Congo, the Democratic Republic of the | 3 |
| Korea, Democratic People’s Republic of | 3 |
| Lao People’s Democratic Republic | 3 |
| Syrian Arab Republic | 4 |
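Both counts can also be obtained in a single step. The sketch below is an alternative to the select() / unique() / n() pattern, not the approach used in this tutorial: it relies on dplyr's n_distinct(), which counts unique values within each group. The toy data and the object names `toy.cv` and `counts` are hypothetical.

```r
library(dplyr)

# Toy coverage data: Agency A covers two sectors in country X,
# Agency B covers one of the same sectors, Agency C works in country Y
toy.cv <- data.frame(
  Country     = c("X", "X", "X", "Y"),
  Institution = c("Agency A", "Agency A", "Agency B", "Agency C"),
  Sector      = c("Sector 1", "Sector 2", "Sector 2", "Sector 1"),
  stringsAsFactors = FALSE
)

counts <- toy.cv %>%
  group_by(Country) %>%
  summarize(
    `Number of regulatory agencies` = n_distinct(Institution),
    `Number of sectors covered`     = n_distinct(Sector)
  )

counts
```

Because n_distinct() ignores repeated values, it also copes with the long format, where every institution appears once per dimension.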
We can also take a look at the aggregated scores by country in each of the dimensions, looking at their centrality (mean) and dispersion (standard deviation). Again, the choice of aggregation function is entirely the researcher’s responsibility.
c.ch <- ra.cv %>% # c.ch: country characteristics
  group_by(Country, Dimension) %>%
  summarize(Mean = mean(score), SD = sd(score))
head(c.ch)
## # A tibble: 6 x 4
## # Groups:   Country
##   Country     Dimension                  Mean    SD
##   <chr>       <fct>                     <dbl> <dbl>
## 1 Afghanistan Managerial autonomy     -0.05   0.694
## 2 Afghanistan Political independence  -0.888  1.30
## 3 Afghanistan Public accountability   -1.27   0.389
## 4 Afghanistan Regulatory capabilities -0.192  0.796
## 5 Algeria     Managerial autonomy      0.0387 0.579
## 6 Algeria     Political independence  -0.256  0.921
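One practical caveat when aggregating small groups: sd() returns NA for a single value, so any country / dimension group with only one score has an undefined dispersion. A minimal sketch with made-up scores (the countries "X" and "Y" and the column `N` are hypothetical) shows how reporting the group size alongside the summary makes such cases visible:

```r
library(dplyr)

# Country X has two scores, country Y only one (illustrative values)
toy <- data.frame(
  Country = c("X", "X", "Y"),
  score   = c(0.5, -0.5, 0.2),
  stringsAsFactors = FALSE
)

toy.ch <- toy %>%
  group_by(Country) %>%
  summarize(N = n(), Mean = mean(score), SD = sd(score))

toy.ch
```

Here the SD for country Y is NA, and ggplot2 would drop that point from a Mean versus SD plot with a warning about missing values.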
This can be processed and plotted.
ggplot(c.ch, aes(x = Mean, y = SD, color = Dimension, label = Country)) +
  geom_point() +
  geom_text(hjust = 0, nudge_x = 0.02) +
  facet_wrap(~ Dimension)
There does not seem to be any association between the mean scores of each country’s agencies in each dimension and their dispersion.
But other stories can also be read from this figure. For instance, for Political independence Yemen has an average mean across its spaces, but the spaces seem to be very heterogeneous. Therefore, one may conclude that in Yemen the regulatory spaces covered by the agencies probably have either a very high or a low value in Political independence.
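The reasoning behind this reading can be illustrated with two made-up sets of scores that share the same mean but differ sharply in dispersion. The values and the names `polarized` and `homogeneous` are hypothetical and are not taken from the Yemen data:

```r
# Spaces split between very low and very high scores
polarized   <- c(-1.5, -1.4, 1.4, 1.5)
# Spaces that all sit close to the average
homogeneous <- c(-0.1, 0, 0, 0.1)

# Both sets have a mean of (approximately) zero...
mean(polarized)
mean(homogeneous)

# ...but the polarized set has a much larger standard deviation
sd(polarized)
sd(homogeneous)
```

A middling mean combined with a large standard deviation is therefore compatible with a polarized set of spaces, although it does not prove it: inspecting the individual scores is the only way to confirm the pattern.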
##  "2019-02-22 11:34:04 CET"