Multilevel regression with poststratification (MRP) is a technique to obtain estimates of public opinion for subpopulations in specific geographic units that may be undersampled and underrepresented in data. This allows you to both estimate public opinion for these subpopulations, as well as to aggregate all group estimates by geographic unit to obtain estimates of overall public opinion within that unit. This lab is adapted from Kastellac et al.’s working paper (Kastellec, Lax, and Phillips 2016) and included example code which generates estimates of support for same sex marriage in each state based on five national polls from 2004.
library(lme4) # random effect models
library(foreign) # read .dta files
library(arm) # inverse logit function
# read in megapoll
read.dta('https://www.princeton.edu/~jkastell/MRP_primer/gay_marriage_megapoll.dta', convert.underscore = TRUE)
marriage_data <-
# read in state-level dataset
read.dta('https://www.princeton.edu/~jkastell/MRP_primer/state_level_update.dta', convert.underscore = TRUE)
Statelevel <- Statelevel[order(Statelevel$sstate.initnum), ]
Statelevel <-
# read in sensus data
read.dta('https://www.princeton.edu/~jkastell/MRP_primer/poststratification%202000.dta', convert.underscore = TRUE)
Census <- Census[order(Census$cstate), ]
Census <-$cstate.initnum <- match(Census$cstate, Statelevel$sstate) Census
We need to create common identifiers in both our ‘megapoll’ and census data so that we can properly poststratify (weight) our results by using the proportion of each group in each state. Each of these identifiers will be a factor variable which we can use to index random effects when generating our prediction cells later.
# from 1 for white males to 6 for hispanic females
$race.female <- (marriage_data$female * 3) + marriage_data$race.wbh
marriage_data
# from 1 for 18-29 with low edu to 16 for 65+ with high edu
$age.edu.cat <- 4 * (marriage_data$age.cat - 1) + marriage_data$edu.cat
marriage_data
# proportion of evangelicals in respondent's state
$p.evang.full <- Statelevel$p.evang[marriage_data$state.initnum]
marriage_data
# proportion of mormon's in respondent's state
$p.mormon.full <-Statelevel$p.mormon[marriage_data$state.initnum]
marriage_data
# combined evangelical + mormom proportions
$p.relig.full <- marriage_data$p.evang.full + marriage_data$p.mormon.full
marriage_data
# kerry's % of 2-party vote in respondent's state in 2004
$p.kerry.full <- Statelevel$kerry.04[marriage_data$state.initnum]
marriage_data
# same coding as above
$crace.female <- (Census$cfemale * 3) + Census$crace.WBH
Census$cage.edu.cat <- 4 * (Census$cage.cat - 1) + Census$cedu.cat
Census$cp.evang.full<- Statelevel$p.evang[Census$cstate.initnum]
Census$cp.mormon.full <- Statelevel$p.mormon[Census$cstate.initnum]
Census$cp.relig.full <- Census$cp.evang.full + Census$cp.mormon.full
Census$cp.kerry.full <- Statelevel$kerry.04[Census$cstate.initnum] Census
Next we run our individual level model to predict whether a given respondent supports marriage equality or not. The response is a function of an intercept term, and various demograhpic effects, indexed by \(i\) for individuals, and \(j\), \(k\), \(l\), \(m\), \(s\) and \(p\) for race and gender combination, age, education, state, and poll respectively. The model also includes an age \(\times\) education term. Race-gender, age, and education are all factor variables indicating membership in a specific category.
\[ \text{PR}(y = 1 ) = \text{logit}^{-1}(\beta_0 + \alpha_{j[i]}^{race,gender} + \alpha_{k[i]}^{age} + \alpha_{l[i]}^{edu} + \alpha_{k[i],l[i]}^{age:edu} + \alpha_{s[i]}^{state} + \alpha_{p[i]}^{year}) \]
The varying group effects are modeled as drawn from a normal distribution with mean zero and an estimated variance:
\[ \begin{align} \alpha_j^{race,gender} &\sim \mathcal{N}(0, \sigma_{race,gender}^2),~\text{for}~j = 1, \ldots, 6 \\ \alpha_k^{age} &\sim \mathcal{N}(0, \sigma_{age}^2),~\text{for}~k = 1, \ldots, 4 \\ \alpha_l^{edu} &\sim \mathcal{N}(0, \sigma_{edu}^2),~\text{for}~l = 1, \ldots, 4 \\ \alpha_{k,l}^{age:edu} &\sim \mathcal{N}(0, \sigma_{age:edu}^2),~\text{for}~k = 1, \ldots, 4~\text{and}~l = 1, \ldots, 4 \\ \alpha_m^{region} &\sim \mathcal{N}(0, \sigma_{region}^2),~\text{for}~m = 1, \ldots 5 \\ \alpha_p^{year} &\sim \mathcal{N}(0, \sigma_{year}^2),~\text{for}~p = 1, \ldots \end{align} \]
The state effects are slightly more complicated. Instead of being drawn from a normal distribution with mean zero, these are modeled as being drawn from a normal distribution with a mean that is a function of a linear combination of state-level variables, in this case religiosity and John Kerry’s voteshare.
\[ \begin{align} \alpha_s^{state} &\sim \mathcal{N}(\mu_{state}, \sigma_{state}^2)~\text{for}~s = 1,\ldots,51 \\ \mu_{state} &= \alpha_{m[s]}^{region} + \gamma_1 relig_s + \gamma_2 presvote_s \end{align} \]
# model with demographic and geographic random effects
glmer(formula = yes.of.all ~ (1 | race.female) + (1 | age.cat)
individual_model <-+ ( 1 |edu.cat) + (1 | age.edu.cat) + (1 | state) + (1 | region)
+ (1 | poll) + p.relig.full + p.kerry.full, data = marriage_data,
family = binomial(link = 'logit'))
Let’s examine some random effects to see if they match with our expectations.
data.frame(Education = ranef(individual_model)$edu.cat,
df <-Age = ranef(individual_model)$age.cat)
colnames(df) <- c('Education', 'Age')
df
These certainly do, with increasing education having a positive effect on the probability of supporting marriage equality, and increasing age having a negative effect.
We need to create a vector of state random effects becuase Alaska and Hawaii are not in our data, so we can’t just use the state random effects from our individual level model. Instead, we extract the state random effects and then set the missing Alaskan and Hawaiian random effects to zero (the Bayesian in me says that we probably have enough prior information about the political climates in these states to set slight negative and positive random effects, respectively…).
# empty vector to hold state random effects
array(NA, c(51, 1))
state_ranefs <-
# set state names as row names
dimnames(state_ranefs) <- list(c(Statelevel$sstate), 'effect')
# assign state random effects to array while preserving NAs
for (i in Statelevel$sstate) {
ranef(individual_model)$state[i, 1]
state_ranefs[i, ] <-
}
# set states with missing REs (b/c not in data) to zero
1][is.na(state_ranefs[, 1])] <- 0 state_ranefs[,
Next we need to create a ‘prediction cell’ for every possible combination of demographic-state effects in our census data. We have 96 possible demographic combinations \(\times\) 51 states = 4896 cells. This step is exactly like creating a hypothetical variable profile for simulation based predicted probabilities, except that instead of varying one variable and holding all others at their central tendency or a meaningful quantity, we are varying all variables to get the predicted probability of every demographic combination in every state supporting marriage equality.
# create a prediction for each cell in Census data
invlogit(fixef(individual_model)['(Intercept)']
cellpred <-+ ranef(individual_model)$race.female[Census$crace.female, 1]
+ ranef(individual_model)$age.cat[Census$cage.cat, 1]
+ ranef(individual_model)$edu.cat[Census$cedu.cat, 1]
+ ranef(individual_model)$age.edu.cat[Census$cage.edu.cat, 1]
+ state_ranefs[Census$cstate, 1]
+ ranef(individual_model)$region[Census$cregion, 1]
+ (fixef(individual_model)['p.relig.full'] * Census$cp.relig.full)
+ (fixef(individual_model)['p.kerry.full'] * Census$cp.kerry.full))
We then weight each cell’s predicted probability by the proportion of each state’s population that it represents.
# weight each cell's prediction by its frequency w/in its state
cellpred * Census$cpercent.state cellpredweighted <-
Finally, aggregate each demographic category’s proportion of supporters by state, and multiply by 100 to obtain the percent of people in each state estimated to support marriage equality.
# sum all weighted cell predictions by state, and convert to percent
100 * as.vector(tapply(cellpredweighted,Census$cstate,sum))
statepred <-
# collect states, estimates, and proportion religious
data.frame(State = Statelevel$sstate,
est_gg <-Estimate = statepred,
Religion = Statelevel$p.evang + Statelevel$p.mormon)
# present estimates by state in datafame, minus religion
-3] est_gg[,
library(ggplot2) # plots
library(plotly) # interactive plots
# plot estimates, colored by religion, label to avoid reorder() in plotly
ggplot(est_gg, aes(x = reorder(State, Estimate), y = Estimate,
est_plot <-color = Religion, label = State)) +
geom_point(size = 2.5) +
labs(x = 'State', y = 'Estimated Support') +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle=90),
legend.position = 'right',
plot.background = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank())
# interactive plot; ggplotly needs colour, not color
ggplotly(est_plot, tooltip = c('label', 'y', 'colour'))
Use the simulation approach to generate point estimates and 95% confidence intervals for the proportion of people within each state supporting marriage equality. Use at least 10 simulations.
Hint: the simulate()
function is your friend because it has a method for objects of class merMod
, which allows it to simulate new responses from the fitted model object – but be sure to account for the random effects in the model. You want to carry out the MRP process multiple times, substituting in a different simulated outcome vector each time. You can accomplish this either with a loop, or through a series of calls to apply()
with anonymous functions.
Kastellec, Jonathan P., Jeffrey R. Lax, and Justin Phillips. 2016. “Estimating State Public Opinion with Multi-Level Regression and Poststratification Using R.” Working Paper.