As with all statistical analysis, one of the biggest parts of text analysis is data preparation. We’re going to be using a subset of the data from Greene and Cross (2017), which are European Parliament speeches. These data are about as well cleaned and prepared as we could hope for, but we still have some work ahead of us. The data are available from the URLs used in the download code below.

# download speeches and metadata
download.file('https://erdos.ucd.ie/files/europarl/europarl-data-speeches.zip',
              'europarl-data-speeches.zip')

download.file('https://erdos.ucd.ie/files/europarl/europarl-metadata.zip',
              'europarl-metadata.zip')

# extract speeches and metadata
unzip('europarl-data-speeches.zip')
unzip('europarl-metadata.zip')

Loading Text Data

Often we get text data pre-processed so that each document is contained in an individual text file, in subdirectories that represent authors, speakers, or some other meaningful grouping. In this case, we have speeches nested in directories by year, month, and day. Take a look at the readtext() function in the package of the same name, and combine it with list.files() to easily load all the speeches from 2009 and 2010 into a dataframe that quanteda can read.

library(readtext) # easily read in text in directories

# recursively get filepaths for speeches from 09-10
speeches_paths <- list.files(path = c('europarl-data-speeches/2009',
                                      'europarl-data-speeches/2010'),
                             recursive = T, full.names = T)

# read in speeches
speeches <- readtext(speeches_paths)

Creating a Corpus

The corpus() function in quanteda expects a dataframe where each row contains a document ID and the text of the document itself. Luckily, this is exactly what readtext() has produced for us.

library(quanteda)
speeches <- corpus(speeches)

Quanteda supports two kinds of external information that we can attach to each document in a corpus: metadata and document variables. Metadata are not intended to be used in any analyses; they are often just notes to the researcher containing things like the source URL for a document, or which language it was originally written in.

metadoc(speeches, field = 'type') <- 'European Parliament Speech'

Document variables, on the other hand, are the structure that we want to be able to use in our structural topic model. Read in the speech and speaker metadata files, merge the speaker data onto the speech data, subset the document variables to just 2009 and 2010, and then assign the combined data to our corpus as document variables.

library(lubridate) # year function
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
# read in speech docvars
speeches_dv <- read.delim('europarl-documents-metadata.tsv', sep = '\t')

# subset metadata to 2009-2010
speeches_dv <- speeches_dv[year(speeches_dv$date) >= 2009 &
                                   year(speeches_dv$date) <= 2010, ]

# read in MEP docvars
MEP_dv <- read.delim('europarl-meps-metadata.tsv', sep = '\t')

# merge MEP docvars onto speech metadata
dv <- merge(speeches_dv, MEP_dv, all.x = T,
            by.x = 'mep_ids', by.y = 'mep_id')

# merge docvars onto corpus
docvars(speeches) <- dv

# inspect first entries
head(docvars(speeches))

The way that quanteda is written, a corpus is intended to be a static object that other transformations of the text are extracted from. In other words, you won’t be altering the speeches corpus object anymore. Instead, you’ll be creating new objects from it. This approach allows us to conduct analyses with different requirements, such as ones that require stemmed input and those that need punctuation retained, without having to recreate the corpus from scratch each time.
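
For example, we can derive two differently processed objects from the same corpus without ever altering the corpus itself. This is just a minimal sketch, and the object names are illustrative:

# a stemmed, punctuation-free tokens object for bag-of-words analyses
speeches_toks_stemmed <- tokens_wordstem(tokens(speeches, remove_punct = TRUE))

# an unstemmed tokens object that keeps punctuation, e.g. for keyword-in-context searches
speeches_toks_raw <- tokens(speeches)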

We can also subset corpora based on document level variables using the corpus_subset() function. Create a new corpus object that is a subset of the full speeches corpus only containing speeches made by members of the European People’s Party.

EPP_corp <- corpus_subset(speeches, group_shortname == 'EPP')

We can use the texts() function to access the actual text of each observation.

texts(EPP_corp)[5]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 2009/01/12/TEXT_CRE_20090112_1-019.txt 
## "European Parliament\nHannes Swoboda,\non behalf of the PSE Group.\n(DE)\nMr President, we have, of course, given this matter a great deal of thought. Perhaps Mr Cohn-Bendit overestimates the significance of a resolution, but with the Security Council's resolution we have a basis which we should support and, as the President of Parliament has already said, we should require both sides to seek peace, to lay down their arms and to comply with the Security Council's resolution. I would, however, just like to add that this must be the gist of our resolution. If this is so, we can support it. In this context we would cooperate and in this context we would support Mr Cohn-Bendit's motion.\n"

Data Exploration and Descriptive Statistics

We can use the KeyWord in Context function kwic() to quickly see the words around a given word for whenever it appears in our corpus. Try finding the context for every instance of a specific word in our corpus. You might want to pick a slightly uncommon word to keep the output manageable.

kwic(speeches, 'hockey', window = 7)

Right away, we can see two very different contexts that hockey appears in: actual ice hockey and the IPCC hockey stick graph. Any model of text as data that’s worthwhile needs to be able to distinguish between these two different uses.

We can also save the summary of a corpus as a dataframe and use the resulting object to plot some simple summary statistics. By default, the summary.corpus() function only returns the first 100 rows. Use the ndoc() function (which is analogous to nrow() or ncol(), but for the number of documents in a corpus) to get around this. Plot the density of each country’s tokens on the same plot to explore whether MEPs from some countries are more verbose than others (you’ll be able to see better if you limit the x axis to 1500; there are a handful of speeches with way more words that skew the graph).

# extract summary statistics
speeches_df <- summary(speeches, n = ndoc(speeches))

library(ggplot2) # ggplots

# plot density of tokens by country
ggplot(data = speeches_df, aes(x = Tokens, fill = country)) +
  geom_density(alpha = .25, linetype = 0) +
  theme_bw() +
  coord_cartesian(xlim = c(0, 1500)) +
  theme(legend.position = 'right',
        plot.background = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        panel.border = element_blank())

We can also see whether earlier speeches differ substantially from later ones in terms of the number of tokens per speech. Break this plot down by party group rather than country, because there are so many countries that the plot would be unreadable.

ggplot(data = speeches_df, aes(x = ymd(date), y = Tokens, color = group_shortname)) +
  geom_smooth(alpha = .6, linetype = 1, se = F, method = 'loess') +
  theme_bw() +
  theme(legend.position = 'right',
        plot.background = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        panel.border = element_blank())

We’ve got all kinds of things going on here. This is definitely a case where a regular LDA model might miss some stuff given all the differences in speech length over time by party. The biggest issue, however, is that NA category. stm doesn’t handle NAs well, so we need to get rid of any texts that have missing data. We can’t just use the na.omit() function on a corpus, so we need to use the corpus_subset() function in quanteda to write a logical test that will drop any texts with missing document variables. We’ll want to fit a model that uses the speaker (MEP ID), country, and party as structural variables, so figure out which of these have missing document variables in our corpus and then drop them with corpus_subset().

# count number of missing observations in each document variable
apply(speeches$documents, 2, function(x) sum(is.na(x)))
##           texts           _type         mep_ids     document_id 
##               0               0               0               0 
##            date           title         surname       firstname 
##               0               0             270             270 
##         country   country_short           group group_shortname 
##             270             270             270            2753
# drop any texts with missing document variables
speeches <- corpus_subset(speeches, !is.na(country_short))

# count number of missing observations in each document variable again
apply(speeches$documents, 2, function(x) sum(is.na(x)))
##           texts           _type         mep_ids     document_id 
##               0               0               0               0 
##            date           title         surname       firstname 
##               0               0               0               0 
##         country   country_short           group group_shortname 
##               0               0               0            2483

Looks like we’ll be using group instead of the more convenient group_shortname variable due to missingness. Before we go about making our document-feature matrix and fitting a structural topic model, we need to drop some observations. There are 33340 documents in these two years of European Parliament speeches, and an STM will take several hours to run on them. Take a random sample of 1/10 of the speeches using the corpus_sample() function.

speeches_sub <- corpus_sample(speeches, size = floor(ndoc(speeches) / 10))

Now let’s extract some summaries that can be used in text analysis to create other descriptions of the data. Use the dfm() function to construct a document-feature matrix for the speeches. You’ll want to convert the input to lowercase, conduct stemming, and remove punctuation and stop words. Once we’ve created a document-feature matrix, we can use the topfeatures() function to see the top \(n\) tokens in our corpus.

# create document feature matrix from corpus
speeches_dfm <- dfm(speeches_sub, tolower = T, stem = T,
                    remove = stopwords('english'), remove_punct = T)

# view top 16 tokens
topfeatures(speeches_dfm, 16)
##   european parliament       will     presid       also       must 
##       8287       4950       3435       2515       2316       2260 
##      state         mr      union     member         eu    countri 
##       2157       2035       1969       1904       1863       1718 
##    commiss       need      right     report 
##       1700       1658       1624       1603

Structural Topic Models

We could use the readCorpus() function in stm to convert our document-feature matrix from quanteda into a format that stm understands. However, this would result in the loss of our document variables, so we want to use the convert() function in quanteda instead and specify that we’re converting to an object for stm. In certain cases, we may want to remove rarely used words from our corpus. The prepDocuments() function does this, and we can use plotRemoved() to view how the threshold we set for removal affects our corpus.

library(stm)
## stm v1.2.2 (2017-03-28) successfully loaded. See ?stm for help.
speeches_stm <- convert(speeches_dfm, to = 'stm')
plotRemoved(speeches_stm$documents, lower.thresh = seq(1, 25, by = 1))

speeches_stm <- prepDocuments(speeches_stm$documents, speeches_stm$vocab,
                              speeches_stm$meta, lower.thresh = 5)
## Removing 8683 of 12753 terms (16167 of 280370 tokens) due to frequency 
## Your corpus now has 3334 documents, 4070 terms and 264203 tokens.

We’re now ready to fit a structural topic model using speaker party and country as our covariates. We can’t just feed the stm() function our corpus object. Instead, we have to specify each argument with the elements from our corpus, and we define the covariate effect with standard R formula notation. Given the size of our corpus, this model takes several minutes to run, so I’ve already run it and saved the resulting STM object in the file fit_stm.RDS. Load the STM object for the following exploration.

fit_stm <- stm(documents = speeches_stm$documents, vocab = speeches_stm$vocab,
               K = 15, prevalence = ~ country + group, seed = 374075,
               data = speeches_stm$meta, sigma.prior = .1)
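
If you’re loading the pre-fit model rather than running stm() yourself, something like the following should work (this assumes fit_stm.RDS is in your working directory):

# load the saved STM object instead of fitting it from scratch
fit_stm <- readRDS('fit_stm.RDS')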

Once we’ve fit our model, we want to assess the topics that it has identified, and see whether the topic assigned to each document aligns with the decision we would make. We can use the labelTopics() function to view the top \(n\) words for each topic. The first set of words is a list of words with the highest probability of being included in each topic. You’ll notice that there’s a lot of overlap between topics here, so this may not be the most useful. To get at the distinct character of each topic, the FREX words are those that are both frequent and not shared by other topics. Score and Lift are the top words under weighting measures used in other text analysis packages.

labelTopics(fit_stm, n = 10)
## Topic 1 Top Words:
##       Highest Prob: must, need, problem, howev, clear, respons, parliament, time, european, de 
##       FREX: problem, clear, respons, ni, solut, de, need, howev, must, find 
##       Lift: franz, obermayr, prudent, andrea, bron, jahr, mölzer, othmar, embrac, kara 
##       Score: problem, need, must, clear, de, ni, respons, mölzer, solut, andrea 
## Topic 2 Top Words:
##       Highest Prob: right, peopl, human, countri, european, parliament, govern, polit, intern, presid 
##       FREX: israel, gaza, iran, kosovo, afghanistan, religi, sentenc, prison, isra, regim 
##       Lift: 20th, albania, arrest, brave, hama, herzegovina, islam, kosovo, lukashenko, lunacek 
##       Score: israel, human, belarus, iran, gaza, violenc, arrest, religi, murder, isra 
## Topic 3 Top Words:
##       Highest Prob: mr, presid, like, us, will, commission, say, now, want, thank 
##       FREX: say, said, let, commission, know, want, thank, us, look, thing 
##       Lift: frank, hann, say, someon, spoke, swoboda, apologis, chris, davi, derk 
##       Score: say, presid, mr, us, commission, thank, want, let, said, know 
## Topic 4 Top Words:
##       Highest Prob: budget, transport, financi, european, agenc, year, manag, budgetari, passeng, sector 
##       FREX: budget, transport, agenc, budgetari, maritim, rail, auditor, passeng, discharg, ship 
##       Lift: auditor, mathieu, waterway, aviat, carrier, coach, concili, mode, multiannu, navig 
##       Score: budget, transport, passeng, maritim, budgetari, rail, auditor, agenc, discharg, sea 
## Topic 5 Top Words:
##       Highest Prob: energi, develop, european, chang, climat, strategi, invest, econom, will, sustain 
##       FREX: climat, gas, energi, emiss, innov, research, sustain, 2020, invest, technolog 
##       Lift: planet, a7-0313, a7-0331, b6-0003, b7-0616, biolog, biomass, cancún, co, dioxid 
##       Score: energi, climat, emiss, gas, 2020, technolog, innov, carbon, research, strategi 
## Topic 6 Top Words:
##       Highest Prob: right, european, protect, state, citizen, human, member, data, eu, union 
##       FREX: convent, immigr, data, asylum, illeg, traffick, sexual, terror, judici, charter 
##       Lift: 1995, a6-0026, a7-0015, acta, angelilli, b7-0228, busuttil, consuegra, convent, díaz 
##       Score: data, human, traffick, right, convent, terror, asylum, court, crime, treati 
## Topic 7 Top Words:
##       Highest Prob: member, state, propos, commiss, regul, direct, legisl, system, market, rapporteur 
##       FREX: regul, legisl, direct, harmonis, substanc, assess, scope, propos, simplifi, system 
##       Lift: dalli, 1986, biocid, hazard, transplant, cosmet, tender, uncertainti, vat, owner 
##       Score: regul, legisl, market, anim, propos, direct, legal, solvit, state, cosmet 
## Topic 8 Top Words:
##       Highest Prob: product, agricultur, food, european, produc, industri, countri, million, aid, eur 
##       FREX: farmer, disast, agricultur, flood, food, produc, farm, product, price, water 
##       Lift: nil, 280, banana, bové, cereal, coastlin, dăncilă, dian, dioxin, disast 
##       Score: product, agricultur, food, farmer, disast, eur, anim, flood, farm, industri 
## Topic 9 Top Words:
##       Highest Prob: parliament, european, vote, presid, group, will, amend, debat, 2010, take 
##       FREX: tabl, amend, statement, paragraph, request, tomorrow, written, ale, vert, socialist 
##       Lift: 149, noon, paragraph, part-sess, thursday, wednesday, 00, 142, adjourn, tuesday 
##       Score: vote, amend, statement, debat, presid, 149, written, tabl, resolut, paragraph 
## Topic 10 Top Words:
##       Highest Prob: will, also, import, issu, first, work, good, can, believ, regard 
##       FREX: good, first, hope, issu, gentlemen, ladi, regard, mention, point, exampl 
##       Lift: b7-0407, färm, göran, marit, oll, segelström, good, agent, iceland, fourth 
##       Score: good, work, first, iceland, issu, import, gentlemen, ladi, second, forward 
## Topic 11 Top Words:
##       Highest Prob: european, write, parliament, report, inform, vote, s, d, favour, consum 
##       FREX: label, pt, patient, write, inform, advertis, consum, internet, s, d 
##       Lift: advertis, auconi, gambl, sophi, artist, bonsignor, diet, mislead, nutrit, olejniczak 
##       Score: consum, write, inform, pt, vote, label, advertis, patient, internet, qualiti 
## Topic 12 Top Words:
##       Highest Prob: european, crisi, state, polici, govern, econom, treati, financi, member, lisbon 
##       FREX: gue, ngl, euro, greec, fish, tax, bank, fisheri, currenc, fiscal 
##       Lift: discard, eurobond, imperialist, pact, privatis, toussa, b7-0213, b7-0214, b7-0219, b7-0301 
##       Score: crisi, euro, tax, treati, lisbon, greec, ngl, gue, bank, fish 
## Topic 13 Top Words:
##       Highest Prob: european, union, region, polici, eu, develop, countri, cooper, parliament, import 
##       FREX: region, cohes, ukrain, russia, cultur, cooper, mediterranean, common, strateg, polici 
##       Lift: constantin, latin, petru, a7-0133, euro-mediterranean, iosif, justa, luhan, macro-region, marek 
##       Score: region, cohes, polici, partnership, ukrain, russia, union, moldova, cooper, eastern 
## Topic 14 Top Words:
##       Highest Prob: social, fund, european, peopl, women, employ, work, worker, job, econom 
##       FREX: worker, enterpris, employ, globalis, women, labour, medium-s, unemploy, young, redund 
##       Lift: dell, egf, graduat, medium-s, nace, nurs, part-tim, patern, reintegr, reloc 
##       Score: women, worker, egf, fund, unemploy, labour, employ, redund, social, men 
## Topic 15 Top Words:
##       Highest Prob: european, council, 2009, commiss, question, mr, agreement, parliament, committe, report 
##       FREX: council, 2009, question, affair, mrs, next, negoti, committe, ombudsman, parliamentari 
##       Lift: acp, berè, joli, pervench, acp-eu, caribbean, epa, interim, ombudsman, council 
##       Score: 2009, council, committe, agreement, question, behalf, ombudsman, item, mrs, mr

We can identify some topics very easily. Topic 2 seems to be about conflict and human rights, with a focus on Israel, the Palestinian Territories, and Afghanistan. Topic 4 seems to be about budgets and transportation. Topic 5 is pretty clearly climate change. Topic 6 is immigration and human trafficking. Topic 8 appears to be about agriculture, farming, and the effect of natural disasters on food production. Topic 11 looks to be about consumer and patient rights and protection. Topic 13 contains a lot of words that make it look like it’s about European relations with Russia. Topic 14 seems to cover social welfare issues.

There are also a handful of topics including 1, 7, 9, 10, and 15 that appear to just be capturing procedural language. These suggest that we cannot rely on the standard stop word dictionary in our preprocessing. Ideally, we would create a corpus-specific stopword dictionary that adds words like ‘discuss’, ‘fellow’, ‘question’, ‘answer’, ‘gentleman’, and ‘lady’.
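
As a minimal sketch of how such a list might be used (the words here are just the examples above; for the exercise at the end of this lab you’ll want to choose your own by reading speeches):

# combine a hypothetical corpus-specific stopword list with the standard English list
custom_stops <- c('discuss', 'fellow', 'question', 'answer', 'gentleman', 'lady')

speeches_dfm_custom <- dfm(speeches_sub, tolower = T, stem = T,
                           remove = c(stopwords('english'), custom_stops),
                           remove_punct = T)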

We can produce several different plots to explore the topics our model has identified. A plot with type = 'labels' produces the most common words for a given topic. Make a label plot using FREX labels for topics 2, 4, 5, and 6, and one using score labels.

par(mfrow = c(1,2))
plot(fit_stm, type = 'labels', topics = c(2, 4:6), labeltype = 'frex', main = 'FREX')
plot(fit_stm, type = 'labels', topics = c(2, 4:6), labeltype = 'score', main = 'score')

par(mfrow = c(1,1))

A plot with type = 'perspectives' shows how different words are related to two different topics. Since one of our covariates is group, we can leverage this to see how different words are related to the same topics for each party. Produce plots that demonstrate how the same words are related to topics 5 and 6 for each party. These plots tell us how each party talks about the same topic using different words.

for (i in 1:length(unique(dv$group))) plot(fit_stm, type = 'perspectives',
                                    topics = c(5,6), labeltype = 'frex',
                                    covarlevels = unique(dv$group)[i],
                                    text.cex = .75, main = unique(dv$group)[i])

We can use the findThoughts() function to view the actual documents associated with specific topics to try and determine if the topic coding makes sense. We need to supply the original documents as an argument to this function, so access them from our quanteda corpus object using $documents$texts. Do this on topic 14, which might be about social welfare. We can either browse through all of the responses, or use the plotQuote() function to plot the top \(n\) documents.

thoughts14 <- findThoughts(fit_stm, texts = speeches_sub$documents$texts,
                           topics = 14, n = 7)$docs[[1]]
plotQuote(thoughts14, width = 95, text.cex = .65, maxwidth = 275)

Looks like we’re off base and topic 14 is actually about a whole bunch of different ideas including education, import taxes, and the legitimacy of the EU project.

Use the cloud() function to plot a wordcloud of topic 8 so we can evaluate its association with farming and agriculture.

cloud(fit_stm, topic = 8)

Looks like our initial intuition about topic 8 was pretty spot-on, but there might also be some discussion of agricultural subsidies in there.

A key advantage of STMs over standard LDA is that we can use the estimateEffect() function to obtain estimates of the effect of changes in covariate values on the probability of a document being about a specific topic. In this case, we want to see how speakers from different parties or countries are more or less likely to give speeches about certain topics. Use estimateEffect() to create an object that contains information on these changes, which we can manipulate to produce predicted probabilities. It needs an object where it can find all the covariates for our model, so set the metadata argument of estimateEffect() to the documents element from our corpus, because it contains all of those variables. We can use plot() with method = 'difference' to illustrate the mean difference in topic proportions by group. Let’s look at how socialists and euroskeptics talk about climate change and immigration.

est_stm <- estimateEffect( ~ country + group, fit_stm, metadata = speeches_sub$documents)

plot(est_stm, covariate = 'group', topics = 5:6, model = fit_stm,
     cov.value1 = as.character(unique(speeches_sub$documents$group)[7]),
     cov.value2 = as.character(unique(speeches_sub$documents$group)[5]),
     method = 'difference')

Notice that we’ve got measures of uncertainty around these estimates! This plot tells us that the socialists are less likely to talk about climate change than the euroskeptics, but more likely to talk about immigration and human trafficking. Remember, just because euroskeptics are talking about climate change doesn’t necessarily mean that they believe in it or are advocating for addressing it. We’d need to drill down more into our topics to determine if they’re speaking about it positively or negatively; one way to start that is sketched below, after the group counts. We can plot an estimateEffect object with method = 'pointestimate' to find out the mean topic proportions for each group for a given topic. Try this with topic 2, our conflict and human rights topic.

plot(est_stm, covariate = 'group', topics = 2, model = fit_stm, method = 'pointestimate')

Notice that the Christian Democrats have the least uncertainty. This is because they have the most speeches in our subset of the corpus. The Independence/Democracy Group has the fewest speeches and the largest uncertainty.

data.frame(sort(table(speeches_sub$documents$group), decreasing = T))
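
As promised, a quick sketch of drilling down into topic content: we can reuse findThoughts() to read the speeches most strongly associated with the climate change topic and note which parties their speakers belong to.

# pull and plot the top documents for the climate change topic (topic 5)
climate_thoughts <- findThoughts(fit_stm, texts = speeches_sub$documents$texts,
                                 topics = 5, n = 5)
plotQuote(climate_thoughts$docs[[1]], width = 95, text.cex = .65, maxwidth = 275)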

Individual Exercise

Read through some European Parliament speeches. Identify at least 10 words that are commonly found in opening remarks that you think are procedural and don’t convey any meaning. Use them as additional stopwords to create a new document-feature matrix from the (subset) corpus. You can do this by concatenating your vector of custom stopwords with stopwords('english'). Run the same stm model as in this lab. Does the removal of these corpus-specific stopwords change the results of the model? Present the results of labelTopics() with FREX words to support your conclusion. Hint: just because a word is very frequent, e.g. ‘Euro’, doesn’t mean it isn’t substantive, given the presence of euroskeptic parties and the frequency of debate over the EU project.

Greene, Derek, and James P. Cross. 2017. “Exploring the Political Agenda of the European Parliament Using a Dynamic Topic Modeling Approach.” Political Analysis 25 (1): 77–94.