Functional programming covers everything you can do with functions beyond writing one that accepts a vector and returns a vector as output. You can create functions inside other functions, write functions that create other functions, and even pass functions as arguments to other functions. You've already done some of these things without even realizing it.
Before we dive into functional programming, it's helpful to understand how functions actually work in R. The browser() function lets us get inside a function and see what happens line by line. Try writing a weighted mean function that accepts a vector x and a scalar weights, checks that x is numeric or integer, multiplies x by weights, sums x, and then divides by the length of x before returning the result. In addition, include the browser() function on the first line of the function.
wgt.mean <- function(x, weights = 1) {
  browser()
  if (is.numeric(x) | is.integer(x)) {
    x <- x * weights
    sum(x) / length(x)
  }
}
wgt.mean(1:15)
Run the function with a random vector of your choice. Notice that the console now has a row of buttons along the top, and the environment pane has changed. At the very top left, click the blue icon and select 'Global Environment.' The environment pane should now look just like it did before; that's because the environment pane for your function only lists objects that are visible to your function! However, the values for these objects are greyed out. These values are known to your function because they're arguments to it, but they haven't actually been evaluated yet. Once they are, they will switch from grey to black. If your function creates new objects from scratch, they won't appear in the list at all until they're created.
So how do we move things along and turn those grey values black? Let's turn to those new buttons along the top of the console. Press the 'Next' button and see what happens. The highlight moves to the next line of the function. Keep pressing 'Next' until you are through the entire function. Notice that x and weights shift from grey to black as they are evaluated, and x changes value after being multiplied by weights. Within a function, R first looks for objects by name within the function's environment, and then one level up. This process continues all the way up to the global environment, so you could remove weights as a function argument, define it as an object in the global environment, and the function would still run. You generally want to avoid this because it means the function is no longer self-contained and portable, but it's good to know because it's a common cause of errors when writing your own functions – yet another reason not to use generic object names like x!
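To see this lookup in action, here's a minimal sketch (wgt.mean2 and the global weights object are hypothetical, not part of the exercise above):

weights <- 2                # defined in the global environment
wgt.mean2 <- function(x) {  # note: no weights argument
  x <- x * weights          # weights isn't found locally, so R looks one
  sum(x) / length(x)        # level up and finds the global object
}
wgt.mean2(1:15)             # runs, but depends on an object outside the function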
browser() can be very useful for debugging when you run into errors while writing your own functions. Add it to the first line of the function, then step through each line, paying close attention to the function's environment. For especially complicated functions with several loops or conditions, this can be much faster than old-fashioned debugging by print, which would require you to add print statements at many different points.
Anonymous functions are functions you don't have to define as a separate named object; you can use them directly inside another line of code. Functions like apply() have an argument that takes a function. Normally we supply a base R function, a function from a package, or a function we've written ourselves. This last case is where anonymous functions let us cut down the amount of code it takes to accomplish a task. If we want to know the range that each variable in a dataset covers, we can use an anonymous function. Try typing the code below into the console and running it. We can also use anonymous functions to de-mean the data.
apply(mtcars, 2, function(x) max(x) - min(x))
apply(mtcars, 2, function(x) x - mean(x))
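If an anonymous function needs more than one expression, you can wrap its body in braces just like a named function. This sketch reproduces the range calculation above in that style:

apply(mtcars, 2, function(x) {
  r <- range(x)  # range() returns c(min, max)
  r[2] - r[1]    # the value of the last expression is returned
})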
Use the code below to load the COW non-state wars data. Note that I'm using the data to create the wars_stable dataframe: since we'll be performing many different operations on wars, we can conveniently restore it by re-assigning wars_stable to wars before each new change.
wars_stable <- read.csv('https://www.correlatesofwar.org/data-sets/COW-war/non-state-war-data-1/at_download/file')[, c(1, 2, 12, 22:24)]
wars_stable$TotalCombatDeaths <- as.integer(as.character(gsub(',', '', wars_stable$TotalCombatDeaths))) # strip thousands commas so the counts parse as integers
We need to replace every single -9 with NA so that R knows how to handle these missing values, and doesn't think that there were lots of wars responsible for -9 deaths (that is, 9 births). The simplest way to do this is to replace them column by column.
wars <- wars_stable
wars$SideADeaths[wars$SideADeaths == -9] <- NA
wars$SideBDeaths[wars$SideBDeaths == -9] <- NA
wars$TotalCombatDeaths[wars$TotalCombatDeaths == -9] <- NA
This is less than ideal for many reasons. If you're working with a real dataset, there are probably more than three variables with missingness (seriously, download the whole COW non-state wars data and see how many variables have missing observations). There's also the risk of making a mistake in copy-pasting and accidentally overwriting side A's combat deaths with NAs based on the positions of -9s in side B's combat deaths. A better approach is to write a function to handle this replacement.
wars <- wars_stable
replace_missing <- function(x) {
  x[x == -9] <- NA
  x
}
wars$SideADeaths <- replace_missing(wars$SideADeaths)
wars$SideBDeaths <- replace_missing(wars$SideBDeaths)
wars$TotalCombatDeaths <- replace_missing(wars$TotalCombatDeaths)
This is an improvement, but we can still do this even faster with lapply().
wars <- wars_stable
replace_missing <- function(x) {
  x[x == -9] <- NA
  x
}
wars[] <- lapply(wars, replace_missing) # [] returns a dataframe instead of a list
We can condense this even further by using an anonymous function as an argument to lapply() instead of explicitly declaring a function. Try accomplishing this missing data correction with just one line of code (hint: you'll need to use ifelse() instead of the <- assignment operator in your anonymous function).
wars <- wars_stable
wars[] <- lapply(wars, function(x) ifelse(x == -9, NA, x))
This is much faster, but the function is tailored to one specific missing data code. What happens when we have data with more than one missing data code? We could just write a function for each one, but there is a faster way using closures.

Closures are functions that return functions; you can think of them as functions that make functions. We can harness this behavior to write a function that we feed a specific missing data code, and it will return a function tailored to replacing that specific code with NAs.
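Before applying this to the wars data, here's a minimal closure sketch (make_power is a made-up example): the returned function 'remembers' the exponent that was supplied when it was created.

make_power <- function(exponent) {
  function(x) x ^ exponent  # exponent is captured from the enclosing call
}
square <- make_power(2)
cube <- make_power(3)
square(4)  # 16
cube(2)    # 8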
Re-download the COW non-state wars data, but don't drop any columns this time. Write a closure that can create functions to replace any missing data code with NA, then use it to create functions and replace all the missing data codes (-8, -9) with NA. (Why should you be using lapply() instead of apply() in this code?)
wars <- wars_stable
replace_missing <- function(code) {
  function(x) {
    x[x == code] <- NA
    x
  }
}
replace_8 <- replace_missing(-8)
replace_9 <- replace_missing(-9)
wars[] <- lapply(wars, replace_8)
wars[] <- lapply(wars, replace_9)
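A design note: the same closure can absorb several codes at once if we match with %in% instead of ==. A sketch, using the hypothetical name replace_missing_multi:

replace_missing_multi <- function(codes) {
  function(x) {
    x[x %in% codes] <- NA  # %in% matches against a whole vector of codes
    x
  }
}
wars[] <- lapply(wars, replace_missing_multi(c(-8, -9)))  # one pass instead of two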
A functional is a function that takes a function as an argument and returns a vector as output. The apply() family are all functionals because their FUN argument asks for a function. Try writing a functional that accepts different summary functions as an argument and applies them to 1000 random draws from a \(\mathcal{U}(-5,5)\) distribution.
mc.summary <- function(func) func(runif(1000, -5, 5))
mc.summary(mean)
mc.summary(min)
mc.summary(max)
mc.summary(sd)
mc.summary(range)
## [1] -0.089
## [1] -5
## [1] 5
## [1] 2.9
## [1] -5 5
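Since mc.summary() simply calls whatever function it's handed, we can also pass it an anonymous function. A usage sketch (the quantile probabilities here are arbitrary):

mc.summary(function(x) quantile(x, c(0.05, 0.95)))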
We can combine functionals with anonymous functions to greatly reduce the amount of code needed to accomplish various tasks. First, use a functional to drop all non-integer columns from wars.
wars <- wars_stable
wars <- wars[, which(apply(wars, 2, class) == 'integer')] # this doesn't work; why?
wars <- wars[, which(sapply(wars, class) == 'integer')] # this does
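If the question in the comment has you stuck, here's a quick diagnostic sketch (run on wars_stable, since wars has just been subset):

class(as.matrix(wars_stable))         # apply() coerces its input to a matrix first
unique(apply(wars_stable, 2, class))  # a matrix holds a single type, so every column
                                      # is coerced to character (WarName is character)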
Next, write a loop that creates a mean-centered version of the wars data, and a normalized version. Then use functionals and anonymous functions to accomplish the same thing in two lines of code.
wars <- wars_stable; wars$WarName <- NULL
wars_cent <- matrix(0, nrow(wars), ncol(wars))
wars_norm <- matrix(0, nrow(wars), ncol(wars))
for (i in 1:ncol(wars)) {
  wars_cent[, i] <- wars[, i] - mean(wars[, i], na.rm = T)
  wars_norm[, i] <- wars[, i] / max(wars[, i], na.rm = T)
}
wars_cent <- data.frame(lapply(wars, function(x) x - mean(x, na.rm = T)))
wars_norm <- data.frame(lapply(wars, function(x) x / max(x, na.rm = T)))
The reshape2 package provides functions that let us easily aggregate and summarize data by our chosen variable(s). This process involves the melt() function, which converts data from wide to long form, and a series of *cast() functions, which use an aggregation function to convert back to a reduced wide form. For example, if we want to obtain the average fuel economy of cars in the mtcars dataset by number of cylinders, we would use dcast(melt(mtcars, id.vars = 'cyl'), cyl ~ variable, mean) (since we perform this operation on the entire dataframe, it will also give us the mean of every other variable by number of cylinders). Notice that the same ID variable is used in both function calls; this is how we tell the functions which variable we want to aggregate by.
We can combine the reshape2 package with anonymous functions to obtain by-group summary statistics beyond the standard built-in ones. Use an anonymous function to obtain the difference between the maximum and minimum of all other variables in the mtcars data, by transmission type.
library(reshape2)
dcast(melt(mtcars, id.vars = 'am'), am ~ variable, function(x) max(x) - min(x))
## am mpg cyl disp hp drat wt qsec vs gear carb
## 1 0 14 4 352 183 1.2 3.0 7.5 1 1 3
## 2 1 19 4 280 283 1.4 2.1 5.4 1 1 7
We can obtain summary statistics by multiple ID variables as well, but this requires us to return the results in an array instead. Try getting the mean of all variables in the mtcars data by transmission type and number of cylinders.
acast(melt(mtcars, id.vars = c('am', 'cyl')), am ~ cyl ~ variable, mean)[1:2, 1:3, c(1, 3)]
## , , mpg
##
## 4 6 8
## 0 23 19 15
## 1 28 21 15
##
## , , hp
##
## 4 6 8
## 0 85 115 194
## 1 82 132 300
The plyr package allows us to simplify many operations on data that we could otherwise perform using base R functions. It consists of a series of **ply() functions, where the first letter denotes the input type and the second letter denotes the output type. The most common types are dataframe (d), list (l), and array (a). We can use the transform() function to append a new column to a dataframe as a function of its existing columns. This syntax makes it easy to combine with ddply() to add a column to an existing dataframe that is the result of by-group operations on other columns. Use ddply() and transform() to add a new column to the baseball dataset that is the player's career year for each season (current year - starting year).
library(plyr)
baseball <- ddply(.data = baseball, .variables = 'id', .fun = transform,
                  cyear = year - min(year) + 1, .progress = progress_text(3))
The plyr package can also be used to easily carry out the 'no pooling' approach to analyzing hierarchical data. We want to carry out \(m\) different regressions, where \(m\) is the number of groups in the data, estimating individual slopes and intercepts for each group. We could do this with a for loop, or potentially using lapply(). However, plyr is more convenient because we can control the form in which the results are returned. The dlply() function accepts a dataframe and returns a list, which is ideal if we want to input a dataframe and receive a list of regressions. Similarly, we can use ldply() to get a dataframe of coefficients from this list of regressions. Use the functions in the plyr package to perform a 'no pooling' analysis of the Orange dataset, regressing circumference on age. Create dataframes of the estimated coefficients and standard errors for each tree. See if you can do this in just three lines of code.
orange_models <- dlply(Orange, 'Tree', function(x) lm(circumference ~ age, data = x))
orange_coefs <- ldply(orange_models, coef)
orange_ses <- ldply(orange_models, function(x) sqrt(diag(vcov(x))))
Take a grouped dataset (observations within states, individuals, regions, years, etc.) that is of interest to you and use reshape2 to obtain group means and standard deviations, and plyr to conduct a no pooling analysis of a response variable. Report the estimated coefficients and standard errors in dataframes.
Using the Broken Function.R script on Sakai, work with browser() and the debugger to find all the mistakes in the function. Fix them so that the last two lines of the script return a vector of zeroes. Include the corrected function in your notebook and demonstrate that it works.