Chaining if_else() statements using the superheroes dataset with dplyr
Often, you’ll want to recode variables within R as part of exploratory data analysis (EDA). With dplyr, you can create new variables or recode existing ones using if_else().
if_else() vs ifelse() — A note on efficiency
Base R comes with its own vectorised if-else function, ifelse(). Compared with the base function, dplyr’s if_else() is stricter: it checks that the true and false values are of the same type, which the dplyr documentation says makes the output type more predictable and the function somewhat faster. The speed claim is supported by Colin Gillespie and Robin Lovelace in Efficient R Programming, who found that dplyr’s if_else() is faster than base ifelse(), but that both are slower than a hard-coded alternative:
An additional quirk of ifelse() is that although it is more programmer efficient, as it is more concise and understandable than multi-line alternatives, it is often less computationally efficient than a more verbose alternative…
marks = runif(n = 10e6, min = 30, max = 99)
system.time({
  result1 = ifelse(marks >= 40, "pass", "fail")
})
#> user system elapsed
#> 4.012 0.276 4.286
system.time({
  result2 = rep("fail", length(marks))
  result2[marks >= 40] = "pass"
})
#> user system elapsed
#> 0.192 0.032 0.223
identical(result1, result2)
#> [1] TRUE
…A simple solution is to use the if_else() function from dplyr, although, as discussed in the same thread, it cannot replace ifelse() in all situations. For our exam result test example, if_else() works fine and is much faster than base R’s implementation (although it is still around 3 times slower than the hard-coded solution):
system.time({
  result3 = dplyr::if_else(marks >= 40, "pass", "fail")
})
#> user system elapsed
#> 1.032 0.104 1.134
identical(result1, result3)
#> [1] TRUE
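The type strictness described above can be seen in a small sketch. The vector x and the labels are made up for illustration; the only dplyr features used are if_else() and its missing argument.

```r
library(dplyr)

x <- c(5, -2, NA, 8)  # hypothetical input with one missing value

# Base ifelse() accepts branches of different types and coerces silently:
# the numeric 0 ends up as the string "0".
base_result <- ifelse(x > 0, "positive", 0)
print(base_result)

# dplyr::if_else() refuses to mix a character branch with a numeric one.
# tryCatch() turns the error into a flag so the script keeps running.
strict_failed <- tryCatch(
  {
    if_else(x > 0, "positive", 0)
    FALSE
  },
  error = function(e) TRUE
)
print(strict_failed)

# With matching branch types it works, and `missing` sets the value
# used where the condition is NA.
strict_result <- if_else(x > 0, "positive", "not positive", missing = "unknown")
print(strict_result)
```

The silent coercion in the base version is the kind of surprise that the stricter check is meant to surface early.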
Chaining if_else() statements
We can now take a look at how easy it is to chain if_else() statements in R. To start with, we’ll work with the superheroes dataset from Kaggle. First, we’ll import and join the datasets:
library(dplyr)
heroes <- read.csv("heroes_information.csv")
superPowers <- read.csv("super_hero_powers.csv")
# Rename the first column of superPowers so both tables share a "name" key
names(superPowers)[1] <- "name"
superJoin <- inner_join(heroes, superPowers, by = "name")
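To make the rename-then-join step concrete, here is a minimal sketch on two tiny made-up data frames. The hero names, the hero_names column and the Agility column are assumptions for illustration and are not read from the Kaggle files.

```r
library(dplyr)

# Hypothetical miniature stand-ins for the two CSV files
heroes_toy <- data.frame(
  name = c("Hero A", "Hero B", "Hero C"),
  Gender = c("Male", "Female", "-"),
  stringsAsFactors = FALSE
)
powers_toy <- data.frame(
  hero_names = c("Hero A", "Hero B", "Hero D"),
  Agility = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Give the key column the same name in both tables
names(powers_toy)[1] <- "name"

# inner_join() keeps only rows whose key appears in both tables,
# so "Hero C" and "Hero D" are dropped
joined_toy <- inner_join(heroes_toy, powers_toy, by = "name")
print(joined_toy)
```

An inner join silently drops unmatched heroes, so it can be worth comparing nrow() before and after the join.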
Next, we want to examine the different columns in the dataset:
str(superJoin)
'data.frame': 660 obs. of 179 variables:
$ X : int 0 1 2 3 4 5 6 7 9 10 ...
$ name : chr "A-Bomb" "Abe Sapien" "Abin Sur" "Abomination" ...
$ Gender : Factor w/ 3 levels "-","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
$ Eye.color : Factor w/ 23 levels "-","amber","black",..: 20 4 4 9 4 4 4 4 7 1 ...
$ Race : Factor w/ 62 levels "-","Alien","Alpha",..: 24 33 56 28 12 24 1 24 24 1 ...
We can note that Gender has three levels: '-', Female and Male. We might wish to create a dummy variable for each of them, in which case the if_else() statement needs to handle one true case and two false cases.
If we want to create a dummy variable for female superheroes, our chained if_else statement will look as follows:
superJoin$female <- if_else(superJoin$Gender == 'Female', 1, if_else(superJoin$Gender == 'Male', 0, 0))
In this example, we create a new column that contains 1 if the character is female and 0 if the character is male or '-'. We can create a dummy variable for males in the same way:
superJoin$male <- if_else(superJoin$Gender == 'Male', 1, if_else(superJoin$Gender == 'Female', 0, 0))
And finally, a dummy variable where the gender is unknown:
superJoin$unknown <- if_else(superJoin$Gender == '-', 1, if_else(superJoin$Gender == 'Male', 0, 0))
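Because both false branches in these statements return 0, the inner if_else() does not change the result. The sketch below checks this on a made-up gender vector. It then shows a chain whose branches do differ, next to an equivalent dplyr::case_when() call. case_when() is a swapped-in alternative for comparison and is not used elsewhere in this post.

```r
library(dplyr)

gender <- c("Male", "Female", "-", "Female")  # hypothetical values

# The chained dummy and a single if_else() give the same column,
# since the inner call returns 0 on both of its branches
chained_dummy <- if_else(gender == "Female", 1, if_else(gender == "Male", 0, 0))
single_dummy <- if_else(gender == "Female", 1, 0)
print(identical(chained_dummy, single_dummy))

# Chaining matters once each level maps to a different value
chained_label <- if_else(gender == "Female", "F",
                         if_else(gender == "Male", "M", "unknown"))

# The same mapping written as one case_when() call;
# TRUE ~ ... acts as the final "else" branch
case_label <- case_when(
  gender == "Female" ~ "F",
  gender == "Male" ~ "M",
  TRUE ~ "unknown"
)
print(identical(chained_label, case_label))
```

For two or three levels the chained form stays readable. Beyond that, a flat list of conditions is usually easier to scan than nested calls.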
Chaining vs OR/AND — Performance
Of course, this isn’t the only way these dummy variables could be created. We could also use an OR condition (|) inside a single if_else(), as opposed to chaining them:
superJoin$unknown <- if_else((superJoin$Gender == "Female" | superJoin$Gender == "Male"), 0, 1)
The end result is the same in this instance, a dummy variable for unknown ('-') gender, but we have used an OR condition instead of chaining if_else().
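The claim that both forms give the same column can be checked directly. This is a minimal sketch on a made-up gender vector; the %in% variant is an extra alternative added here for comparison, not part of the original examples.

```r
library(dplyr)

gender <- c("Male", "Female", "-", "Female", "-")  # hypothetical values

# Chained version, as used for the "unknown" dummy
chained <- if_else(gender == "-", 1, if_else(gender == "Male", 0, 0))

# OR condition inside a single if_else()
with_or <- if_else(gender == "Female" | gender == "Male", 0, 1)

# Set membership expresses the same OR more compactly
with_in <- if_else(gender %in% c("Female", "Male"), 0, 1)

print(identical(chained, with_or))
print(identical(with_or, with_in))
```

One difference to keep in mind: if gender contained NA, the == comparisons would propagate NA into the result, while %in% would return FALSE and the row would receive 1.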
We can use the microbenchmark package to compare these two alternative solutions and review which is more efficient from a performance point of view:
library(microbenchmark)
microbenchmark(
  superJoin$unknown <- if_else(superJoin$Gender == '-', 1, if_else(superJoin$Gender == 'Male', 0, 0)),
  superJoin$unknown2 <- if_else((superJoin$Gender == "Female" | superJoin$Gender == "Male"), 0, 1)
)
Unit: microseconds
expr:
[1] superJoin$unknown <- if_else(superJoin$Gender == "-", 1, if_else(superJoin$Gender == "Male", 0, 0))
[2] superJoin$unknown2 <- if_else((superJoin$Gender == "Female" | superJoin$Gender == "Male"), 0, 1)
expr     min       lq   mean  median       uq     max neval
[1]  168.254 177.3195 219.40 189.788 227.0100 505.231   100
[2]  124.968 132.6745 155.09 144.728 165.2365 311.911   100
By default, microbenchmark runs each expression 100 times (neval) and reports the min, lower quartile, mean, median, upper quartile, and max of the timings.
We can see from this output that the OR condition was faster than the chained if_else() in this run, with a median of roughly 145 microseconds against 190. The gap is small in absolute terms, but it is something to bear in mind if you wish to productionise your code.
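To rerun the comparison without the Kaggle files, the sketch below benchmarks both forms on a simulated gender vector. The vector, its length of 660 (chosen to match the row count shown by str() earlier) and the labels chained and or_condition are assumptions for illustration. Timings will differ between machines, so no expected numbers are shown.

```r
library(dplyr)
library(microbenchmark)

set.seed(1)
# Simulated stand-in for superJoin$Gender
gender <- sample(c("-", "Female", "Male"), size = 660, replace = TRUE)

# Named arguments give each row a readable label in the output;
# times = 100 matches the default number of repetitions
bm <- microbenchmark(
  chained = if_else(gender == "-", 1, if_else(gender == "Male", 0, 0)),
  or_condition = if_else(gender == "Female" | gender == "Male", 0, 1),
  times = 100
)
print(bm)

# summary() returns a data frame, so individual statistics can be extracted
bm_summary <- summary(bm)
print(bm_summary[, c("expr", "median")])
```

Benchmarking the bare if_else() calls, rather than assignments into superJoin, keeps the cost of modifying the data frame out of the measurement.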
Useful Resources
Hadley Wickham, Advanced R: http://adv-r.had.co.nz/Performance.html
Colin Gillespie and Robin Lovelace, Efficient R Programming: https://csgillespie.github.io/efficientR/index.html
if_else() documentation: https://www.rdocumentation.org/packages/dplyr/versions/0.7.6/topics/if_else