Chaining if_else() statements using the superheroes dataset with dplyr

Holly Emblem
4 min read · Aug 13, 2018

Often, you’ll want to recode variables in R as part of exploratory data analysis (EDA). With dplyr, it’s super easy to create new variables or recode existing ones using if_else().

if_else() vs ifelse() — A note on efficiency

Base R comes with its own vectorised if-else function, ifelse(). dplyr’s if_else() is a stricter variant: it checks that the true and false values are of the same type, which, according to the dplyr documentation, makes the output type more predictable and the function somewhat faster. Colin Gillespie and Robin Lovelace benchmark this in Efficient R Programming, finding that dplyr’s if_else() is faster than base ifelse(), but that both are slower than a hard-coded alternative:

An additional quirk of ifelse() is that although it is more programmer efficient, as it is more concise and understandable than multi-line alternatives, it is often less computationally efficient than a more verbose alternative…

marks = runif(n = 10e6, min = 30, max = 99)
system.time({
result1 = ifelse(marks >= 40, "pass", "fail")
})
#> user system elapsed
#> 4.012 0.276 4.286
system.time({
result2 = rep("fail", length(marks))
result2[marks >= 40] = "pass"
})
#> user system elapsed
#> 0.192 0.032 0.223
identical(result1, result2)
#> [1] TRUE

…A simple solution is to use the if_else() function from dplyr, although, as discussed in the same thread, it cannot replace ifelse() in all situations. For our exam result test example, if_else() works fine and is much faster than base R’s implementation (although it is still around 3 times slower than the hard-coded solution):

system.time({
result3 = dplyr::if_else(marks >= 40, "pass", "fail")
})
#> user system elapsed
#> 1.032 0.104 1.134
identical(result1, result3)
#> [1] TRUE
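
To make the strictness point concrete, here is a minimal sketch. The vector marks_small is made up for illustration. Base ifelse() silently coerces mismatched branches to a common type, whereas if_else() raises an error when the true and false values cannot be combined. The exact error message differs between dplyr versions, so the sketch only checks that an error occurs.

```r
library(dplyr)

# Hypothetical toy data, not from the article
marks_small <- c(35, 72, 58)

# Base R: branches of different types are silently coerced to character
base_result <- ifelse(marks_small >= 40, "pass", 0L)
base_result

# dplyr: mixing a character branch and an integer branch is an error,
# which we catch here so the script keeps running
strict_result <- tryCatch(
  if_else(marks_small >= 40, "pass", 0L),
  error = function(e) "if_else() refused to mix types"
)
strict_result

# With matching types, if_else() gives the same answer as ifelse()
if_else(marks_small >= 40, "pass", "fail")
```

The silent coercion is the kind of surprise the stricter function is designed to prevent.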

Chaining if_else() statements

We can now take a look at how easy it is to chain if_else() statements in R. To start with, we’ll work with the superheroes dataset from Kaggle. First, we’ll import and join the datasets:

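library(dplyr)
# Read the two Kaggle CSV files (assumed to be in the working directory)
heroes <- read.csv("heroes_information.csv")
superPowers <- read.csv("super_hero_powers.csv")
# Rename the first column of superPowers so both tables share a "name" key
names(superPowers)[1] <- "name"
# Keep only the heroes that appear in both tables
superJoin <- inner_join(heroes, superPowers, by = "name")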
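
If you don’t have the CSV files to hand, the join step can be sketched on toy data. Everything below is hypothetical: the data frames, the hero names, and the key column name hero_names are stand-ins chosen for illustration. The sketch also shows that inner_join() can map differently named key columns directly, which is one possible alternative to renaming the column first.

```r
library(dplyr)

# Hypothetical stand-ins for the two CSV files (made-up rows)
heroes_toy <- data.frame(
  name = c("Hero A", "Hero B", "Hero C"),
  Gender = c("Male", "Female", "-"),
  stringsAsFactors = FALSE
)
powers_toy <- data.frame(
  hero_names = c("Hero A", "Hero C", "Hero D"),  # assumed key column name
  Flight = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Option 1: rename the key column first, as in the article
renamed <- powers_toy
names(renamed)[1] <- "name"
join_a <- inner_join(heroes_toy, renamed, by = "name")

# Option 2: tell inner_join() how the differently named keys map
join_b <- inner_join(heroes_toy, powers_toy, by = c("name" = "hero_names"))

# Only heroes present in both tables survive the inner join
join_a
all.equal(join_a, join_b)
```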

Next, we want to examine the different columns in the dataset:

str(superJoin)
'data.frame': 660 obs. of 179 variables:
 $ X        : int 0 1 2 3 4 5 6 7 9 10 ...
 $ name     : chr "A-Bomb" "Abe Sapien" "Abin Sur" "Abomination" ...
 $ Gender   : Factor w/ 3 levels "-","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
 $ Eye.color: Factor w/ 23 levels "-","amber","black",..: 20 4 4 9 4 4 4 4 7 1 ...
 $ Race     : Factor w/ 62 levels "-","Alien","Alpha",..: 24 33 56 28 12 24 1 24 24 1 ...

Note that Gender has three levels. We might wish to create dummy variables for male, female and ‘-’ (unknown), in which case we’ll need an if_else() statement that accommodates one true case and two false cases.

If we want to create a dummy variable for female superheroes, our chained if_else statement will look as follows:

superJoin$female <- if_else(superJoin$Gender == 'Female', 1, if_else(superJoin$Gender == 'Male', 0, 0))

In this example, we create a new column that contains 1 if the character is female, and 0 if the character is male or ‘-’. We can easily create a dummy variable for males in the same way:

superJoin$male <- if_else(superJoin$Gender == 'Male', 1, if_else(superJoin$Gender == 'Female', 0, 0))

And finally, a dummy variable where the gender is unknown:

superJoin$unknown <- if_else(superJoin$Gender == '-', 1, if_else(superJoin$Gender == 'Male', 0, 0))
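
In these dummy variables both false cases collapse to 0, so the chain is mostly illustrative. Chaining matters more when each branch returns a different value. Here is a minimal sketch on a made-up vector (the labels "F", "M" and "U" are arbitrary), with dplyr’s case_when() shown as a flatter way to spell the same logic:

```r
library(dplyr)

# Hypothetical toy data, not the Kaggle dataset
gender <- c("Male", "Female", "-", "Male")

# Chained if_else(): the inner call handles whatever the outer test rejects
chained <- if_else(gender == "Female", "F",
                   if_else(gender == "Male", "M", "U"))

# case_when(): conditions are checked in order; TRUE acts as the catch-all
flat <- case_when(
  gender == "Female" ~ "F",
  gender == "Male"   ~ "M",
  TRUE               ~ "U"
)

chained
identical(chained, flat)
```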

Chaining vs OR/AND — Performance

Of course, this isn’t the only way these dummy variables could be created. We could also look to include an or statement within one if_else(), as opposed to chaining them:

superJoin$unknown <- if_else((superJoin$Gender == "Female" | superJoin$Gender == "Male"), 0, 1)

The end result is the same in this instance, a dummy variable for unknown or ‘-’ gender, but rather than chaining if_else() calls we have used a single OR condition.
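
A quick way to confirm that the two spellings agree is to compare them on a small example. The factor below is hypothetical toy data with the same three levels as Gender. The %in% version is an extra variant not used in this article, included only for comparison.

```r
library(dplyr)

# Hypothetical toy column with the same three levels as superJoin$Gender
gender <- factor(c("Male", "-", "Female", "Male", "-"))

chained <- if_else(gender == "-", 1, if_else(gender == "Male", 0, 0))
with_or <- if_else(gender == "Female" | gender == "Male", 0, 1)
with_in <- if_else(gender %in% c("Female", "Male"), 0, 1)

identical(chained, with_or)
identical(with_or, with_in)
```

The three versions agree here because the toy data has no missing values. With NA present, == propagates NA while %in% returns FALSE, so the results could differ.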

We can use the microbenchmark package to compare these two alternative solutions and review which is more efficient from a performance point of view:

library(microbenchmark)
# Time the chained version against the OR version
microbenchmark(
  superJoin$unknown <- if_else(superJoin$Gender == '-', 1, if_else(superJoin$Gender == 'Male', 0, 0)),
  superJoin$unknown2 <- if_else((superJoin$Gender == "Female" | superJoin$Gender == "Male"), 0, 1)
)
Unit: microseconds
expr                                                                                                  min      lq        mean    median   uq        max      neval
superJoin$unknown <- if_else(superJoin$Gender == "-", 1, if_else(superJoin$Gender == "Male", 0, 0))   168.254  177.3195  219.40  189.788  227.0100  505.231  100
superJoin$unknown2 <- if_else((superJoin$Gender == "Female" | superJoin$Gender == "Male"), 0, 1)      124.968  132.6745  155.09  144.728  165.2365  311.911  100

Microbenchmark runs each expression 100 times (neval) and from there calculates the min, lower quartile, mean, median, upper quartile, and max.

We can see from this output that the OR version is faster than the chained if_else() on every reported statistic, with a median of roughly 145 microseconds against roughly 190. That is certainly something to bear in mind if you wish to productionise your code.
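
To rerun the comparison without the Kaggle files, a sketch along these lines should work. The synthetic gender vector and the labels chained and with_or are assumptions made for illustration. Timings depend on your machine and on the R and dplyr versions, so no expected numbers are shown.

```r
library(dplyr)
library(microbenchmark)

# Synthetic data (an assumption for illustration): 100,000 random genders
set.seed(1)
gender <- sample(c("Male", "Female", "-"), size = 1e5, replace = TRUE)

# Naming the expressions keeps the printed table readable
bm <- microbenchmark(
  chained = if_else(gender == "-", 1, if_else(gender == "Male", 0, 0)),
  with_or = if_else(gender == "Female" | gender == "Male", 0, 1),
  times = 100
)

# One row per expression: min, lq, mean, median, uq, max and neval
summary(bm)
```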

Useful Resources

Hadley Wickham, Advanced R: http://adv-r.had.co.nz/Performance.html

Colin Gillespie and Robin Lovelace, Efficient R Programming: https://csgillespie.github.io/efficientR/index.html

If_else() documentation: https://www.rdocumentation.org/packages/dplyr/versions/0.7.6/topics/if_else
