Have you ever copied and pasted code because you want to reuse it but with different data or in a slightly different way? If so, you might want to make that code into a function! Using functions also makes your code much easier to read.
You’ve already used lots of built-in functions in R. Examples are print(), read.csv(), and sum(), to name a few.
Functions have an input and an output. We provide the input, and then the function does things to generate the output. Another way to put this is functions take arguments (i.e. input) and return an output.
# Anatomy of a function
my_function = function(input){
# do things to input
return(output)
}
Let’s take a look at a few different analogies to get a better idea of what functions are.
| “Function” name | Input: what “function” takes | Under the hood: what “function” does | Output: what “function” returns |
|---|---|---|---|
| Vending machine | Money & snack choice | Some computational/mechanical process | Snack |
| Google Maps | Start & end location | Finds fastest route | Directions for fastest route |
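To tie the analogies back to R, here is a small sketch of the same input/output pattern using the built-in sum() function. The variable names numbers and total are only illustrative.

```r
# input: a vector of numbers (the argument we provide)
numbers = c(1, 2, 3)
# under the hood: sum() adds up the values
# output: a single number, which we store in a variable
total = sum(numbers)
print(total)
## [1] 6
```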
Let’s start by writing a simple function to learn the basics.
Let’s write a function called pow that raises one number to the power of another. It takes two numbers, a base and an exponent, and returns the base raised to the exponent. It’s important to document what your function does so other people can use it.
# write function to find power of two numbers
pow = function(base, exponent){
# find power of base raised to exponent
# example: pow(3,2)
power = base ^ exponent
return(power)
}
Now let’s test our function out! You can use any two numbers as the input to the function.
If you include the argument names, then you can include the numbers in any order you want:
# using numbers as input
# explicitly name arguments (order doesn't matter)
pow(exponent = 2, base = 3)
## [1] 9
pow(base = 3, exponent = 2)
## [1] 9
Here, you should get the same answer for both.
If you decide to just include the numbers and not the names, then you have to make sure the numbers are in the correct order (i.e. the order in which the arguments are defined in the function - base first and exponent second):
# using numbers as input
# using the order of the arguments (order matters)
pow(3,2)
## [1] 9
pow(2,3)
## [1] 8
Here, you should get a different answer for each.
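You can also mix the two styles. As a minimal sketch (redefining pow here so the example is self-contained), R matches the named arguments first and then fills in the remaining arguments by position:

```r
pow = function(base, exponent){
  base ^ exponent
}
# base is matched by position, exponent by name
pow(3, exponent = 2)
## [1] 9
# exponent is matched by name first, so 3 fills the remaining argument (base)
pow(exponent = 2, 3)
## [1] 9
```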
You can also use variables as input:
# using variables as input
b = 3
e = 2
pow(b,e)
## [1] 9
Just like built-in functions, you can also save the output of the function to a variable:
# saving it to a variable
p = pow(b,e)
print(p)
## [1] 9
We can also write this function in a shorter way if you want. In R, the value of the last expression evaluated in the function is what is returned; you don’t have to call return() explicitly.
# function to find power of two numbers
pow = function(base, exponent){
# find power of base raised to exponent
# example: pow(3,2)
base ^ exponent
}
# test it out
pow(3,2)
## [1] 9
For more complicated functions, we have to balance code length and readability. You don’t want it so short that people aren’t able to understand it!
Let’s try other inputs, because that’s the real power of using functions (no pun intended).
pow(10,3)
## [1] 1000
Feel free to try out other inputs as well!
The input arguments of the pow function are base and exponent. These variables are defined only within the function, not in the global environment. So we can print out base and exponent inside the function, but if we try to print either of them outside of the function, we will get an error (unless a variable with that name also happens to be defined in your global environment). Let’s try it out. What do you think happens if we try to print out base outside of the function?
# print base outside of function
print(base)
## Error in print(base): object 'base' not found
It doesn’t exist! This is called the scope of the variables: they are visible inside the function, but not in the global environment.
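The same applies to variables created inside a function, not just its arguments. Here is a minimal sketch, assuming you have not defined a variable called local_result yourself; pow_with_local and local_result are hypothetical names used only for illustration.

```r
pow_with_local = function(base, exponent){
  # local_result is created inside the function (hypothetical name)
  local_result = base ^ exponent
  local_result
}
answer = pow_with_local(3, 2)
print(answer)
## [1] 9
# check whether local_result can be found outside the function
print(exists("local_result"))
## [1] FALSE
```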
Now let’s get some practice with printing variables inside functions, where they are actually defined. Print base inside the function:
# function to find power of two numbers
pow = function(base, exponent){
# find power of base raised to exponent
# example: pow(3,2)
print(base)
base ^ exponent
}
# test it out
p = pow(3,2)
## [1] 3
# print output of function
print(p)
## [1] 9
What happens if we write it this way instead? Why?
# function to find power of two numbers
pow = function(base, exponent){
# find power of base raised to exponent
# example: pow(3,2)
base ^ exponent
print(base)
}
# test it out
p = pow(3,2)
## [1] 3
# print output of function
print(p)
## [1] 3
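The value of the last expression in the function is what is returned. Here the last expression is print(base), and print() returns its argument (invisibly), so base is returned instead of the power. If you don’t use the return statement, make sure the last line is actually what you want to return!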
How about this way? Again, why?
# function to find power of two numbers
pow = function(base, exponent){
# find power of base raised to exponent
# example: pow(3,2)
return(base ^ exponent)
print(base)
}
# test it out
p = pow(3,2)
# print output of function
print(p)
## [1] 9
Code after a return statement won’t be executed, so here base is not printed out because the function stops at the return line.
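The fact that return() stops the function can also be used on purpose, for example to exit early when the input isn’t usable. The sketch below shows one possible way to do this; the name pow_checked and the choice to return NA are illustrative assumptions, not part of the original function.

```r
pow_checked = function(base, exponent){
  # exit early if either input is not a number
  if(!is.numeric(base) || !is.numeric(exponent)){
    return(NA)
  }
  # only reached when both inputs are numeric
  base ^ exponent
}
pow_checked(3, 2)
## [1] 9
pow_checked("three", 2)
## [1] NA
```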
If you want something to happen by default, but want to give the user the option to turn it off, you can use optional arguments. For instance, suppose you want the function to print out the base variable by default, but let the user skip the printing if they want. You can code this using another argument, say, print_base. If the user doesn’t specify a value, the function uses the default, which is set where the argument is defined in the function:
pow = function(base, exponent, print_base=TRUE){
if(print_base){
print(base)
}
base ^ exponent
}
# default
pow(2, 3)
## [1] 2
## [1] 8
# print_base = FALSE
pow(2, 3, print_base = FALSE)
## [1] 8
If you want more practice with default arguments, try adding an optional print_exponent argument with a default value of TRUE.
pow = function(base, exponent, print_base=TRUE, print_exponent=TRUE){
if(print_base){
print(base)
}
if(print_exponent){
print(exponent)
}
base ^ exponent
}
# default
pow(2, 3)
## [1] 2
## [1] 3
## [1] 8
# print_base = FALSE
pow(2, 3, print_base = FALSE)
## [1] 3
## [1] 8
Note: You have to supply every argument that doesn’t have a default value. Otherwise you get an error as soon as the function tries to use the missing argument, because nothing is stored in that variable:
# what happens if you run this line?
pow(2)
## [1] 2
## Error in print(exponent): argument "exponent" is missing, with no default
What argument are we missing here?
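If an argument has a sensible fallback value, one way to avoid this kind of error is to give it a default. A minimal sketch, assuming a hypothetical variant called pow_default that squares the base when no exponent is given:

```r
pow_default = function(base, exponent = 2){
  # exponent falls back to 2 when the caller leaves it out
  base ^ exponent
}
pow_default(2)
## [1] 4
pow_default(2, 3)
## [1] 8
```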
Another important note is that it doesn’t matter what we call the input arguments. Right now, the input arguments are base and exponent. Let’s try changing them to something totally random, maybe pizza and pie. Pizza and pie probably don’t have anything to do with the input (two numbers), but the computer doesn’t know that!
# use pizza as variable name
# function to find power of two numbers
pow = function(pizza, pie){
# find power of base raised to exponent
# example: pow(3,2)
pizza ^ pie
}
# test it out
pow(3,2)
## [1] 9
Although you can name your input arguments anything since the computer doesn’t care, you actually want to name them something useful so that people reading the code (including your future self!) can more easily understand what’s going on. Thinking of good variable names can be hard, but it’s important!
If you want to return multiple variables from your function, such as the base, the exponent, and the result, you can return them as a list.
pow = function(base, exponent, print_base=TRUE){
if (print_base){
print(base)
}
return(list(base=base,
exp=exponent,
p=base ^ exponent))
}
# test it out
result = pow(3,2)
## [1] 3
print(result)
## $base
## [1] 3
##
## $exp
## [1] 2
##
## $p
## [1] 9
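Once the values are in a list, you can pull out individual elements by name. The sketch below uses a simplified version of pow (without the print_base argument) so that it runs on its own:

```r
pow = function(base, exponent){
  list(base = base, exp = exponent, p = base ^ exponent)
}
result = pow(3, 2)
# access an element with $
result$p
## [1] 9
# or with double square brackets and the element name
result[["exp"]]
## [1] 2
```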
Say you make a function that you want to be able to reuse for different analyses in different scripts. You can save it in its own R script and then load it into other scripts using the source function. Here’s an example of how you would use it if the name of the script containing your function is power.R:
# load power function
source('/path/to/power.R')
# use power function
pow(3,2)
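If you want to try this without creating a file by hand, here is a self-contained sketch that writes the function to a temporary script and then loads it with source. The temporary file stands in for power.R; script_path is just an illustrative variable name.

```r
# write a small script containing the function to a temporary file
script_path = tempfile(fileext = ".R")
writeLines(c(
  "pow = function(base, exponent){",
  "  base ^ exponent",
  "}"
), script_path)
# load the function from the script, then use it
source(script_path)
pow(3, 2)
## [1] 9
```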
Now that we’ve gone over how to write a function, it’s time to try it out yourself! Write a function called average that takes a vector of numbers and returns the mean of those numbers. If you want to get fancy, try writing it in a different script and then using the source function to load it in and use it. You can test your function using an input you know the answer to as well as the built-in mean function in R.
# function to calculate mean
average = function(numbers){
# find mean of vector of numbers
sum(numbers)/length(numbers)
}
# test it out
average(0:100)
## [1] 50
mean(0:100)
## [1] 50
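Rather than comparing the printed numbers by eye, you can let R do the checking. This is one possible approach, using the built-in stopifnot and all.equal functions; the test input c(2, 4, 6) is chosen only because its mean is easy to work out by hand.

```r
average = function(numbers){
  sum(numbers)/length(numbers)
}
# input with a known answer: the mean of 2, 4 and 6 is 4
# stopifnot() stays silent if the check passes and errors if it fails
stopifnot(average(c(2, 4, 6)) == 4)
# compare against the built-in mean; all.equal allows for tiny rounding differences
all.equal(average(0:100), mean(0:100))
## [1] TRUE
```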
Okay, now we’re ready to move on to a more interesting example. One thing I often find myself wanting to do is plot multiple histograms, for instance the ages of people in two different groups. Let’s write a function to do this!
First, let’s download some data we’ll use.
# create data directory if there isn't one
dir.create('data', showWarnings = FALSE)
# download data if you don't already have it
if(!file.exists("data/gapminder_data.csv")){
download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", destfile = 'data/gapminder_data.csv')
}
# read in gapminder data
gapminder = read.csv('data/gapminder_data.csv')
# look at gapminder data
head(gapminder)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
Say we want to look at the distribution of life expectancy for countries grouped by continent. We could just write some code to do this, but what if we also want to look at the distribution of GDP per capita grouped by continent, or at the distribution of values grouped by something else in a totally different data set? This is why a function could be useful here. Let’s call our function multihist, meaning we want to plot multiple histograms on one plot.
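Side note: When you want to pass column names to ggplot as strings stored in variables, like we’ll want to do for the function, you have to use aes_string instead of aes. (Recent versions of ggplot2 deprecate aes_string in favor of aes with the .data pronoun, e.g. aes(.data[[x]]); aes_string still works but may show a deprecation warning.)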
# load library
library(ggplot2)
# function to plot overlaid histograms of one column, grouped by another column
# input:
# df: dataframe with information to plot in the columns
# x: column name with x values
# y: column name with values to separate by
# output: histogram
multihist = function(df,x,y){
ggplot(df, aes_string(x, fill = y)) +
geom_histogram(alpha = 0.5, position = 'identity') + theme_classic() # counts
}
Now let’s test out our function:
# test out function
multihist(gapminder,'pop','continent')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
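The message above is ggplot2 telling us that it picked 30 bins by default. One possible extension, sketched here under the assumption that you want the caller to control this, is an optional bins argument with a default value. The name multihist_bins is hypothetical, the small data frame is made up for illustration, and the sketch uses aes with the .data pronoun (which newer ggplot2 versions recommend) rather than aes_string.

```r
library(ggplot2)

multihist_bins = function(df, x, y, bins = 30){
  # .data[[x]] looks up the column whose name is stored in x
  ggplot(df, aes(.data[[x]], fill = .data[[y]])) +
    geom_histogram(alpha = 0.5, position = 'identity', bins = bins) +
    theme_classic()
}

# made-up data for illustration
toy = data.frame(value = c(1, 2, 2, 3, 5, 6, 6, 7),
                 group = rep(c('a', 'b'), each = 4))
p = multihist_bins(toy, 'value', 'group', bins = 5)
print(p)
```

Because bins has a default, calling multihist_bins(toy, 'value', 'group') without it behaves like the original function, just without the message about the bin count.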
We can also loop over many variables and use our function to make a plot for each:
# loop over multiple variables
for(var in c('pop','lifeExp','gdpPercap')){
# in a for loop, you have to print the plot out to see it
print(multihist(gapminder,var,'continent'))
}
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Nice job! If you want more practice, try making a histogram of life expectancy stratified by year. You can also use this same function on another dataframe. Feel free to test it out on one of your own!
gapminder$year = as.character(gapminder$year)
multihist(gapminder,'lifeExp','year')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.