Writing FUNctions in R

“Function” name	Input: what “function” takes	Under the hood: what “function” does	Output: what “function” returns
Vending machine	Money & snack choice	Some computational/mechanical process	Snack
Google Maps	Start & end location	Finds fastest route	Directions for fastest route

Real Examples

Boring example so we can get our feet wet

Writing and using a simple function

Let’s start out writing a simple function, just to learn the basics of how to write functions.

Let’s write a function called pow that raises one number to the power of another. It takes two arguments, a base and an exponent, and returns the base raised to the exponent. It’s important to document what your function does so other people can use it.

# write function to find power of two numbers
pow = function(base, exponent){
  # find power of base raised to exponent
  # example: pow(3,2)
  power = base ^ exponent
  return(power)
}

Now let’s test our function out! You can use any two numbers as the input to the function.

If you include the argument names, then you can include the numbers in any order you want:

# using numbers as input
# explicitly name arguments (order doesn't matter)
pow(exponent = 2, base = 3)

## [1] 9

pow(base = 3, exponent = 2)

## [1] 9

Here, you should get the same answer for both.

If you decide to just include the numbers and not the names, then you have to make sure the numbers are in the correct order (i.e. the order in which the arguments are defined in the function - base first and exponent second):

# using numbers as input
# using the order of the arguments (order matters)
pow(3,2)

## [1] 9

pow(2,3)

## [1] 8

Here, you should get a different answer for each.
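You can also mix the two styles. As a small sketch (pow is redefined here only so the snippet runs on its own): R first matches any arguments you name, then fills in the remaining arguments by position.

```r
# same pow function as above, repeated so this snippet is self-contained
pow = function(base, exponent){
  base ^ exponent
}

# base is matched by name, so the unnamed 2 fills the remaining argument (exponent)
mixed = pow(2, base = 3)
print(mixed)

# this is the same call as naming both arguments
print(identical(mixed, pow(base = 3, exponent = 2)))
```

Mixing styles works, but naming either all or none of the arguments is usually easier to read.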

You can also use variables as input:

# using variables as input
b = 3
e = 2
pow(b,e)

## [1] 9

Just like built-in functions, you can also save the output of the function to a variable:

# saving it to a variable
p = pow(b,e)
print(p)

## [1] 9

We can also write this function in a shorter way. In R, a function returns the value of the last expression it evaluates, so you don’t have to call return() explicitly.

#  function to find power of two numbers
pow = function(base, exponent){
  # find power of base raised to exponent
  # example: pow(3,2)
  base ^ exponent
}

# test it out
pow(3,2)

## [1] 9

For more complicated functions, we have to balance code length and readability. You don’t want it so short that people aren’t able to understand it!

Let’s try other inputs, because that’s the real power of using functions (no pun intended).

pow(10,3)

## [1] 1000

Feel free to try out other inputs as well!
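One more input worth trying is a vector. The sketch below assumes the short version of pow from above (redefined so the snippet runs on its own). Because the ^ operator works element by element, the function handles whole vectors without any extra code.

```r
# short version of pow, repeated so this snippet is self-contained
pow = function(base, exponent){
  base ^ exponent
}

# square every number from 1 to 5
squares = pow(1:5, 2)
print(squares)   # 1 4 9 16 25

# raise 2 to several different exponents
powers_of_two = pow(2, 0:3)
print(powers_of_two)   # 1 2 4 8
```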

Scope of argument variables

The input arguments in the power function are base and exponent. These variables exist only inside the function, not in the global environment. So we can print out base and exponent within the function, but if we try to print either of them outside of the function, we will get an error (unless a variable with the same name also happens to be defined in your global environment). Let’s try it out. What do you think happens if we try to print out base outside of the function?

# print base outside of function
print(base)

## Error in print(base): object 'base' not found

It doesn’t exist! This is called the scope of the variables - they can only be seen in the function, but not in the global environment.
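Here is a small sketch of the “unless it’s defined in your global environment” case. The global variable base and the extra line inside the function are made up for illustration. The function works on its own copy of base, so changing it inside the function leaves the global one alone.

```r
# hypothetical global variable that happens to share the argument's name
base = 10

pow = function(base, exponent){
  # this changes only the copy of base inside the function
  base = base + 1
  base ^ exponent
}

result = pow(3, 2)
print(result)   # (3 + 1)^2 = 16
print(base)     # the global base is still 10
```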

Printing and returning variables in functions

Now let’s get some practice with printing variables inside functions, where they are actually defined. Print base inside the function:

#  function to find power of two numbers
pow = function(base, exponent){
  # find power of base raised to exponent
  # example: pow(3,2)
  print(base)
  base ^ exponent
}

# test it out
p = pow(3,2)

## [1] 3

# print output of function
print(p)

## [1] 9

What happens if we write it this way instead? Why?

#  function to find power of two numbers
pow = function(base, exponent){
  # find power of base raised to exponent
  # example: pow(3,2)
  base ^ exponent
  print(base)
}

# test it out
p = pow(3,2)

## [1] 3

# print output of function
print(p)

## [1] 3

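A function returns the value of its last line. Here that line is print(base), and print hands back the value it printed, so the function returns base (3) and the result of base ^ exponent is thrown away. If you don’t use the return statement, make sure the last line is actually what you want to return!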

How about this way? Again, why?

#  function to find power of two numbers
pow = function(base, exponent){
  # find power of base raised to exponent
  # example: pow(3,2)
  return(base ^ exponent)
  print(base)
}

# test it out
p = pow(3,2)
# print output of function
print(p)

## [1] 9

Code after a return statement won’t be executed, so here base is not printed out because the function stops at the return line.
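Stopping at return can also be useful on purpose. As a rough sketch (pow_checked and its choice to return NA are assumptions for illustration, not part of the function above), an early return lets the function bail out before doing any math on input it can’t handle:

```r
# hypothetical variant of pow that exits early for non-numeric input
pow_checked = function(base, exponent){
  if (!is.numeric(base) || !is.numeric(exponent)){
    # the function stops here, so the line below never runs for bad input
    return(NA)
  }
  base ^ exponent
}

print(pow_checked(3, 2))     # 9
print(pow_checked("3", 2))   # NA, because "3" is text, not a number
```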

Optional arguments with default values

If you want something to normally happen, but have the option for it to not happen, you can use optional arguments. For instance, you might want the function to print out the base variable by default, but give the user the option to turn that off. You can code this using another argument, say, the print_base argument. If the user doesn’t specify a value, the function uses the default value, which you set where you define the argument in the function:

pow = function(base, exponent, print_base=TRUE){
  if(print_base){
    print(base)
  }
  base ^ exponent
}
# default
pow(2, 3)

## [1] 2

## [1] 8

# print_base = FALSE
pow(2, 3, print_base = FALSE)

## [1] 8

If you want more practice with default arguments, try adding an optional print_exponent argument with a default value of TRUE.

pow = function(base, exponent, print_base=TRUE, print_exponent=TRUE){
  if(print_base){
    print(base)
  }
  if(print_exponent){
    print(exponent)
  }
  base ^ exponent
}
# default
pow(2, 3)

## [1] 2
## [1] 3

## [1] 8

# print_base = FALSE
pow(2, 3, print_base = FALSE)

## [1] 3

## [1] 8

Note: You have to supply every argument that doesn’t have a default value. If you don’t, you get an error as soon as the function tries to use the missing argument, because there is nothing stored in that variable (code that runs before that point, like printing the base, still executes):

# what happens if you run this line?
pow(2)

## [1] 2

## Error in print(exponent): argument "exponent" is missing, with no default

What argument are we missing here?
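If you’d rather give the user a friendlier message, one option is to check for the argument yourself. This is only a sketch: pow_strict and the wording of the message are made up, while missing(), stop() and tryCatch() are built into R.

```r
# hypothetical variant of pow that checks for the exponent up front
pow_strict = function(base, exponent){
  if (missing(exponent)){
    stop("argument 'exponent' is required, e.g. pow_strict(3, 2)")
  }
  base ^ exponent
}

# catch the error so the script keeps running, and keep just the message
msg = tryCatch(pow_strict(2), error = function(e) conditionMessage(e))
print(msg)

# with both arguments it behaves like pow
print(pow_strict(3, 2))
```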

Argument variable names

Another important note is that it doesn’t matter what we call the input arguments. Right now, the input arguments are base and exponent. Let’s try changing them to something totally random, maybe pizza and pie. Pizza and pie probably don’t have anything to do with the input (two numbers), but the computer doesn’t know that!

# use pizza as variable name
#  function to find power of two numbers
pow = function(pizza, pie){
  # find power of base raised to exponent
  # example: pow(3,2)
  pizza ^ pie
}

# test it out
pow(3,2)

## [1] 9

Although you can name your input arguments anything since the computer doesn’t care, you actually want to name them something useful so that people reading the code (including your future self!) can more easily understand what’s going on. Thinking of good variable names can be hard, but it’s important!

Returning multiple variables from a function

If you want to return multiple variables from your function, such as the base, the exponent, and the result, you can return them as a list.

pow = function(base, exponent, print_base=TRUE){
  if (print_base){
    print(base)
  }
  return(list(base=base,
              exp=exponent,
              p=base ^ exponent))
}

# test it out
result = pow(3,2)

## [1] 3

print(result)

## $base
## [1] 3
## 
## $exp
## [1] 2
## 
## $p
## [1] 9
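Once you have the list, you can pull out individual pieces by name. A quick sketch, using a trimmed-down pow without the printing so the snippet runs on its own:

```r
# trimmed-down version of the list-returning pow
pow = function(base, exponent){
  list(base = base, exp = exponent, p = base ^ exponent)
}

result = pow(3, 2)

# get one element by name with $
print(result$p)

# or with double square brackets and the name as a string
print(result[["exp"]])

# see which names are available
print(names(result))
```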

Loading in functions from another file

Say you make a function that you want to be able to reuse for different analyses in different scripts. You can save it in its own R script and then load it into other scripts using the source function. Here’s an example of how you would use it if the name of the script containing your function is power.R:

# load power function
source('/path/to/power.R')
# use power function
pow(3,2)
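If you want to try this without creating a real file first, here is a sketch that writes a tiny script to a temporary file and then sources it. The temporary file stands in for power.R; in a real project you would save the script yourself and use its actual path.

```r
# write a tiny script to a temporary file (a stand-in for power.R)
script_path = tempfile(fileext = ".R")
writeLines(c(
  "pow = function(base, exponent){",
  "  base ^ exponent",
  "}"
), script_path)

# load the function definition from the file, then use it
source(script_path)
sourced_result = pow(3, 2)
print(sourced_result)
```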

Try writing your own function before we move on

Now that we’ve gone over how to write a function, it’s time to try it out yourself! Write a function called average that takes a vector of numbers and returns the mean of those numbers. If you want to get fancy, try writing it in a different script and then using the source function to load it in and use it. You can test your function using an input you know the answer to as well as the built-in mean function in R.

# function to calculate mean
average = function(numbers){
  # find mean of vector of numbers
  sum(numbers)/length(numbers)
}

# test it out
average(0:100)

## [1] 50

mean(0:100)

## [1] 50
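You can also let R do the comparing for you. The sketch below uses stopifnot, which stays silent when a check passes and throws an error when it fails. The test vector is made up, and average is repeated so the snippet runs on its own.

```r
# same average function as above
average = function(numbers){
  sum(numbers)/length(numbers)
}

# an input where we know the answer: (2 + 4 + 6 + 8) / 4 = 5
test_input = c(2, 4, 6, 8)
stopifnot(average(test_input) == 5)

# compare against the built-in mean (all.equal allows tiny rounding differences)
stopifnot(isTRUE(all.equal(average(test_input), mean(test_input))))

# one thing to watch out for: a missing value makes the whole result NA
with_missing = average(c(1, NA, 3))
print(with_missing)
```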

More interesting example

Okay, now we’re ready to move on to a more interesting example. One thing I often find myself wanting to do is plot multiple histograms on one plot, for instance to compare the ages of people in two different groups. Let’s write a function to do this!

First, let’s download some data we’ll use.

# create data directory if there isn't one
dir.create('data', showWarnings = FALSE)
# download data if you don't already have it
if(!file.exists("data/gapminder_data.csv")){
  download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", destfile = 'data/gapminder_data.csv')
}
# read in gapminder data
gapminder = read.csv('data/gapminder_data.csv')
# look at gapminder data
head(gapminder)

##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134

Say we want to look at the distribution of life expectancy for countries grouped by continent. We could just write some code to do this, but what if we also want to look at the distribution of GDP per capita grouped by continent, or what if we want to look at the distribution of values grouped by something else for a totally different data set? This is why a function could be useful here. Let’s call our function multihist, meaning we want to plot multiple histograms on one plot.

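Side note: When the column names are stored as strings in variables, like they will be inside our function, plain aes won’t work, so here we use aes_string. (In ggplot2 3.0.0 and later, aes_string is deprecated in favor of aes with the .data pronoun, e.g. .data[[x]], so newer versions may warn when you use it.)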

# load library
library(ggplot2)

# function to plot multiple overlaid histograms from a dataframe
# input:
#   df: dataframe with information to plot in the columns
#   x: name (as a string) of the column with the x values
#   y: name (as a string) of the column to group (and color) by
# output: ggplot object containing the histograms
multihist = function(df,x,y){
  ggplot(df, aes_string(x, fill = y)) +
    geom_histogram(alpha = 0.5, position = 'identity') + theme_classic() # counts
}
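Since aes_string is deprecated in newer ggplot2 releases, here is a sketch of the same idea written with aes and the .data pronoun instead. The name multihist_data and the toy data frame are made up for illustration, and the labs call is there so the axis and legend show the column names.

```r
library(ggplot2)

# sketch of multihist using aes() with the .data pronoun instead of aes_string
multihist_data = function(df, x, y){
  ggplot(df, aes(.data[[x]], fill = .data[[y]])) +
    geom_histogram(alpha = 0.5, position = 'identity') +
    labs(x = x, fill = y) +   # label the axis and legend with the column names
    theme_classic()
}

# small made-up data frame, only for illustration
toy = data.frame(value = c(1, 2, 2, 3, 5, 6, 6, 7),
                 group = rep(c('a', 'b'), each = 4))

p = multihist_data(toy, 'value', 'group')
print(p)
```

You call it exactly like multihist, with the column names as strings.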

Now let’s test out our function:

# test out function
multihist(gapminder,'pop','continent')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
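That message is ggplot telling us it picked 30 bins on its own. One way to take control of it is an optional argument with a default value, just like print_base earlier. This is only a sketch: multihist_bins, its bins argument, and the toy data frame are assumptions for illustration, and it uses aes with the .data pronoun rather than aes_string.

```r
library(ggplot2)

# hypothetical variant of multihist where the number of bins is optional
multihist_bins = function(df, x, y, bins = 30){
  ggplot(df, aes(.data[[x]], fill = .data[[y]])) +
    geom_histogram(alpha = 0.5, position = 'identity', bins = bins) +
    theme_classic()
}

# small made-up data frame, only for illustration
toy = data.frame(value = c(1, 2, 2, 3, 5, 6, 6, 7),
                 group = rep(c('a', 'b'), each = 4))

# default: 30 bins
p_default = multihist_bins(toy, 'value', 'group')

# override the default with fewer, wider bins
p_coarse = multihist_bins(toy, 'value', 'group', bins = 5)
print(p_coarse)
```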

We can also loop over many variables and use our function to make a plot for each:

# loop over multiple variables
for(var in c('pop','lifeExp','gdpPercap')){
  # in a for loop, you have to print the plot out to see it
  print(multihist(gapminder,var,'continent'))
}

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Nice job! If you want more practice, try making a histogram of life expectancy stratified by year. You can also use this same function on another dataframe. Feel free to test it out on one of your own!

gapminder$year = as.character(gapminder$year)
multihist(gapminder,'lifeExp','year')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Zena Lapp

August 26, 2019
