1. R basics - vectors, matrices, lists, logicals

A. Vectors - 1D arrays:

Initialize a vector

Two examples of how to initialize a vector

my_vec = c(1,2,3,4,5,6,7,8,9,10)
my_vec
##  [1]  1  2  3  4  5  6  7  8  9 10
my_vec = 1:10 
my_vec
##  [1]  1  2  3  4  5  6  7  8  9 10
Initialize an empty vector

How to initialize an empty vector, to store things in later

my_vec = vector(mode = 'numeric', length = 10 )
my_vec
##  [1] 0 0 0 0 0 0 0 0 0 0

B. Matrices/data frames - 2D arrays:

Initialize a matrix

Initialize a matrix called my_mat

my_mat = matrix(1:10, ncol = 2, nrow = 5)
my_mat
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
class(my_mat)
## [1] "matrix"

Name the rows and columns using colnames() and row.names()

colnames(my_mat) = c('A', 'B')
row.names(my_mat) = c('Sample1', 'Sample2', 'Sample3', 'Sample4', 'Sample5')

my_mat 
##         A  B
## Sample1 1  6
## Sample2 2  7
## Sample3 3  8
## Sample4 4  9
## Sample5 5 10

Get the dimensions of my_mat. Dimensions are reported as c(row, column) where numbers of rows in the first bin, and number of columns in the second bin

dim(my_mat)
## [1] 5 2

Can also ask for the number of rows and number of columns directly with ncol() and nrow()

ncol(my_mat)
## [1] 2
nrow(my_mat)
## [1] 5

Can see the column names and row names of my_mat with rownames() and colnames()

#get the column names of my_mat 
colnames(my_mat)
## [1] "A" "B"
#get the row names of my_mat
rownames(my_mat)
## [1] "Sample1" "Sample2" "Sample3" "Sample4" "Sample5"
Access rows/columns in matrices

Format is: my_mat[row, column]

Can access with numeric indices

# row 2, column 1
my_mat[2,1]
## [1] 2

Can access with row names and column names

# row 2, column 1
my_mat['Sample2','A']
## [1] 2

Get all rows, 2nd column. Don’t specify a row, just use a comma

# all rows, 2nd column
my_mat[,2]
## Sample1 Sample2 Sample3 Sample4 Sample5 
##       6       7       8       9      10

Get all columns, 1st row. Don’t specify a column, just use a comma

# all columns, 1st row
my_mat[1,]
## A B 
## 1 6
C. Data frames

Initialize a data frame

my_df = data.frame(A = c(1,2,3,4,5), B = c(6,7,8,9,10), row.names = c('Sample1', 'Sample2', 'Sample3', 'Sample4', 'Sample5'))

my_df
##         A  B
## Sample1 1  6
## Sample2 2  7
## Sample3 3  8
## Sample4 4  9
## Sample5 5 10
# get the class -- data.frame 
class(my_df)
## [1] "data.frame"

Indexing dataframes is the same as matrices, but you can also index the columns using the $.

3 ways to access column A, the first column

my_df[,'A']
## [1] 1 2 3 4 5
my_df[,1]
## [1] 1 2 3 4 5
my_df$A #can't do this with matrices! 
## [1] 1 2 3 4 5

D. Lists

Each element of a list can contain vectors, matrices, lists, etc. of any shape or size.

Here, we have a list where the first element will be a vector, the second is a matrix, and the third is a data.frame

my_list = list(my_vec, my_mat, my_df)

my_list
## [[1]]
##  [1] 0 0 0 0 0 0 0 0 0 0
## 
## [[2]]
##         A  B
## Sample1 1  6
## Sample2 2  7
## Sample3 3  8
## Sample4 4  9
## Sample5 5 10
## 
## [[3]]
##         A  B
## Sample1 1  6
## Sample2 2  7
## Sample3 3  8
## Sample4 4  9
## Sample5 5 10
Accessing element of a list

Use double square brackets to access a list element. Here, we’re accessing the first element of my_list.

my_list[[1]]
##  [1] 0 0 0 0 0 0 0 0 0 0
Acess content with a list element

Access the second element of the list

my_list[[2]]
##         A  B
## Sample1 1  6
## Sample2 2  7
## Sample3 3  8
## Sample4 4  9
## Sample5 5 10

Then you can subset it with matrix subsetting (since its a matrix)

my_list[[2]][1,1]
## [1] 1
my_list[[2]]['Sample1','A']
## [1] 1
Subsetting a list

Subset a list with single square brackets. This will make your list a length of 2 – containing only the 1st and 3rd elements of my_list()

my_list[c(1,3)]
## [[1]]
##  [1] 0 0 0 0 0 0 0 0 0 0
## 
## [[2]]
##         A  B
## Sample1 1  6
## Sample2 2  7
## Sample3 3  8
## Sample4 4  9
## Sample5 5 10

E. Subsetting a data frame with logicals

Say, your samples in my_df came from the following sources: either dog or cat as such.

sample_source = c('cat', 'dog', 'cat', 'cat', 'dog')

See how the length of sample_source is the same as the number of rows in my_df – as each sample is either from “dog” or “cat”

length(sample_source)
## [1] 5
dim(my_df)
## [1] 5 2

Create a logical vector – ask “where does sample source equal cat”. Will give TRUE when sample_source is cat and FALSE when sample_source is dog.

sample_source == 'cat'
## [1]  TRUE FALSE  TRUE  TRUE FALSE

See that the class is a logical

class(sample_source == 'cat')
## [1] "logical"

We can subset the column ‘A’ in my_df on cat samples using the logical vector.

my_df$A[sample_source == 'cat']
## [1] 1 3 4

Similarily, we can subset my_df on the dog samples using another logical vector – and use comma to indicate we want all the columns.

my_df[sample_source == 'dog',]
##         A  B
## Sample2 2  7
## Sample5 5 10

2.1 Iterating through vectors

2.1.1 Iterating through vectors with for loops

A. Structure of a for loop

for (thing in sequence){

do thing

}

B. Examples

Initialize vector of colors

colors = c('red', 'blue', 'green')

Create function my_fav_col() which takes as input a string and outputs another string “My favorite color is ____”

my_fav_col <- function(color){
  return(paste('My favorite color is', color))
}

Iterating through indexing

for (i in 1:length(colors)){ 

  print( my_fav_col(colors[i]) )
  
}
## [1] "My favorite color is red"
## [1] "My favorite color is blue"
## [1] "My favorite color is green"

Iterating through a vector called colors

for (thing in colors){ 

  print( my_fav_col(thing) )
  
}
## [1] "My favorite color is red"
## [1] "My favorite color is blue"
## [1] "My favorite color is green"

2.1.2 Iterating through vectors with sapply()

Pass each element of colors to our function, my_fav_col

sapply(colors, my_fav_col)
##                          red                         blue 
##   "My favorite color is red"  "My favorite color is blue" 
##                        green 
## "My favorite color is green"

2.2 Iterating through matrices

Recall my_mat

my_mat
##         A  B
## Sample1 1  6
## Sample2 2  7
## Sample3 3  8
## Sample4 4  9
## Sample5 5 10

Our goal is to get the mean of each row. Let’s do so with both a for loop (2.2.1) and an apply statement (2.2.2)

2.2.1. Iterating through MATRICES with for loops

for (i in 1:nrow(my_mat)){
  row = my_mat[i,]
  print(mean(row))
}
## [1] 3.5
## [1] 4.5
## [1] 5.5
## [1] 6.5
## [1] 7.5

2.2.2. Iterating through MATRICES with apply()

apply(my_mat, 1, mean)
## Sample1 Sample2 Sample3 Sample4 Sample5 
##     3.5     4.5     5.5     6.5     7.5

3. Vectorized vs. sapply vs. for loop

Say we have 100,000 circle radii (1-100,000) and we want to calculate the area of each circle.

# Vector of circle radii 
radii = 1:100000

Here’s our function to calculate the area of a circle:

# Function for area of a circle 
area_of_circle <- function(r){
  return(pi * r^2)
}

Here’s an example of running our function on one radius – a radius of 10

# Example of using the function 
area_of_circle(r = 10)
## [1] 314.1593

Let’s see how much time it takes to use 1) vectorized R code 2) sapply() statement versus 3) a for loop. We’ll run it N times to compare the trends.

# initialize vectors to store run time 
repeats = 40
time_vec = vector('numeric', length = repeats)
time_sapply = vector('numeric', length = repeats)
time_for = vector('numeric', length = repeats)

A. Vectorized

for (i in 1:repeats){

  start = Sys.time()
  areas_method1 = area_of_circle(radii)
  end = Sys.time()
  
  time_vec[i] = as.numeric(end - start)
}

B. sapply()

for (i in 1:repeats){
  
  start = Sys.time()
  areas_method2 = sapply(radii, area_of_circle) 
  end = Sys.time()
  
  time_sapply[i] = as.numeric(end - start)
}

C. for loop

for (j in 1:repeats){
  
  start = Sys.time()
  
  areas_method3 = vector(mode = 'numeric', length = length(radii))
  
  for (i in 1:length(radii)){
    areas_method3[i] = area_of_circle(radii[i])
    }
  
  end = Sys.time()

  time_for[j] = as.numeric(end - start)
}

D. Comparing speed for vectorized code, for loop, apply

max_val = max(c(time_vec, time_sapply, time_for)) + 0.01

hist(time_vec, col = rgb(1,0,0,0.5), breaks = seq(0,max_val, 0.01), xlim = c(0,max_val), 
     main = 'Comparing speed for vectorized code, for loop, apply', xlab = 'Run time (seconds)')
hist(time_sapply, col = rgb(0,1,0,0.5), breaks = seq(0,max_val, 0.01),add = TRUE)
hist(time_for, col = rgb(0,0,1,0.5), breaks = seq(0,max_val, 0.01),add = TRUE)
legend('topright', fill = c('red', 'green', 'blue'), legend = c('vectorized', 'sapply', 'for loop'))

To summarize:

  • vectorized code is faster than sapply() and for loops

  • for loops and sapply() are similar in speed, depending on the length of the vector and the task

  • for loops completes same task in multiple lines of code, vectorization & sapply in 1 line of code so you might want to use an apply family function for clarigy

Let’s explore with examples.

4. Looping over a vector or list: sapply() or lapply()

Read in TV ratings data

tv_data = read.csv('data/IMDb_Economist_tv_ratings.csv', stringsAsFactors = FALSE)

Let’s take a look at the TV ratings data

head(tv_data, 15)
##      titleId seasonNumber          title       date av_rating share
## 1  tt2879552            1       11.22.63 2016-03-10    8.4890  0.51
## 2  tt3148266            1     12 Monkeys 2015-02-27    8.3407  0.46
## 3  tt3148266            2     12 Monkeys 2016-05-30    8.8196  0.25
## 4  tt3148266            3     12 Monkeys 2017-05-19    9.0369  0.19
## 5  tt3148266            4     12 Monkeys 2018-06-26    9.1363  0.38
## 6  tt1837492            1 13 Reasons Why 2017-03-31    8.4370  2.38
## 7  tt1837492            2 13 Reasons Why 2018-05-18    7.5089  2.19
## 8  tt0285331            1             24 2002-02-16    8.5641  6.67
## 9  tt0285331            2             24 2003-02-09    8.7028  7.13
## 10 tt0285331            3             24 2004-02-09    8.7173  5.88
## 11 tt0285331            4             24 2005-03-08    8.5610  3.48
## 12 tt0285331            5             24 2006-03-10    8.7828  2.15
## 13 tt0285331            6             24 2007-03-08    8.3522  2.23
## 14 tt0285331            7             24 2009-03-10    8.4922  2.04
## 15 tt0285331            8             24 2010-03-17    8.4744  1.55
##                     genres
## 1     Drama,Mystery,Sci-Fi
## 2  Adventure,Drama,Mystery
## 3  Adventure,Drama,Mystery
## 4  Adventure,Drama,Mystery
## 5  Adventure,Drama,Mystery
## 6            Drama,Mystery
## 7            Drama,Mystery
## 8       Action,Crime,Drama
## 9       Action,Crime,Drama
## 10      Action,Crime,Drama
## 11      Action,Crime,Drama
## 12      Action,Crime,Drama
## 13      Action,Crime,Drama
## 14      Action,Crime,Drama
## 15      Action,Crime,Drama

Say we wanted to figure out what the max number of seasons was for each show. We could approach this with a for loop or an sapply. Let’s do both and compare.

For loop.

#initialize vector to store max season number in
max_season = rep(NA, length(unique(tv_data$title))) 

# loop through indices 1 to length of the unique TV show titles 
  # access the show name (show)
  # get all the seasons for the show (all_season_numbers)
  # get the max season number and store in max_season[index]

#name storage vector to be the show title 
for (i in 1:length(unique(tv_data$title))){
  show = unique(tv_data$title)[i]
  all_season_numbers = tv_data$seasonNumber[tv_data$title == show]
  max_season[i] = max(all_season_numbers)
}
names(max_season) = unique(tv_data$title)
head(max_season)
##             11.22.63           12 Monkeys       13 Reasons Why 
##                    1                    4                    2 
##                   24           24: Legacy 24: Live Another Day 
##                    8                    1                    1

sapply version

max_season = sapply(unique(tv_data$title), function(show){
   all_season_numbers = tv_data$seasonNumber[tv_data$title == show]
   max(all_season_numbers)
})

sapply() will output your data in the simplest form it can. Here, we see if outputted a vector because we were trying to access 1 value (the max season number) from each tv show title

head(max_season)
##             11.22.63           12 Monkeys       13 Reasons Why 
##                    1                    4                    2 
##                   24           24: Legacy 24: Live Another Day 
##                    8                    1                    1
hist(max_season, breaks = seq(0,45,1), main = 'Number of seasons per show', xlab = 'Number of seasons')

Let’s explore some other ways sapply() can output your data. Here, let’s find the average rating of each season of a show

avg_rating_per_show = sapply(unique(tv_data$title), function(show){
  tv_data$av_rating[tv_data$title == show]
})
head(avg_rating_per_show)
## $`11.22.63`
## [1] 8.489
## 
## $`12 Monkeys`
## [1] 8.3407 8.8196 9.0369 9.1363
## 
## $`13 Reasons Why`
## [1] 8.4370 7.5089
## 
## $`24`
## [1] 8.5641 8.7028 8.7173 8.5610 8.7828 8.3522 8.4922 8.4744
## 
## $`24: Legacy`
## [1] 7.2044
## 
## $`24: Live Another Day`
## [1] 8.899

You can see sapply() outputted a list because each tv show had a variable number of seasons and thus a variable number of ratings. Thus, the simplest data structure was a list.

Let’s look at anther example – Here, we will first subset on those with 2 seasons (or rather, 2 entries in the dataset):

Use an sapply() to figure out how many seasons/entries each show has

num_seasons = sapply(unique(tv_data$title), function(show){
  length(tv_data$seasonNumber[tv_data$title == show])
})

Subset on the show titles with 2 seasons/entries

two_seasons = unique(tv_data$title)[num_seasons == 2]

Find the average rating for the shows with two seasons

avg_rating_two_seasons = sapply(two_seasons, function(show){
  (tv_data$av_rating[tv_data$title == show])
})

head(t(avg_rating_two_seasons))
##                                  [,1]   [,2]
## 13 Reasons Why                 8.4370 7.5089
## 7th Heaven                     7.7000 6.3000
## A Series of Unfortunate Events 8.1052 8.2873
## Alphas                         7.6436 7.8699
## American Crime Story           8.7267 8.3940
## American Gothic                7.8000 7.5350

Because every show had 2 data points, a matrix was the simplest output

We can control output with lapply() – lapply() will ALWAYS output a list.

avg_rating_two_seasons = lapply(two_seasons, function(show){
  (tv_data$av_rating[tv_data$title == show])
})

head(avg_rating_two_seasons)
## [[1]]
## [1] 8.4370 7.5089
## 
## [[2]]
## [1] 7.7 6.3
## 
## [[3]]
## [1] 8.1052 8.2873
## 
## [[4]]
## [1] 7.6436 7.8699
## 
## [[5]]
## [1] 8.7267 8.3940
## 
## [[6]]
## [1] 7.800 7.535

We can use tapply() to summarize data

Here, we want to know what is the average rating per show across all seasaons

avg_rating_per_title = tapply(tv_data$av_rating, tv_data$title, mean)
head(avg_rating_per_title)
##             11.22.63           12 Monkeys       13 Reasons Why 
##             8.489000             8.833375             7.972950 
##                   24           24: Legacy 24: Live Another Day 
##             8.580850             7.204400             8.899000

5. Looping over a matrix/data.frame: apply()

Now let’s play with apply statements with another dataset – bacterial growth curve data.

Growth curve example

growth = read.csv('data/mock_growth_curve_data.csv', row.names = 1)

Let’s take a look at how the data is organized. We have 8 wells growing bacteria, where each well is a column. Rows indicate time, and time was taken every 15 minutes.

head(growth)
##    Well1 Well2 Well3 Well4 Well5 Well6 Well7 Well8
## 0  0.122 0.098 0.111 0.116 0.181 0.180 0.187 0.184
## 15 0.090 0.091 0.090 0.090 0.161 0.161 0.159 0.159
## 30 0.092 0.093 0.092 0.092 0.186 0.186 0.185 0.185
## 45 0.097 0.097 0.097 0.097 0.227 0.228 0.227 0.228
## 60 0.105 0.105 0.105 0.104 0.299 0.300 0.300 0.301
## 75 0.118 0.116 0.118 0.117 0.394 0.394 0.395 0.395

Say we want to take the mean at each timepoint. We can use apply, loop over the rows (1), and use the function mean()

mean_growth = apply(growth, 1, mean)
plot(row.names(growth), mean_growth, type = 'l', ylab = 'OD', xlab = 'Time (min)', main = 'Average growth')

We can also plot our curves with apply. Here, we can plot each well on its own plot.

apply(growth, 2, function(col){
  plot(row.names(growth), col, type = 'l', ylab = 'OD', xlab = 'Time (min)')
})

## NULL

We can also plot all curves on the same plot

plot(row.names(growth), growth$Well1, type = 'l', ylab = 'OD', xlab = 'Time (min)')
apply(growth, 2, function(col){
  lines(row.names(growth), col)
})

## NULL

It looks like there are two growth patterns. Let’s explore by plotting the max difference at each time point.

max_difference = apply(growth, 1, function(row){
  max(row) - min(row)
})
plot(row.names(growth), max_difference, type = 'l', ylab = 'max OD difference', xlab = 'Time (min)')

6. Faster looping with future.apply

Here’s a Data Camp tutorial to learn more about paralell programming in R – specifically future and future.apply()