Two examples of how to initialize a vector
my_vec = c(1,2,3,4,5,6,7,8,9,10)
my_vec
## [1] 1 2 3 4 5 6 7 8 9 10
my_vec = 1:10
my_vec
## [1] 1 2 3 4 5 6 7 8 9 10
How to initialize an empty vector, to store things in later
my_vec = vector(mode = 'numeric', length = 10 )
my_vec
## [1] 0 0 0 0 0 0 0 0 0 0
Initialize a matrix called my_mat
my_mat = matrix(1:10, ncol = 2, nrow = 5)
my_mat
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
class(my_mat)
## [1] "matrix"
Name the rows and columns using colnames() and row.names()
colnames(my_mat) = c('A', 'B')
row.names(my_mat) = c('Sample1', 'Sample2', 'Sample3', 'Sample4', 'Sample5')
my_mat
## A B
## Sample1 1 6
## Sample2 2 7
## Sample3 3 8
## Sample4 4 9
## Sample5 5 10
Get the dimensions of my_mat. Dimensions are reported as c(row, column) where numbers of rows in the first bin, and number of columns in the second bin
dim(my_mat)
## [1] 5 2
Can also ask for the number of rows and number of columns directly with ncol() and nrow()
ncol(my_mat)
## [1] 2
nrow(my_mat)
## [1] 5
Can see the column names and row names of my_mat with rownames() and colnames()
#get the column names of my_mat
colnames(my_mat)
## [1] "A" "B"
#get the row names of my_mat
rownames(my_mat)
## [1] "Sample1" "Sample2" "Sample3" "Sample4" "Sample5"
Format is: my_mat[row, column]
Can access with numeric indices
# row 2, column 1
my_mat[2,1]
## [1] 2
Can access with row names and column names
# row 2, column 1
my_mat['Sample2','A']
## [1] 2
Get all rows, 2nd column. Don’t specify a row, just use a comma
# all rows, 2nd column
my_mat[,2]
## Sample1 Sample2 Sample3 Sample4 Sample5
## 6 7 8 9 10
Get all columns, 1st row. Don’t specify a column, just use a comma
# all columns, 1st row
my_mat[1,]
## A B
## 1 6
Initialize a data frame
my_df = data.frame(A = c(1,2,3,4,5), B = c(6,7,8,9,10), row.names = c('Sample1', 'Sample2', 'Sample3', 'Sample4', 'Sample5'))
my_df
## A B
## Sample1 1 6
## Sample2 2 7
## Sample3 3 8
## Sample4 4 9
## Sample5 5 10
# get the class -- data.frame
class(my_df)
## [1] "data.frame"
Indexing dataframes is the same as matrices, but you can also index the columns using the $.
3 ways to access column A, the first column
my_df[,'A']
## [1] 1 2 3 4 5
my_df[,1]
## [1] 1 2 3 4 5
my_df$A #can't do this with matrices!
## [1] 1 2 3 4 5
Each element of a list can contain vectors, matrices, lists, etc. of any shape or size.
Here, we have a list where the first element will be a vector, the second is a matrix, and the third is a data.frame
my_list = list(my_vec, my_mat, my_df)
my_list
## [[1]]
## [1] 0 0 0 0 0 0 0 0 0 0
##
## [[2]]
## A B
## Sample1 1 6
## Sample2 2 7
## Sample3 3 8
## Sample4 4 9
## Sample5 5 10
##
## [[3]]
## A B
## Sample1 1 6
## Sample2 2 7
## Sample3 3 8
## Sample4 4 9
## Sample5 5 10
Use double square brackets to access a list element. Here, we’re accessing the first element of my_list.
my_list[[1]]
## [1] 0 0 0 0 0 0 0 0 0 0
Access the second element of the list
my_list[[2]]
## A B
## Sample1 1 6
## Sample2 2 7
## Sample3 3 8
## Sample4 4 9
## Sample5 5 10
Then you can subset it with matrix subsetting (since its a matrix)
my_list[[2]][1,1]
## [1] 1
my_list[[2]]['Sample1','A']
## [1] 1
Subset a list with single square brackets. This will make your list a length of 2 – containing only the 1st and 3rd elements of my_list()
my_list[c(1,3)]
## [[1]]
## [1] 0 0 0 0 0 0 0 0 0 0
##
## [[2]]
## A B
## Sample1 1 6
## Sample2 2 7
## Sample3 3 8
## Sample4 4 9
## Sample5 5 10
Say, your samples in my_df came from the following sources: either dog or cat as such.
sample_source = c('cat', 'dog', 'cat', 'cat', 'dog')
See how the length of sample_source is the same as the number of rows in my_df – as each sample is either from “dog” or “cat”
length(sample_source)
## [1] 5
dim(my_df)
## [1] 5 2
Create a logical vector – ask “where does sample source equal cat”. Will give TRUE when sample_source is cat and FALSE when sample_source is dog.
sample_source == 'cat'
## [1] TRUE FALSE TRUE TRUE FALSE
See that the class is a logical
class(sample_source == 'cat')
## [1] "logical"
We can subset the column ‘A’ in my_df on cat samples using the logical vector.
my_df$A[sample_source == 'cat']
## [1] 1 3 4
Similarily, we can subset my_df on the dog samples using another logical vector – and use comma to indicate we want all the columns.
my_df[sample_source == 'dog',]
## A B
## Sample2 2 7
## Sample5 5 10
for (thing in sequence){
do thing
}
Initialize vector of colors
colors = c('red', 'blue', 'green')
Create function my_fav_col() which takes as input a string and outputs another string “My favorite color is ____”
my_fav_col <- function(color){
return(paste('My favorite color is', color))
}
Iterating through indexing
for (i in 1:length(colors)){
print( my_fav_col(colors[i]) )
}
## [1] "My favorite color is red"
## [1] "My favorite color is blue"
## [1] "My favorite color is green"
Iterating through a vector called colors
for (thing in colors){
print( my_fav_col(thing) )
}
## [1] "My favorite color is red"
## [1] "My favorite color is blue"
## [1] "My favorite color is green"
Pass each element of colors to our function, my_fav_col
sapply(colors, my_fav_col)
## red blue
## "My favorite color is red" "My favorite color is blue"
## green
## "My favorite color is green"
Recall my_mat
my_mat
## A B
## Sample1 1 6
## Sample2 2 7
## Sample3 3 8
## Sample4 4 9
## Sample5 5 10
Our goal is to get the mean of each row. Let’s do so with both a for loop (2.2.1) and an apply statement (2.2.2)
for (i in 1:nrow(my_mat)){
row = my_mat[i,]
print(mean(row))
}
## [1] 3.5
## [1] 4.5
## [1] 5.5
## [1] 6.5
## [1] 7.5
apply(my_mat, 1, mean)
## Sample1 Sample2 Sample3 Sample4 Sample5
## 3.5 4.5 5.5 6.5 7.5
Say we have 100,000 circle radii (1-100,000) and we want to calculate the area of each circle.
# Vector of circle radii
radii = 1:100000
Here’s our function to calculate the area of a circle:
# Function for area of a circle
area_of_circle <- function(r){
return(pi * r^2)
}
Here’s an example of running our function on one radius – a radius of 10
# Example of using the function
area_of_circle(r = 10)
## [1] 314.1593
Let’s see how much time it takes to use 1) vectorized R code 2) sapply() statement versus 3) a for loop. We’ll run it N times to compare the trends.
# initialize vectors to store run time
repeats = 40
time_vec = vector('numeric', length = repeats)
time_sapply = vector('numeric', length = repeats)
time_for = vector('numeric', length = repeats)
for (i in 1:repeats){
start = Sys.time()
areas_method1 = area_of_circle(radii)
end = Sys.time()
time_vec[i] = as.numeric(end - start)
}
for (i in 1:repeats){
start = Sys.time()
areas_method2 = sapply(radii, area_of_circle)
end = Sys.time()
time_sapply[i] = as.numeric(end - start)
}
for (j in 1:repeats){
start = Sys.time()
areas_method3 = vector(mode = 'numeric', length = length(radii))
for (i in 1:length(radii)){
areas_method3[i] = area_of_circle(radii[i])
}
end = Sys.time()
time_for[j] = as.numeric(end - start)
}
max_val = max(c(time_vec, time_sapply, time_for)) + 0.01
hist(time_vec, col = rgb(1,0,0,0.5), breaks = seq(0,max_val, 0.01), xlim = c(0,max_val),
main = 'Comparing speed for vectorized code, for loop, apply', xlab = 'Run time (seconds)')
hist(time_sapply, col = rgb(0,1,0,0.5), breaks = seq(0,max_val, 0.01),add = TRUE)
hist(time_for, col = rgb(0,0,1,0.5), breaks = seq(0,max_val, 0.01),add = TRUE)
legend('topright', fill = c('red', 'green', 'blue'), legend = c('vectorized', 'sapply', 'for loop'))
To summarize:
vectorized code is faster than sapply() and for loops
for loops and sapply() are similar in speed, depending on the length of the vector and the task
for loops completes same task in multiple lines of code, vectorization & sapply in 1 line of code so you might want to use an apply family function for clarigy
Let’s explore with examples.
Read in TV ratings data
tv_data = read.csv('data/IMDb_Economist_tv_ratings.csv', stringsAsFactors = FALSE)
Let’s take a look at the TV ratings data
head(tv_data, 15)
## titleId seasonNumber title date av_rating share
## 1 tt2879552 1 11.22.63 2016-03-10 8.4890 0.51
## 2 tt3148266 1 12 Monkeys 2015-02-27 8.3407 0.46
## 3 tt3148266 2 12 Monkeys 2016-05-30 8.8196 0.25
## 4 tt3148266 3 12 Monkeys 2017-05-19 9.0369 0.19
## 5 tt3148266 4 12 Monkeys 2018-06-26 9.1363 0.38
## 6 tt1837492 1 13 Reasons Why 2017-03-31 8.4370 2.38
## 7 tt1837492 2 13 Reasons Why 2018-05-18 7.5089 2.19
## 8 tt0285331 1 24 2002-02-16 8.5641 6.67
## 9 tt0285331 2 24 2003-02-09 8.7028 7.13
## 10 tt0285331 3 24 2004-02-09 8.7173 5.88
## 11 tt0285331 4 24 2005-03-08 8.5610 3.48
## 12 tt0285331 5 24 2006-03-10 8.7828 2.15
## 13 tt0285331 6 24 2007-03-08 8.3522 2.23
## 14 tt0285331 7 24 2009-03-10 8.4922 2.04
## 15 tt0285331 8 24 2010-03-17 8.4744 1.55
## genres
## 1 Drama,Mystery,Sci-Fi
## 2 Adventure,Drama,Mystery
## 3 Adventure,Drama,Mystery
## 4 Adventure,Drama,Mystery
## 5 Adventure,Drama,Mystery
## 6 Drama,Mystery
## 7 Drama,Mystery
## 8 Action,Crime,Drama
## 9 Action,Crime,Drama
## 10 Action,Crime,Drama
## 11 Action,Crime,Drama
## 12 Action,Crime,Drama
## 13 Action,Crime,Drama
## 14 Action,Crime,Drama
## 15 Action,Crime,Drama
Say we wanted to figure out what the max number of seasons was for each show. We could approach this with a for loop or an sapply. Let’s do both and compare.
For loop.
#initialize vector to store max season number in
max_season = rep(NA, length(unique(tv_data$title)))
# loop through indices 1 to length of the unique TV show titles
# access the show name (show)
# get all the seasons for the show (all_season_numbers)
# get the max season number and store in max_season[index]
#name storage vector to be the show title
for (i in 1:length(unique(tv_data$title))){
show = unique(tv_data$title)[i]
all_season_numbers = tv_data$seasonNumber[tv_data$title == show]
max_season[i] = max(all_season_numbers)
}
names(max_season) = unique(tv_data$title)
head(max_season)
## 11.22.63 12 Monkeys 13 Reasons Why
## 1 4 2
## 24 24: Legacy 24: Live Another Day
## 8 1 1
sapply version
max_season = sapply(unique(tv_data$title), function(show){
all_season_numbers = tv_data$seasonNumber[tv_data$title == show]
max(all_season_numbers)
})
head(max_season)
## 11.22.63 12 Monkeys 13 Reasons Why
## 1 4 2
## 24 24: Legacy 24: Live Another Day
## 8 1 1
hist(max_season, breaks = seq(0,45,1), main = 'Number of seasons per show', xlab = 'Number of seasons')
avg_rating_per_show = sapply(unique(tv_data$title), function(show){
tv_data$av_rating[tv_data$title == show]
})
head(avg_rating_per_show)
## $`11.22.63`
## [1] 8.489
##
## $`12 Monkeys`
## [1] 8.3407 8.8196 9.0369 9.1363
##
## $`13 Reasons Why`
## [1] 8.4370 7.5089
##
## $`24`
## [1] 8.5641 8.7028 8.7173 8.5610 8.7828 8.3522 8.4922 8.4744
##
## $`24: Legacy`
## [1] 7.2044
##
## $`24: Live Another Day`
## [1] 8.899
Use an sapply() to figure out how many seasons/entries each show has
num_seasons = sapply(unique(tv_data$title), function(show){
length(tv_data$seasonNumber[tv_data$title == show])
})
Subset on the show titles with 2 seasons/entries
two_seasons = unique(tv_data$title)[num_seasons == 2]
Find the average rating for the shows with two seasons
avg_rating_two_seasons = sapply(two_seasons, function(show){
(tv_data$av_rating[tv_data$title == show])
})
head(t(avg_rating_two_seasons))
## [,1] [,2]
## 13 Reasons Why 8.4370 7.5089
## 7th Heaven 7.7000 6.3000
## A Series of Unfortunate Events 8.1052 8.2873
## Alphas 7.6436 7.8699
## American Crime Story 8.7267 8.3940
## American Gothic 7.8000 7.5350
avg_rating_two_seasons = lapply(two_seasons, function(show){
(tv_data$av_rating[tv_data$title == show])
})
head(avg_rating_two_seasons)
## [[1]]
## [1] 8.4370 7.5089
##
## [[2]]
## [1] 7.7 6.3
##
## [[3]]
## [1] 8.1052 8.2873
##
## [[4]]
## [1] 7.6436 7.8699
##
## [[5]]
## [1] 8.7267 8.3940
##
## [[6]]
## [1] 7.800 7.535
Here, we want to know what is the average rating per show across all seasaons
avg_rating_per_title = tapply(tv_data$av_rating, tv_data$title, mean)
head(avg_rating_per_title)
## 11.22.63 12 Monkeys 13 Reasons Why
## 8.489000 8.833375 7.972950
## 24 24: Legacy 24: Live Another Day
## 8.580850 7.204400 8.899000
Now let’s play with apply statements with another dataset – bacterial growth curve data.
growth = read.csv('data/mock_growth_curve_data.csv', row.names = 1)
Let’s take a look at how the data is organized. We have 8 wells growing bacteria, where each well is a column. Rows indicate time, and time was taken every 15 minutes.
head(growth)
## Well1 Well2 Well3 Well4 Well5 Well6 Well7 Well8
## 0 0.122 0.098 0.111 0.116 0.181 0.180 0.187 0.184
## 15 0.090 0.091 0.090 0.090 0.161 0.161 0.159 0.159
## 30 0.092 0.093 0.092 0.092 0.186 0.186 0.185 0.185
## 45 0.097 0.097 0.097 0.097 0.227 0.228 0.227 0.228
## 60 0.105 0.105 0.105 0.104 0.299 0.300 0.300 0.301
## 75 0.118 0.116 0.118 0.117 0.394 0.394 0.395 0.395
Say we want to take the mean at each timepoint. We can use apply, loop over the rows (1), and use the function mean()
mean_growth = apply(growth, 1, mean)
plot(row.names(growth), mean_growth, type = 'l', ylab = 'OD', xlab = 'Time (min)', main = 'Average growth')
We can also plot our curves with apply. Here, we can plot each well on its own plot.
apply(growth, 2, function(col){
plot(row.names(growth), col, type = 'l', ylab = 'OD', xlab = 'Time (min)')
})
## NULL
We can also plot all curves on the same plot
plot(row.names(growth), growth$Well1, type = 'l', ylab = 'OD', xlab = 'Time (min)')
apply(growth, 2, function(col){
lines(row.names(growth), col)
})
## NULL
It looks like there are two growth patterns. Let’s explore by plotting the max difference at each time point.
max_difference = apply(growth, 1, function(row){
max(row) - min(row)
})
plot(row.names(growth), max_difference, type = 'l', ylab = 'max OD difference', xlab = 'Time (min)')
Here’s a Data Camp tutorial to learn more about paralell programming in R – specifically future and future.apply()