A colleague has produced a file with one DNA sequence on each line. Download
the file and load it into R using read.csv(). The file has no header. Name the
resulting data frame sequences.
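As a minimal sketch of reading a header-less file, the example below passes two made-up sequences through the `text` argument instead of a real file path, so it runs on its own. With a downloaded file you would give the path as the first argument instead. The sequence values are illustrative assumptions, not the contents of the colleague's file.

```r
# Two made-up sequences stand in for the downloaded file's contents
sequences <- read.csv(
  text = "attggc\nggccta",
  header = FALSE,            # treat the first line as data, not column names
  stringsAsFactors = FALSE   # keep sequences as character strings, not factors
)

names(sequences)  # "V1": R's default name for the first unnamed column
nrow(sequences)   # 2
```

Without `header = FALSE`, the first sequence would be consumed as the column name and the data frame would have one row fewer.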
Your colleague wants to calculate the GC content of each DNA sequence (i.e., the percentage of bases that are either G or C) and knows just a little R. They sent you the following code, which calculates the GC content for a single sequence:
library(stringr)
sequence <- "attggc"
Gs <- str_count(sequence, "g")
Cs <- str_count(sequence, "c")
gc_content <- (Gs + Cs) / str_length(sequence) * 100
This code uses the excellent stringr package for working with the sequence
data. You’ll need to install this package before using it.
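One detail worth checking before reusing the code: str_count() matches case-sensitively, so the lowercase patterns "g" and "c" only work on lowercase sequences. The sketch below uses a made-up uppercase sequence and assumes that converting to lowercase with str_to_lower() is acceptable for your data. It is an illustration, not part of the colleague's code.

```r
# install.packages("stringr")  # one-time installation; uncomment if needed
library(stringr)

upper_sequence <- "ATTGGC"  # made-up example sequence

# A lowercase pattern finds nothing in an uppercase sequence
str_count(upper_sequence, "g")

# Converting to lowercase first makes the counts independent of case
lower_sequence <- str_to_lower(upper_sequence)
Gs <- str_count(lower_sequence, "g")
Cs <- str_count(lower_sequence, "c")
gc_content <- (Gs + Cs) / str_length(lower_sequence) * 100
gc_content  # 50: (2 + 1) / 6 * 100
```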
Convert the last three lines of this code into a function to calculate the GC
content of a DNA sequence. Name that function get_gc_content.
Use a for loop and your function to calculate the GC content of each
sequence and store the results in a new vector. The function should work on a
single sequence at a time and the for loop should repeatedly call the function
and store the output.
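The general pattern (pre-allocate a vector, then fill one element per iteration) can be sketched with a stand-in function. Here count_bases() and example_sequences are hypothetical names invented for this illustration. In the exercise you would call get_gc_content() on the sequences from your data frame instead.

```r
# Hypothetical stand-in; swap in get_gc_content() once you have written it
count_bases <- function(sequence) {
  nchar(sequence)
}

example_sequences <- c("attggc", "ggcc", "atatatat")  # made-up data

# Pre-allocate one numeric slot per sequence
base_counts <- numeric(length(example_sequences))

# seq_along() yields 1, 2, 3, ... and is safe for empty vectors
for (i in seq_along(example_sequences)) {
  base_counts[i] <- count_bases(example_sequences[i])
}

base_counts  # 6 4 8
```

Pre-allocating avoids growing the vector on every iteration, which matters more as the number of sequences increases.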
Use a for loop and your function to calculate the GC content of each sequence
and store the results in a new data frame. To do this you’ll need to use an
index to loop over the rows of the data frame.
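Looping over rows by index can be illustrated on a toy data frame. The names example_df, lengths_df, and n_bases are assumptions made up for this sketch, and nchar() stands in for the GC calculation. The point is only how a row index selects one value and how one row of the output is filled.

```r
example_df <- data.frame(
  V1 = c("attggc", "ggcc", "atatatat"),  # made-up data
  stringsAsFactors = FALSE
)

# Pre-allocate a one-column data frame with one row per input row
lengths_df <- data.frame(n_bases = numeric(nrow(example_df)))

# seq_len(nrow(...)) gives one index per row
for (i in seq_len(nrow(example_df))) {
  # example_df[i, 1] picks row i, column 1: a single character string
  lengths_df[i, ] <- nchar(example_df[i, 1])
}

lengths_df$n_bases  # 6 4 8
```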
Fill in the following for loop to complete this exercise:
# pre-allocate the memory with one row for each sequence
gc_contents <- data.frame(gc_content = numeric(nrow(_______)))
# loop over sequences using an index for the row and
# store the output in gc_contents
for (i in 1:nrow(__________)) {
  ________[i,] <- get_gc_content(sequences[____])
}