This is a follow-up to stringr.
Dr. Jekyll is hard at work to perfect his serum and correct the imbalance with his alter ego, Mr. Hyde. Dr. Jekyll is convinced that some mutation in his DNA is responsible for his transformations and he’s looking in the PATRIC bacterial phytogenomic database for clues. He wants to know the GC content of all of the bacteria in the database and got started working with a handful of archaea. Sadly, his skill with a burner and pipette has not prepared him at all for work on a computer.
Help him out by downloading the data
and looping over the files to determine the GC content for each file. Unzip the
the .zip file into your data
directory. If you look at the data you’ll see
that it’s made up of one file per species using the FASTA dna sequence format. We
could try to load it using read.csv
, but the ShortRead package in Bioconductor
already exists for parsing fasta files, so we’ll use that instead. Install Bioconductor if
you haven’t already.
source("https://bioconductor.org/biocLite.R")
biocLite("ShortRead")
The following code will then load a single sequence file:
library(ShortRead)
reads <- readFasta("data/archaea-dna/A-saccharovorans.fasta")
seq <- sread(reads)
You can reuse the GC content function you wrote for stringr to calculate the GC content, but you might need to modify it to accommodate the different capitalization of the bases.
Each file in the zip represents a single archaea species. Use a for
loop and
your function to calculate the GC content of each file and print them out
individually. You might find the list.files()
function and the
full.names = TRUE
argument useful for working with multiple files in a for
loop. The function should work on a single file at a time and the for
loop
should repeatedly call the function and store the results in a data frame with a
row for each file and columns for both the file name and GC content.
Optional: For a little extra challenge change your answer so that instead of
printing out the file names it prints out the species name that is encoded in
the file name without the ".data/archaea-dna/"
path at the beginning and the
".fasta"
extension at the end.