Learning Objectives

Following this assignment students should be able to:

  • use for loops to automate function operations
  • understand how to decompose complex problems

Reading

Lecture Notes

  1. Loops
  2. Other Data Structures
  3. Problem Decomposition - Lecture | R-script

Exercises

  1. -- for Loop --

    This is a follow up to Nested Functions.

    1. Now that you’ve impressed Grandma, it’s time to do some serious science. Take the following vector of Stegosaur lengths

      lengths <- c(10.1, 9.5, 11.2, 9.8, 10.4, 12.0, 11.5, 9.5,
      9.8, 10.0, 10.7, 10.2, 11.9, 9.7, 11.2, 11.8, 10.7)
      

      and estimate the mass in kilograms for each length using a for loop, your function for estimating mass, a = 10.95, and b = 2.64. Print the results in order.

    2. This is a good way to learn how to use a for loop, but thanks to vectorization in R we can also just pass the entire lengths vector to our function. Use this approach to estimate the mass for each length and display the output.

    [click here for output] [click here for output]
  2. -- stringr --

    A colleague has produced a file with one DNA sequence on each line. Download the file and load it into R using read.csv(). The file has no header. Name the resulting data frame sequences.

    Your colleague wants to calculate the GC content of each DNA sequence (i.e., the percentage of bases that are either G or C) and knows just a little R. They sent you the following code which will calculate the GC content for a single sequence:

    library(stringr)
    
    sequence <- "attggc"
    Gs <- str_count(sequence, "g")
    Cs <- str_count(sequence, "c")
    gc_content <- (Gs + Cs) / str_length(sequence) * 100 
    

    This code uses the excellent stringr package for working with the sequence data. You’ll need to install this package before using it.

    1. Convert the last three lines of this code into a function to calculate the GC content of a DNA sequence. Name that function get_gc_content.

    2. Use a for loop and your function to calculate the GC content of each sequence and store the results in a new vector. The function should work on a single sequence at a time and the for loop should repeatedly call the function and store the output.

    3. Use a for loop and your function to calculate the GC content of each sequence and store the results in a new data frame. To do this you’ll need to use an index to loop over the rows of the data frame.

      Fill in the following for loop to complete this exercise:

    # pre-allocate the memory with one row for each sequence
    gc_contents <- data.frame(gc_content = numeric(nrow(_______)))
    
    # loop over sequences using an index for the row and
    # store the output in gc_contents
    for (i in 1:nrow(__________)){
      ________[i,] <- get_gc_content(sequences[____])
    }
    
    [click here for output]
  3. -- DNA or RNA --

    Write a function, dna_or_rna(sequence), that determines if a sequence of base pairs is DNA, RNA, or if it is not possible to tell given the sequence provided. Since all the function will know about the material is the sequence the only way to tell the difference between DNA and RNA is that RNA has the base Uracil ("u") instead of the base Thymine ("t"). Have the function return one of three outputs: "DNA", "RNA", or "UNKNOWN".

    1. Use the function and a for loop to print the type of the sequences in the following list.
    2. Use the function and sapply to print the type of the sequences in the following list.
    sequences = c("ttgaatgccttacaactgatcattacacaggcggcatgaagcaaaaatatactgtgaaccaatgcaggcg", "gauuauuccccacaaagggagugggauuaggagcugcaucauuuacaagagcagaauguuucaaaugcau", "gaaagcaagaaaaggcaggcgaggaagggaagaagggggggaaacc", "guuuccuacaguauuugaugagaaugagaguuuacuccuggaagauaauauuagaauguuuacaacugcaccugaucagguggauaaggaagaugaagacu", "gauaaggaagaugaagacuuucaggaaucuaauaaaaugcacuccaugaauggauucauguaugggaaucagccggguc")
    

    Optional: For a little extra challenge make your function work with both upper and lower case letters, or even strings with mixed capitalization

    [click here for output]
  4. -- Multiple Files --

    This is a follow-up to stringr.

    Dr. Jekyll is hard at work to perfect his serum and correct the imbalance with his alter ego, Mr. Hyde. Dr. Jekyll is convinced that some mutation in his DNA is responsible for his transformations and he’s looking in the PATRIC bacterial phytogenomic database for clues. He wants to know the GC content of all of the bacteria in the database and got started working with a handful of archaea. Sadly, his skill with a burner and pipette has not prepared him at all for work on a computer.

    Help him out by downloading the data and looping over the files to determine the GC content for each file. Unzip the the .zip file into your data directory. If you look at the data you’ll see that it’s made up of one file per species using the FASTA dna sequence format. We could try to load it using read.csv, but the ShortRead package in Bioconductor already exists for parsing fasta files, so we’ll use that instead. Install Bioconductor if you haven’t already.

    source("https://bioconductor.org/biocLite.R")
    biocLite("ShortRead")
    

    The following code will then load a single sequence file:

    library(ShortRead)
    reads <- readFasta("data/archaea-dna/A-saccharovorans.fasta")
    seq <- sread(reads)
    

    You can reuse the GC content function you wrote for stringr to calculate the GC content, but you might need to modify it to accommodate the different capitalization of the bases.

    Each file in the zip represents a single archaea species. Use a for loop and your function to calculate the GC content of each file and print them out individually. You might find the list.files() function and the full.names = TRUE argument useful for working with multiple files in a for loop. The function should work on a single file at a time and the for loop should repeatedly call the function and store the results in a data frame with a row for each file and columns for both the file name and GC content.

    Optional: For a little extra challenge change your answer so that instead of printing out the file names it prints out the species name that is encoded in the file name without the ".data/archaea-dna/" path at the beginning and the ".fasta" extension at the end.

    [click here for output]
  5. -- Data Management Review --

    Dr. Granger is interested in studying the relationship between the length of house-elves’ ears and aspects of their DNA. This research is part of a larger project attempting to understand why house-elves possess such powerful magic. She has obtained DNA samples and ear measurements from a small group of house-elves to conduct a preliminary analysis (prior to submitting a grant application to the Ministry of Magic) and she would like you to conduct the analysis for her (she might know everything there is to know about magic, but she sure doesn’t know much about computers). She has placed the data in a file on the web for you to download.

    Write an R script that:

    • Imports the data
    • For each row in the dataset checks to see if the ear length is "large" (>10 cm) or "small" (<=10 cm) and determines the GC-content of the DNA sequence (i.e., the percentage of bases that are either G or C)
    • Stores this information in a table where the first column has the ID for the individual, the second column contains the string "large" or the string "small" depending on the size of the individuals ears, and the third column contains the GC content of the DNA sequence.
    • Exports this table to a csv (comma separated values) file titled grangers_analysis.csv.
    • Prints the average GC-contents for large-eared elves and small-eared elves to the screen.

    As you start to work on more complex problems it’s important to break them down into manageable pieces. One natural way to break this list of things down is: 1) import data; 2) determine size category; 3) determine GC-content; 4) calculate the size category and GC-content for each row of data and store it; 5) export this data to csv; 6) calculate and print the average GC-content for large and small ears.

    Use functions to break the code up into manageable pieces. Remember to document your code well.

    There are several different specific approaches you could take to doing calculations for each row of data. One is to use dplyr using the rowwise() function (here’s an example). Another is to loop over the rows in the data.frame using

    for (row in 1:nrow(data)){...}

    A third is to break the data.frame into vectors and use sapply().

    Ask your instructor if you have questions about the best choices.

    [click here for output] [click here for output]

Check out the solution