Learning Objectives

Following this assignment students should be able to:

  • understand basic built-in and stringr functions
  • manipulate strings for data analysis

Exercises

  1. -- Print Strings --

    1. Print the following: Post hoc ergo propter hoc

    2. Print the following with no quotes: What’s up with scientists using all of this snooty latin?

    3. Print the following with no quotes and an extra blank line (?cat): Darwin’s “On the origin of species” is a seminal work in biology.

    4. Assign x <- 3, then paste in the appropriate location of the statement: Then shalt thou count to x, no more, no less.
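
    As a minimal sketch of the tools these tasks rely on (using made-up strings rather than the ones above), the following shows how print(), cat(), and paste() differ. The variable names n and sentence are only illustrative.

    ```r
    # print() shows a character string with its surrounding quotes
    print("Carpe diem")

    # quote = FALSE tells print() to leave the quotes off
    print("Carpe diem", quote = FALSE)

    # cat() writes the raw text; each "\n" is a line break, so "\n\n"
    # leaves an extra blank line. Inner double quotes are escaped with \"
    cat("The word \"diem\" means day.\n\n")

    # paste() joins its arguments with a space by default and converts
    # numbers to text along the way
    n <- 5
    sentence <- paste("There are", n, "samples in the tray.")
    print(sentence)
    ```

    Single quotes inside a double-quoted string, such as an apostrophe, do not need to be escaped.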

  2. -- Built-in Functions --

    Use the built-in functions abs(), round(), and sqrt() for the tasks below. A built-in function is one that you can use without installing and loading a package. If you are unsure how to use any of these functions, use another function, help(), to find out. help() takes one parameter, the name of the function you want information about, e.g., help(round).

    1. The absolute value of -15.5.
    2. 4.483847 rounded to one decimal place. The function round() takes two arguments, the number to be rounded and the number of decimal places.
    3. 3.8 rounded to the nearest integer. You don’t have to specify the number of decimal places in this case if you don’t want to, because round() will default to using 0 if the second argument is not provided. Look at help(round) or ?round to see how this is indicated.
    4. Assign the value of the square root of 2.6 to a variable. Then round the variable you’ve created to 2 decimal places and assign it to another variable. Print out the rounded value.
    5. Do the same thing as task 4 (immediately above), but instead of creating the intermediate variable, perform both the square root and the round on a single line by putting the sqrt() call inside the round() call.
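
    The difference between tasks 4 and 5 can be sketched with a different number. This is only an illustration, and the variable names root, root_rounded, and nested are assumptions.

    ```r
    # abs() drops the sign of a number
    print(abs(-2.5))

    # Intermediate-variable version: store the square root, then round it
    root <- sqrt(10)
    root_rounded <- round(root, 3)
    print(root_rounded)

    # Nested version: the inner call sqrt(10) runs first and its result
    # becomes the first argument of round()
    nested <- round(sqrt(10), 3)
    print(nested)
    ```

    Both versions give the same value. Nesting saves a variable, while the intermediate variable can make each step easier to check.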
  3. -- Built-in Functions --

    Use the built-in character functions tolower() and toupper() to manipulate and print the following strings.

    1. "species" in all capital letters
    2. "SPECIES" in all lower case letters
  4. -- stringr Functions --

    Use the character functions from the package stringr to print the following strings.

    1. "atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc". Do this by duplicating “atgc” 15 times.
    2. " Thank goodness it's Friday" without the leading white space (i.e., without the spaces before "Thank").
    3. "gcagtctgaggattccaccttctacctgggagagaggacatactatatcgcagcagtggaggtggaatgg" with all of the occurrences of "a" replaced with "A".
    4. Print the length of this DNA sequence "gccgatgtacatggaatatacttttcaggaaacacatatctgtggagagg".
    5. The number of "a"s in "gccgatgtacatggaatatacttttcaggaaacacatatctgtggagagg".
    6. Print the first 20 positions of this DNA sequence "gccgatgtacatggaatatacttttcaggaaacacatatctgtggagagg".
    7. Print the last 10 positions of this DNA sequence "gccgatgtacatggaatatacttttcaggaaacacatatctgtggagagg".
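
    The stringr functions that are likely to be useful here can be tried out on a short, made-up sequence first. This sketch assumes the stringr package is installed. The sequence and the variable names are only illustrative.

    ```r
    library(stringr)  # assumes the stringr package is installed

    seq_toy <- "  acgtacgtaa"                       # made-up sequence with two leading spaces

    trimmed  <- str_trim(seq_toy, side = "left")    # remove leading white space
    repeated <- str_dup("ac", 3)                    # repeat "ac" three times
    swapped  <- str_replace_all(trimmed, "a", "A")  # replace every "a" with "A"
    n_chars  <- str_length(trimmed)                 # number of characters
    n_a      <- str_count(trimmed, "a")             # number of matches of "a"
    first_4  <- str_sub(trimmed, 1, 4)              # positions 1 through 4
    last_3   <- str_sub(trimmed, -3, -1)            # negative positions count from the end

    print(c(trimmed, repeated, swapped, first_4, last_3))
    print(c(n_chars, n_a))
    ```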
  5. -- Strings and Math --

    The length of an organism is typically strongly correlated with its body mass. This is useful because it allows us to estimate the mass of an organism even if we only know its length. This relationship generally takes the form

    Mass (kg) = a * Length (m)^b

    where the parameters a and b vary among groups. Write a script that prompts the user for the following pieces of information:

    1. genus name
    2. species name
    3. the length of the species

    and then estimates the mass of the organism using the equation above. The script should paste the result as:

    Genus species is length meters long and weighs approximately mass kg.

    where Genus, species, length, and mass are replaced with the appropriate values. As is standard practice, the first letter (and only the first letter) of the genus name should be capitalized, and the species name should appear in all lowercase letters, regardless of how they were typed when input.

    An allometric approach is regularly used to estimate the mass of dinosaurs since we cannot typically weigh something that is only preserved as bones. I’ll be testing your script using the length of a Spinosaurus (Spinosaurus aegyptiacus), which is 16 m long based on its reassembled skeleton. So, use the values of a and b for Theropoda (the appropriate dinosaur clade): a has been estimated as 0.73 and b has been estimated as 3.63 (Seebacher 2001). Spinosaurus is a predator that is bigger, and therefore, by definition, cooler, than that stupid Tyrannosaurus that everyone likes so much.
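
    One possible way to organize the pieces of this script is sketched below using base R only. The helper names estimate_mass and capitalize_first are hypothetical, and the genus, species, and length are fixed illustrative values (not measurements) so that the example runs without prompting. In an interactive session, readline() could supply them instead. Note that readline() returns text, so a length read that way would need as.numeric().

    ```r
    # Hypothetical helper: mass in kg from length in m, Mass = a * Length^b
    estimate_mass <- function(length_m, a, b) {
      a * length_m^b
    }

    # Hypothetical helper: upper-case the first letter, lower-case the rest
    capitalize_first <- function(word) {
      paste0(toupper(substr(word, 1, 1)),
             tolower(substr(word, 2, nchar(word))))
    }

    # Illustrative stand-ins for user input, deliberately in messy case
    genus <- "aLLOSAURUS"
    species <- "FRAGILIS"
    length_m <- 8.5   # illustrative value only, not a measurement

    # Theropoda parameters given in the exercise (Seebacher 2001)
    mass_kg <- estimate_mass(length_m, a = 0.73, b = 3.63)

    result <- paste(capitalize_first(genus), tolower(species), "is", length_m,
                    "meters long and weighs approximately", round(mass_kg), "kg.")
    print(result)
    ```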

  6. -- Long Strings --

    For the DNA sequence below, determine the following properties and print them to the screen (you can cut and paste the following into your code; it’s a lot longer than you can see on the screen, but just select the whole thing and you’ll see what it looks like when you paste it into R):

    dna="ttcacctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgtgtgtctagctaagatgtattattctgctgtggatcccactaaagatatattcactgggcttattgggccaatgaaaatatgcaagaaaggaagtttacatgcaaatgggagacagaaagatgtagacaaggaattctatttgtttcctacagtatttgatgagaatgagagtttactcctggaagataatattagaatgtttacaactgcacctgatcaggtggataaggaagatgaagactttcaggaatctaataaaatgcactccatgaatggattcatgtatgggaatcagccgggtctcactatgtgcaaaggagattcggtcgtgtggtacttattcagcgccggaaatgaggccgatgtacatggaatatacttttcaggaaacacatatctgtggagaggagaacggagagacacagcaaacctcttccctcaaacaagtcttacgctccacatgtggcctgacacagaggggacttttaatgttgaatgccttacaactgatcattacacaggcggcatgaagcaaaaatatactgtgaaccaatgcaggcggcagtctgaggattccaccttctacctgggagagaggacatactatatcgcagcagtggaggtggaatgggattattccccacaaagggagtgggattaggagctgcatcatttacaagagcagaatgtttcaaatgcatttttagataagggagagttttacataggctcaaagtacaagaaagttgtgtatcggcagtatactgatagcacattccgtgttccagtggagagaaaagctgaagaagaacatctgggaattctaggtccacaacttcatgcagatgttggagacaaagtcaaaattatctttaaaaacatggccacaaggccctactcaatacatgcccatggggtacaaacagagagttctacagttactccaacattaccaggtaaactctcacttacgtatggaaaatcccagaaagatctggagctggaacagaggattctgcttgtattccatgggcttattattcaactgtggatcaagttaaggacctctacagtggattaattggccccctgattgtttgtcgaagaccttacttgaaagtattcaatcccagaaggaagctggaatttgcccttctgtttctagtttttgatgagaatgaatcttggtacttagatgacaacatcaaaacatactctgatcaccccgagaaagtaaacaaagatgatgaggaattcatagaaagcaataaaatgcatgctattaatggaagaatgtttggaaacct"

    1. How long is the sequence?
    2. How many times does "gagg" occur in the sequence?
    3. What is the starting position of the first occurrence of "atta"?
    4. What is the GC content of the sequence? The GC content is the percentage of bases that are either G or C (as a percentage of total base pairs). Paste the result as “The GC content of this sequence is XX.XX%” where XX.XX is the actual GC content.
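
    Before working on the full sequence, it can help to check the approach on a short, made-up sequence whose answers are easy to verify by eye. This sketch assumes stringr is installed. The sequence toy and the variable names are only illustrative.

    ```r
    library(stringr)  # assumes the stringr package is installed

    toy <- "atgcgcatta"   # made-up 10-base sequence

    n_motif <- str_count(toy, "gc")                     # times "gc" occurs
    first_pos <- str_locate(toy, "atta")[1, "start"]    # start of the first "atta"

    # GC content: G and C bases as a percentage of all bases
    gc_percent <- (str_count(toy, "g") + str_count(toy, "c")) / str_length(toy) * 100

    # round() fixes the value; format(nsmall = 2) keeps two decimal places in the text
    message_text <- paste0("The GC content of this sequence is ",
                           format(round(gc_percent, 2), nsmall = 2), "%")
    print(message_text)
    ```

    str_locate() returns a matrix with "start" and "end" columns for the first match only, which is why a single row is indexed here.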
  7. -- Strings from Data --

    A colleague has produced a file with one DNA sequence on each line. Download the file and load it into R using read.csv(). The file has no header and is separated by white space (" ").

    Calculate the GC content of each sequence. The GC content is the percentage of bases that are either G or C (as a percentage of total base pairs). Print each GC content in order to the screen (in %).
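
    The loading step might look something like the sketch below. Because the real file is not reproduced here, two made-up sequences stand in for it, and the text argument lets read.csv() parse a string as if it were a file. With a downloaded file, the file path would take the place of text.

    ```r
    # Two made-up sequences standing in for the contents of the file
    file_contents <- "atgcgc\nttaagg"

    # header = FALSE because the first line is data, not column names
    sequences <- read.csv(text = file_contents, header = FALSE, sep = " ",
                          stringsAsFactors = FALSE)

    # Without a header, read.csv() names the single column V1
    first_seq <- sequences$V1[1]

    # Row and column indexing reaches the same values
    second_seq <- sequences[2, 1]

    print(first_seq)
    print(second_seq)
    ```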

  8. -- String Data --

    This is a follow up to Strings from Data.

    A colleague has produced a file with one DNA sequence on each line. Download the file and load it into R using read.csv(). The file has no header.

    Write a function to calculate GC content. GC content is the percentage of bases that are either G or C as a percentage of total base pairs. Your function should take a DNA sequence as input and return the GC content of that sequence. Print the result for each sequence.
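
    As a rough sketch of the general shape such a function could take, the version below uses base R only. The name get_gc_content is hypothetical, and splitting the sequence into single characters is just one of several reasonable approaches.

    ```r
    # Hypothetical function: percentage of bases in a sequence that are G or C
    get_gc_content <- function(sequence) {
      bases <- strsplit(tolower(sequence), "")[[1]]  # split into single characters
      gc_count <- sum(bases %in% c("g", "c"))        # count the G and C bases
      gc_count / length(bases) * 100                 # return the percentage
    }

    # Made-up sequences that are easy to check by eye
    print(get_gc_content("atgc"))
    print(get_gc_content("ggcc"))
    ```

    The value of the last expression in the function body is what the function returns, so no explicit return() is needed here.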

    Before we knew about functions, we had to take each DNA sequence one at a time and then rewrite or copy-paste the same code to analyze each one. Isn’t this better?

    You may have noticed that a for loop prints the results differently here: read.csv() imports the data as a data.frame, unlike the numeric vector in the previous exercise.

  9. -- Improve Your Code --

    This is a follow up to String Data.

    A colleague has produced a file with one DNA sequence on each line. So far you’ve been manually extracting each DNA sequence and calculating its GC content, which has worked OK with five sequences, but isn’t going to work very well when the sequencer really gets going and you have to handle hundreds to thousands of sequences.

    Use a for loop and your function from String Data to calculate the GC content of each sequence and print them out. The function should work on a single sequence at a time and the for loop should repeatedly call the function and print out the result.
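
    The loop pattern described above can be sketched as follows. The function get_gc_content is a hypothetical stand-in for your own function, and a small made-up data frame takes the place of the one read.csv() would produce.

    ```r
    # Hypothetical stand-in for the function from String Data
    get_gc_content <- function(sequence) {
      bases <- strsplit(tolower(sequence), "")[[1]]
      sum(bases %in% c("g", "c")) / length(bases) * 100
    }

    # Made-up data shaped like a headerless file read with read.csv()
    sequences <- data.frame(V1 = c("atgc", "ggcc", "atat"),
                            stringsAsFactors = FALSE)

    # Pre-allocate a vector to hold one result per sequence
    results <- numeric(nrow(sequences))

    # Call the function on one sequence per pass through the loop
    for (i in seq_len(nrow(sequences))) {
      results[i] <- get_gc_content(sequences$V1[i])
      print(results[i])
    }
    ```

    Looping over seq_len(nrow(sequences)) means the same code handles five sequences or five thousand without changes.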

  10. -- Split Strings --

    You have a data file with a single "taxonomy" column in it. This column contains the family, genus, and species for a single taxonomic group. You need to figure out how to split that information into separate values for family, genus, and species. To solve the basic problem take a single example string, "Ornithorhynchidae Ornithorhynchus anatinus", split it into three separate strings using a stringr command, and then print the family, genus, and species, each on a separate line.
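
    One stringr function that splits a string on a separator is str_split(). The sketch below applies it to a different taxonomy string and assumes stringr is installed. The variable names are only illustrative.

    ```r
    library(stringr)  # assumes the stringr package is installed

    taxonomy <- "Felidae Panthera leo"   # a different example string

    # str_split() returns a list with one element per input string;
    # [[1]] pulls out the character vector of pieces for the first one
    parts <- str_split(taxonomy, " ")[[1]]

    family <- parts[1]
    genus <- parts[2]
    species <- parts[3]

    # sep = "\n" puts each value on its own line
    cat(family, genus, species, sep = "\n")
    ```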
