This document contains the solutions for Data Carpentry for Ecologists (ECOL 8030) week 8: Programming Fundamentals 2.
For this, you need a function from last week.
get_mass_from_length <- function(length, a,b){
#' allometric mass equation
#' takes the length of any dinosaur and returns its mass based on an allometric approach (mass= a * length ^ b)
#' @param length length of the dinosaur in meters
#' @param a scaling of allometric equation
#' @param b exponent in allometric equation
#' @return mass of the dinosaur in kg
mass <- a * length ** b
return(mass)
}
Now we can run a for-loop over the lengths
lengths <- c(10.1, 9.5, 11.2, 9.8, 10.4, 12.0, 11.5, 9.5, 9.8, 10.0, 10.7, 10.2, 11.9, 9.7, 11.2, 11.8, 10.7)
for (i in lengths){
mass <- get_mass_from_length(length=i, a=10.95, b=2.64)
print(mass)
}
## [1] 4907.073
## [1] 4174.5
## [1] 6446.882
## [1] 4531.594
## [1] 5301.297
## [1] 7734.863
## [1] 6912.839
## [1] 4174.5
## [1] 4531.594
## [1] 4779.848
## [1] 5714.619
## [1] 5036.381
## [1] 7565.856
## [1] 4410.537
## [1] 6446.882
## [1] 7399.163
## [1] 5714.619
Or just try applying the function to a vector:
get_mass_from_length(length=lengths, a=10.95, b=2.64)
## [1] 4907.073 4174.500 6446.882 4531.594 5301.297 7734.863 6912.839
## [8] 4174.500 4531.594 4779.848 5714.619 5036.381 7565.856 4410.537
## [15] 6446.882 7399.163 5714.619
They are the same!
Read the csv:
sequences <- read.csv("https://mvevans89.github.io/ECOL-8030/data/dna-sequences-1.txt", stringsAsFactors = F) #note we need the strings to not be factors
This code was sent by a colleague:
library(stringr)
sequence <- "attggc"
Gs <- str_count(sequence, "g")
Cs <- str_count(sequence, "c")
gc_content <- (Gs + Cs) / str_length(sequence) * 100
library(stringr) #load package
#' Calculate GC content of a sequence
#' @param sequence dna sequence
#' @return the percentage of g and c in the sequence
getGCcontent <- function(sequence){
Gs <- str_count(sequence, "g")
Cs <- str_count(sequence, "c")
gc_content <- (Gs + Cs) / str_length(sequence) * 100
return(gc_content)
}
We also need to store the results in a new vector.
content <- c() #create empty object to write too. not 100% sure this is necessary
for (i in 1:nrow(sequences)){ #iterate over each row
seq <- sequences[i,1] # select out the sequence to use in this for loop
content[i]<- getGCcontent(seq) #write to the appropriate spot in the empty object
}
To do this, we just fill in the code that was provided:
# pre-allocate the memory with one row for each sequence
gc_contents <- data.frame(gc_content = numeric(nrow(sequences)))
# loop over sequences using an index for the row and
# store the output in gc_contents
for (i in 1:nrow(sequences)){
gc_contents[i,] <- getGCcontent(sequences[i,1])
}
Write a function, dna_or_rna(sequence), that determines if a sequence of base pairs is DNA, RNA, or if it is not possible to tell given the sequence provided. Since all the function will know about the material is the sequence the only way to tell the difference between DNA and RNA is that RNA has the base Uracil (“u”) instead of the base Thymine (“t”). Have the function return one of three outputs: “DNA”, “RNA”, or “UNKNOWN”.
#' DNA or RNA
#' This function determines if a sequences is DNA or RNA based on u or t
#' @param sequence sequence you want to test, must be in lowercase
#' @return "DNA", "RNA", or "UNKNOWN"
dna_or_rna <- function(sequence){
if (str_count(sequence, "t")>0){ #use str_count to count number of t's
return("DNA")
} else if (str_count(sequence, "u")>0){
return("RNA")
} else return("UNKNOWN") #returns unknown if neither t nor u are there
}
We can also make it so that it works even if the sequences are reported in uppercase.
#' DNA or RNA
#' This function determines if a sequences is DNA or RNA based on u or t
#' @param sequence sequence you want to test, accepts all cases
#' @return "DNA", "RNA", or "UNKNOWN"
dna_or_rna <- function(sequence){
if (str_count(tolower(sequence), "t")>0){ #tolower switches strings to lowercase
return("DNA")
} else if (str_count(tolower(sequence), "u")>0){
return("RNA")
} else return("UNKNOWN")
}
sequences = c("ttgaatgccttacaactgatcattacacaggcggcatgaagcaaaaatatactgtgaaccaatgcagcg", "gauuauuccccacaaagggagugggauuaggagcugcaucauuuacaagagcagaauguuucaaaugcau","gaaagcaagaaaaggcaggcgaggaagggaagaagggggggaaacc", "guuuccuacaguauuugaugagaaugagaguuuacuccuggaagauaauauuagaauguuuacaacugcaccugaucagguggauaaggaagaugaagacu", "gauaaggaagaugaagacuuucaggaaucuaauaaaaugcacuccaugaauggauucauguaugggaaucagccggguc")
for (sq in sequences){
print(dna_or_rna(sq))
}
## [1] "DNA"
## [1] "RNA"
## [1] "UNKNOWN"
## [1] "RNA"
## [1] "RNA"
sapply
statementsapply(sequences, dna_or_rna)
## ttgaatgccttacaactgatcattacacaggcggcatgaagcaaaaatatactgtgaaccaatgcagcg
## "DNA"
## gauuauuccccacaaagggagugggauuaggagcugcaucauuuacaagagcagaauguuucaaaugcau
## "RNA"
## gaaagcaagaaaaggcaggcgaggaagggaagaagggggggaaacc
## "UNKNOWN"
## guuuccuacaguauuugaugagaaugagaguuuacuccuggaagauaauauuagaauguuuacaacugcaccugaucagguggauaaggaagaugaagacu
## "RNA"
## gauaaggaagaugaagacuuucaggaaucuaauaaaaugcacuccaugaauggauucauguaugggaaucagccggguc
## "RNA"
First, check out the wikipedia page on archaea because wow are they cool. You’ll need this data for the exercise.
Then install the necessary software (this can take a while):
source("https://bioconductor.org/biocLite.R")
biocLite("ShortRead")
Why is this different than normally installing packages? Bioconductor, from what I can tell, is a collection of Biostats tools and packages for R. The first line of the code above is the function that runs the download of a package called ShortRead
. This package reads .fasta
files, which I guess are some kind of genetics thing??
library(ShortRead)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, cbind, colnames,
## do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff,
## sort, table, tapply, union, unique, unsplit, which, which.max,
## which.min
## Loading required package: BiocParallel
## Loading required package: Biostrings
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## colMeans, colSums, expand.grid, rowMeans, rowSums
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: Rsamtools
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: GenomicAlignments
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
reads <- readFasta("data/archaea-dna/A-saccharovorans.fasta")
seq <- sread(reads)
#if you're curious about what this is, just str it
str(seq)
## Formal class 'DNAStringSet' [package "Biostrings"] with 5 slots
## ..@ pool :Formal class 'SharedRaw_Pool' [package "XVector"] with 2 slots
## .. .. ..@ xp_list :List of 1
## .. .. .. ..$ :<externalptr>
## .. .. ..@ .link_to_cached_object_list:List of 1
## .. .. .. ..$ :<environment: 0x579d7a0>
## ..@ ranges :Formal class 'GroupedIRanges' [package "XVector"] with 7 slots
## .. .. ..@ group : int 1
## .. .. ..@ start : int 1
## .. .. ..@ width : int 1496453
## .. .. ..@ NAMES : NULL
## .. .. ..@ elementType : chr "integer"
## .. .. ..@ elementMetadata: NULL
## .. .. ..@ metadata : list()
## ..@ elementType : chr "DNAString"
## ..@ elementMetadata: NULL
## ..@ metadata : list()
head(seq)
## A DNAStringSet instance of length 1
## width seq
## [1] 1496453 GTTTCAACACCATTCCTTGGTTTCCCATTG...CACAGGCCTCCCCTGAGGCCCAGACATCCA
We need to modify our above gccontent function to take both upper and lower case.
#' Calculate GC content of a sequence
#' @param sequence dna sequence
#' @return the percentage of g and c in the sequence
getGCcontent <- function(sequence){
Gs <- str_count(tolower(sequence), "g")
Cs <- str_count(tolower(sequence), "c")
gc_content <- (Gs + Cs) / str_length(sequence) * 100
return(gc_content)
}
Now we will write a for loop to get this for all of the files in that zip file.
#get list of files in directory. we need the full names so we can load them later
files <- list.files(path="data/archaea-dna", full.names=T)
#get only the file names for putting species in later
filenames <- list.files(path="data/archaea-dna", full.names=F)
#create empty dataframe
archaeaData <- data.frame(Species=character(), gcContent=numeric(), stringsAsFactors = F)
for (i in 1:length(files)){
#read in fasta files and format them as above
reads <- readFasta(files[i])
seq <- sread(reads)
#get species name using gsub
archaeaData[i,1] <- gsub(".fasta", "", filenames[i]) #use gsub to replace .fasta with nothing
archaeaData[i,2] <- getGCcontent(seq)
}
The way I went about getting the shorter filenames was not the best. Ideally, one would use a regex expression like grep
to select a certain subset of the character string, but regex confuses me a lot. If you’re interested in regex, this website allows you to look at your regex commands interactively.
ears <- read.csv("https://mvevans89.github.io/ECOL-8030/data/houseelf-earlength-dna-data.csv")
I chose to do this using rowwise()
in dplyr
. This seems the most efficient method, but may be confusign because it has a lot of moving parts.
#create a function that sorts ears with an if-else statement
#' Ear Sorting
#' This function identifies ears as large (>10cm) or small (<= 10cm)
#' @param length the length of an elf's ears in cm
#' @return if the ear is large or small
earSort <- function(length){
if (length>10){
return("large")
} else return("small")
}
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:ShortRead':
##
## id
## The following objects are masked from 'package:GenomicAlignments':
##
## first, last
## The following object is masked from 'package:Biobase':
##
## combine
## The following objects are masked from 'package:GenomicRanges':
##
## intersect, setdiff, union
## The following object is masked from 'package:GenomeInfoDb':
##
## intersect
## The following objects are masked from 'package:Biostrings':
##
## collapse, intersect, setdiff, setequal, union
## The following object is masked from 'package:XVector':
##
## slice
## The following objects are masked from 'package:IRanges':
##
## collapse, desc, intersect, regroup, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':
##
## first, intersect, rename, setdiff, setequal, union
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
earsProcessed <- ears %>%
rowwise() %>% #apply this function by row
mutate(size=earSort(earlength)) %>% #add a new column sorting using the earSort function
mutate(GC=getGCcontent(dnaseq)) %>% #create column of GC content
select(-earlength, - dnaseq)
earsProcessed
## Source: local data frame [10 x 3]
## Groups: <by row>
##
## # A tibble: 10 × 3
## id size GC
## <fctr> <chr> <dbl>
## 1 17A small 41
## 2 24P small 39
## 3 09Q large 57
## 4 65Y small 40
## 5 19N small 36
## 6 92K large 62
## 7 33W small 52
## 8 98C large 63
## 9 75G small 47
## 10 88Q large 52
write.csv(earsProcessed, "grangers_analysis.csv")
means <- earsProcessed %>%
group_by(size) %>%
summarise(meanGC=mean(GC))
## Warning: Grouping rowwise data frame strips rowwise nature
means
## # A tibble: 2 × 2
## size meanGC
## <chr> <dbl>
## 1 large 58.5
## 2 small 42.5
The most useful thing I learned this week was the rowwise()
call to dplyr
, which I hadn’t known before. Often I want to add a column that is conditional on values in another and have struggled about the workflow to go about this. Creating a conditional function, and then using that rowwise inside dplyr
seems like an easy and straightforward way to do this, and fits will with my workflow already using dplyr
.