Learning Objectives
Following this assignment students should be able to:
- install and load an R package
- understand the data manipulation functions of
dplyr
- execute a simple import and analyze data scenario
Reading
-
Topics
dplyr
-
Readings
dplyr
vignette- Optional Resources:
- Analyzing data with
dplyr
- If you choose to do this, make sure to download the dataset and load it in R as
surveys
(hint useread.csv
)
- If you choose to do this, make sure to download the dataset and load it in R as
- R for Data Science - Data transformation
- Analyzing data with
Lecture Notes
Exercises
-- dplyr --
Install and familiarize yourself with the
dplyr
package. Thelibrary()
step(s) should always be located at the very top of a script.install.packages("dplyr") library(dplyr) help(package = dplyr)
This vignette is a great reference for data manipulation verbs to keep in mind.
-- Shrub Volume Data Basics --
This is a follow-up to Shrub Volume Data Frame.
Dr. Granger is interested in studying the factors controlling the size and carbon storage of shrubs. This research is part of a larger area of research trying to understand carbon storage by plants. She has conducted a small preliminary experiment looking at the effect of three different treatments on shrub volume at four different locations. She has placed the data file on the web for you to download:
Download this into your
data
folder and get familiar with the data by importing the shrub dimensions data usingread.csv()
and then:- Check the column names in the data using the function
names()
. - Use
str()
to show the structure of the data frame and its individual columns. -
Print out the first few rows of the data using the function
head()
.Use
dplyr
to complete the remaining tasks. - Select the data from the length column and print it out.
- Select the data from the site and experiment columns and print it out.
- Filter the data for all of the plants with heights greater than 5 and print out the result.
- Create a new data frame called
shrub_data_w_vols
that includes all of the original data and a new column containing the volumes, and display it.
- Check the column names in the data using the function
-- Shrub Volume Aggregation --
This is a follow-up to Shrub Volume Data Basics.
Dr. Granger wants some summary data of the plants at her sites and for her experiments. Make sure you have her shrub dimensions data.
This code calculates the average height of a plant at each site:
by_site <- group_by(shrub_dims, site) avg_height <- summarize(by_site, avg_height = mean(height))
- Modify the code to calculate and print the average height of a plant in each experiment.
- Use
max()
to determine the maximum height of a plant at each site.
-- Shrub Volume Join --
This is a follow-up to Shrub Volume Aggregation.
Dr. Granger has kept a separate table that describes the
manipulation
for eachexperiment
. Add the experiments data to yourdata
folder.Import the experiments data and then use
[click here for output]inner_join
to combine it with the shrub dimensions data to add amanipulation
column to the shrub data.-- Fix the Code --
This is a follow-up to Shrub Volume Aggregation. If you haven’t already downloaded the shrub volume data do so now and store it in your
data
directory.The following code is supposed to import the shrub volume data and calculate the average shrub volume for each experiment and, separately, for each site
read.csv("data/shrub-volume-experiment.csv") shrub_data %>% mutate(volume = length * width * height) %>% group_by(site) %>% summarize(mean_volume = max(volume)) shrub_data %>% mutate(volume = length * width * height) group_by(experiment) %>% summarize(mean_volume = mean(volume))
- Fix the errors in the code so that it does what it’s supposed to
- Add a comment to the top of the code explaining what it does
- In a text file, discuss how you know that your fixed version of the code is right and how you would try to make sure it was right if the data file was thousands of lines long
-- Link to Databases --
Let’s access an SQL database directly from R using
dplyr
.Either use an existing copy of the
portal_mammals.sqlite
database or download a new copy. You should be able to link to thesurveys
table in the database using:portaldb <- src_sqlite("portal_mammals.sqlite") surveys <- tbl(portaldb, "surveys")
surveys
will be atbl_sqlite
, which means that the table remains external to the R environment. Also, we won’t need to worry about it printing out huge numbers of rows when we look at it.- Use the
nrow()
function to determine how many records are represented insurveys
. - Select the
year
,month
,day
, andspecies_id
columns in that order. - Create a new data frame with the
year
,species_id
, and weight in kilograms of each individual, with no null weights. - Use the
distinct()
function to print thespecies_id
for each species in the dataset that has been weighed.
- Use the
-- Collect from Databases --
This is a follow-up to Link to Databases.
Make sure you have an existing copy of the
portal_mammals.sqlite
database or download a new copy.Notice how the solution to Link to Databases task 4 is a
<derived table>
and that the number of rows is unknown (??
).- Take this
<derived table>
and make it a local data frame usingcollect()
and determine how many rows it has. - Calculate the average size of a Neotoma Albigula (
NL
) insurveys
and store the result in a local data frame. - Create a local data frame with the number of individuals counted in each year
of the study. If you don’t know how to count things using
dplyr
, take another look at thedplyr
vignette. Vignettes are a great way to reference packages in R if you don’t know the name of a function you need but you know the package has that capability. In this case, you will want to search for ‘count’.
Want a challenge?: Create a local data frame containing the average size of each rodent species for individuals captured on the
[click here for output]Control
plots. You can do this by either creating a connection to thespecies
andplots
tables and usinginner_join
indplyr
or by writing the query inSQL
and usingtbl(portaldb, sql(query))
, wherequery
is a string containing the SQL you want to run. Better yet try doing it both ways.- Take this