Remember to
- download
surveys.csv
.- download
species.csv
.download
portal_mammals.sqlite
.- Consider removing the
dplyr
package so you can demonstrate installing it.
- Linux users: you may not want to do this because the source install is slow
INSTALL WARNING: Some Windows users have issues installing
dplyr
receiving errors that the associate files cannot be copied. This is an issue with McAfee anti-virus. Streaming protection should be temporarily disabled to allow installation.
Packages
- Main way that reusable code is shared in R
- Combination of code, data, and documentation
- R has a rich ecosystem of packages for data manipulation & analysis
- Download and install packages with the R console:
install.packages("dplyr")
- Using a package:
- Load all of the functions in the package:
library("dplyr")
- Load all of the functions in the package:
Basic dplyr
- Modern data manipulation library for R
- Does a lot of the same things we’ve learned to do in SQL.
- Start a new project (modeling good practice)
surveys <- read.csv("surveys.csv")
- Select:
select(surveys, year, month, day)
- Filter:
filter(surveys, species_id == "DS")
filter(surveys, species_id == "DS", year > 1995)
- Mutate:
mutate(surveys, hindfoot_length_cm = hindfoot_length / 10)
Aggregation
- Group by:
group_by(surveys, species_id)
- Different looking kind of
data.frame
- Source, grouping, and data type information
surveys_by_species <- group_by(surveys, species_id)
- Grouping with
summarize()
:summarize(surveys_by_species, avg_weight = mean(weight))
- Real data problem:
mean(weight)
whenweight
has missing values (NA
)- Returns
NA
mean(weight, na.rm=TRUE)
- Returns
Joins
inner_join
indplyr
works likeJOIN
in SQL
species <- read.csv("species.csv")
combined <- inner_join(surveys, species, by = "species_id")
head(combined)
Pipes
- Combine a series of data manipulation actions
- Intermediate variables
- Step-wise approach
- Can get cumbersome with lots of variable objects in the environment
surveys_DS <- filter(surveys, species_id == "DS")
surveys_DS_by_yr <- group_by(surveys_DS, year)
avg_weight_DS_by_yr <- summarize(surveys_DS_by_yr,
avg_weight = mean(weight, na.rm=TRUE))
- Pipes:
- Operator:
%>%
- Operation:
%>%
takes the output of one command and passes it as input to the next commandx %>% f(y)
translates tof(x, y)
surveys %>% filter(species_id == "DS")
- Operator:
surveys %>%
filter(species_id == "DS") %>%
group_by(year) %>%
summarize(avg_weight = mean(weight, na.rm=TRUE))
Do Fix the Code.
Using dplyr
with databases
- We can also use
dplyr
to access data directly from a database.- No need to export files from the database
- Lets the database do the heavy lifting
- Faster
- No RAM limits
portaldb <- src_sqlite("portal_mammals.sqlite")
surveys <- tbl(portaldb, "surveys")
surveys
query <- "SELECT year, month, day, genus, species
FROM surveys JOIN species
ON surveys.species_id = species.species_id"
tbl(portaldb, sql(query))
- Speed example using Breeding Bird Survey of North America data
- ~85 million cells (>250 MB)
# Loading from SQLite completes instantly
bbs_sqlite <- src_sqlite("bbs.sqlite")
bbs_counts <- tbl(bbs_sqlite, "bbs_counts")
bbs_counts
# Loading from csv takes 30 seconds
bbs_counts_csv <- read.csv("BBS_counts.csv")
- Queries and data manipulation functions return similar results with various
headings (
Source: query
). - Queries and data manipulation results will remain in the external database.
- Use
collect()
to store results in a local data frame (# A tibble
).