Learning Objectives
Following this assignment students should be able to:
- create an SQL database by importing data
- understand the basic query structure of SQL
- execute SQL commands to select, sort, group, and aggregate data
Reading
- Introduction
- Basic Queries
Lecture Notes
- Demo Code for Where Students Can Get in the Course
- Introduction to Databases
- Basic Queries
- Aggregation
Exercises
-- Importing Data --
This example will walk you through how to get data that already exists into SQLite.
- Download the main table for the Portal LTREB mammal survey database. It’s kind of large so it might take a few seconds. This database is published as a Data Paper on Ecological Archives, which is generally a great place to look for ecology data.
- Create a new database by clicking on New Database in the Database drop-down menu. Select a file name, like portal_mammals.sqlite, and location.
- Click on the Import icon.
- Click on Select File and navigate to where you saved the data file and select it.
- Select CSV. You’ll notice that you can also import from other SQL or modify the Fields separated or enclosed by. You’ll want to make sure to select First row contains column names.
- Click OK when it asks if you want to modify the data.
- Name the table that you are importing into surveys.
- Identify the type for each field, using the Data Type drop-down menus. If it is not obvious if the data type is an INTEGER or VARCHAR for each variable, check the metadata. Important: if you specify the wrong data type it can cause some data to not be imported and/or prevent you from doing some kinds of data manipulations.
- Select recordID as the Primary Key and click OK.
- Click OK when it asks if you are sure you want to import the data.
- Now import the plots and species tables.
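For readers curious about what the import dialog does behind the scenes, the steps above correspond roughly to creating a table with declared column types and a primary key, then loading the CSV rows into it. The following is only a minimal sketch using Python's standard csv and sqlite3 modules, not the tool's actual implementation. The CSV text is a made-up stand-in for the downloaded file, the helper name import_csv is hypothetical, and the column list is an assumption that covers only a few of the real fields.

```python
import csv
import io
import sqlite3

# Made-up stand-in for the downloaded CSV; the real file has more columns and rows.
CSV_TEXT = """recordID,year,species_id,weight
1,1977,DM,40
2,1977,DS,
3,1978,PF,8
"""


def import_csv(conn, table, text):
    """Create `table` with declared types and load the CSV rows into it (hypothetical helper)."""
    conn.execute(
        f"CREATE TABLE {table} ("
        "recordID INTEGER PRIMARY KEY, "
        "year INTEGER, "
        "species_id VARCHAR, "
        "weight INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(text))  # first row supplies the column names
    rows = [
        # Store empty CSV cells as NULL instead of empty strings.
        tuple(value if value != "" else None for value in row.values())
        for row in reader
    ]
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # pass a file path instead to keep the database on disk
imported = import_csv(conn, "surveys", CSV_TEXT)
print(imported, conn.execute("SELECT * FROM surveys").fetchall())
```

Because the columns are declared as INTEGER, SQLite stores the numeric CSV text as integers, which is what later sorting and arithmetic rely on.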
-- SELECT --
For this and many of the following problems you will create queries that retrieve the relevant information from the Portal small mammal survey database. As you begin to familiarize yourself with the database you will need to know some details regarding what is in this database in order to answer the questions. For example, you may need to know what species is associated with the two character species ID or you may need to know the units for the individual’s weight. This type of information associated with data is called metadata and the metadata for this dataset is available online at Ecological Archives.
- Write a query that displays all of the records for all of the fields (*) in the main table. Save it as a view named all_survey_data.
- We want to generate data for an analysis of body size differences between males and females of each species. We have decided that we can ignore the information related to when and where the individuals were trapped. Create a query that returns all of the necessary information, but nothing else. Save this as size_differences_among_sexes_data.
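In SQL terms, saving a query under a name usually means creating a view. As a rough illustration of the pattern, here is a minimal sketch that runs the SQL through Python's sqlite3 module. The table demo_surveys, the view demo_species_weights, and all rows are invented for this example and are not part of the Portal database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical miniature table; the rows are invented for illustration.
conn.execute(
    "CREATE TABLE demo_surveys ("
    "recordID INTEGER PRIMARY KEY, year INTEGER, species_id VARCHAR, weight INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_surveys VALUES (?, ?, ?, ?)",
    [(1, 1977, "DM", 40), (2, 1978, "PF", 8)],
)

# Naming the columns after SELECT returns only those columns; * would return all of them.
conn.execute(
    "CREATE VIEW demo_species_weights AS "
    "SELECT species_id, weight FROM demo_surveys"
)

before = conn.execute("SELECT * FROM demo_species_weights").fetchall()

# A view stores the query rather than a copy of the data,
# so rows added to the table later show up in the view as well.
conn.execute("INSERT INTO demo_surveys VALUES (3, 1979, 'DS', 120)")
after = conn.execute("SELECT * FROM demo_species_weights").fetchall()

print(len(before), len(after))
```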
-- WHERE --
A population biologist (Dr. Undomiel) who studies the population dynamics of Dipodomys spectabilis would like to use some data from Portal, but she doesn’t know how to work with large datasets. Being the kind and benevolent person that you are, write a query to extract the data that she needs. She wants only the data for her species of interest, when each individual was trapped, and what sex it was. She doesn’t care about where it was trapped within the site because she is going to analyze the entire site as a whole and she doesn’t care about the size of the individuals. She doesn’t need the species codes because you’re only providing her with the data for one species, and since she isn’t looking at the database itself the two-character abbreviation would probably be confusing. Save this query as a view with the name spectabilis_population_data.
Scroll through the results of your query. Do you notice anything that might be an issue for the scientist to whom you are providing this data? You should! Think about what you should do in this situation…
You decide that to avoid invoking her wrath, you’ll send her a short e-mail* requesting clarification regarding what she would like you to do regarding this complexity. Dr. Undomiel e-mails you back and asks that you create two additional queries so that she can decide what to do about this issue later. She would like you to add a query with the same data as above, but only for cases where the sex is known to be male, and an additional query with the same data, but only where the sex is known to be female. Save these as views with the names spectabilis_population_data_males and spectabilis_population_data_females.
*Short for elven-mail
-- ORDER BY --
The graduate students that work at the Portal site are hanging out late one evening drinking… soda pop… and they decide it would be an epically awesome idea to put together a list of the 100 largest rodents ever sampled at the site. Since you’re the resident computer genius they text you, and since you’re up late working and this sounds like a lot more fun than the homework you’re working on (which isn’t really saying much, if you know what I’m saying) you decide you’ll make the list for them.
The rules that the Portal students have come up with (and they did spend a sort of disturbingly long time coming up with these rules; I guess you just had to be there) are:
- The data should include the species_id, year, and the weight. These columns should be presented in this order.
- Individuals should be sorted in descending order with respect to mass.
- Since individuals often have the same mass, ties should be settled by sorting next by hindfoot_length and finally by the year.
Since you need to limit this list to the 100 largest rodents, you’ll need to add the SQL command LIMIT 100 to the end of the query. Save the final query as 100_largest_individuals.
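The sorting rules above map onto an ORDER BY clause with several keys, where each later key only breaks ties left by the earlier ones. Below is a minimal sketch on an invented five-row table, run through Python's sqlite3 module. The table and view names are hypothetical, and the descending direction of the two tie-breakers is an assumption, since the rules do not state one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Invented rows: recordID, species_id, year, weight, hindfoot_length.
conn.execute(
    "CREATE TABLE demo_sizes ("
    "recordID INTEGER PRIMARY KEY, species_id VARCHAR, year INTEGER, "
    "weight INTEGER, hindfoot_length INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_sizes VALUES (?, ?, ?, ?, ?)",
    [
        (1, "DS", 1980, 120, 50),
        (2, "DS", 1985, 120, 52),
        (3, "DM", 1990, 45, 36),
        (4, "DS", 1979, 130, 51),
        (5, "DS", 1990, 120, 52),
    ],
)

# Sort keys are applied left to right; LIMIT keeps only the first rows of the sorted result.
# A view name that starts with a digit has to be written in double quotes.
conn.execute(
    'CREATE VIEW "3_largest_demo" AS '
    "SELECT species_id, year, weight FROM demo_sizes "
    "ORDER BY weight DESC, hindfoot_length DESC, year DESC "
    "LIMIT 3"
)

largest = conn.execute('SELECT * FROM "3_largest_demo"').fetchall()
print(largest)
```

The sort columns do not have to appear in the output: hindfoot_length settles ties here even though it is not selected.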
-- DISTINCT --
Write a query that returns a list of the dates that mammal surveys took place at Portal with no duplicates. Save it as dates_sampled.
-- Missing Data --
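As background for this exercise: SQLite represents a missing value as NULL, and an ordinary comparison such as = never matches NULL, so missing values are tested with IS NULL and IS NOT NULL. Below is a minimal sketch on an invented table, run through Python's sqlite3 module. The table name demo_weights and its rows are hypothetical. It also assumes the import stored empty cells as NULL, not as empty strings, which is worth checking in your own database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Invented rows; None is stored as NULL.
conn.execute(
    "CREATE TABLE demo_weights ("
    "recordID INTEGER PRIMARY KEY, species_id VARCHAR, weight INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_weights VALUES (?, ?, ?)",
    [(1, "DM", 40), (2, "DS", None), (3, None, 8)],
)

# Comparing with = NULL is never true, so this matches no rows.
equals_null = conn.execute(
    "SELECT COUNT(*) FROM demo_weights WHERE weight = NULL"
).fetchone()[0]

# IS NULL is the test that actually finds missing values.
is_null = conn.execute(
    "SELECT COUNT(*) FROM demo_weights WHERE weight IS NULL"
).fetchone()[0]

# Combining IS NOT NULL conditions with AND keeps only fully populated records.
complete = conn.execute(
    "SELECT recordID FROM demo_weights "
    "WHERE species_id IS NOT NULL AND weight IS NOT NULL "
    "ORDER BY recordID"
).fetchall()

print(equals_null, is_null, complete)
```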
Write a query that returns the year, month, day, species_id, and weight for every record where there is no missing data in any of these fields. Save it as no_missing_data.
-- GROUP BY --
Using GROUP BY, write a query that returns a list of dates on which individuals of the species Dipodomys spectabilis (indicated by the DS species code) were trapped (with no duplicates). Sort the list in chronological order (from oldest to newest). Save it as dates_with_dipodomys_spectabilis.
-- COUNT --
Write a query that returns the number of individuals of all known species combined (total_abundance) in each year, sorted chronologically. Include the year in the output. Save it as total_abundance_by_year.
-- SUM --
Write a query that returns the number of individuals of each species captured in each year (total_abundance) and the total_biomass of those individuals (the sum of the weight). The units for biomass should be in kilograms. Include the year and species_id in the output. Sort the result chronologically by year and then alphabetically by species. Save as mass_abundance_data.
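The general shape of a grouped aggregate can be sketched on a small invented table, again run through Python's sqlite3 module. The table name demo_captures and its rows are hypothetical. The example also assumes weights are recorded in grams, which should be confirmed against the metadata before converting units.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Invented rows: year, species_id, weight (assumed to be in grams).
conn.execute(
    "CREATE TABLE demo_captures (year INTEGER, species_id VARCHAR, weight INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_captures VALUES (?, ?, ?)",
    [
        (1977, "DM", 40),
        (1977, "DM", 44),
        (1977, "DS", 120),
        (1978, "DM", None),
        (1978, "DM", 42),
    ],
)

# GROUP BY produces one output row per (year, species_id) combination.
# COUNT(*) counts every row in the group, while SUM skips NULL weights.
# Dividing by 1000.0 instead of 1000 avoids integer division.
summary = conn.execute(
    "SELECT year, species_id, "
    "COUNT(*) AS total_abundance, "
    "SUM(weight) / 1000.0 AS total_biomass "
    "FROM demo_captures "
    "GROUP BY year, species_id "
    "ORDER BY year, species_id"
).fetchall()

for row in summary:
    print(row)
```

Note that the 1978 group counts two individuals but sums only one weight, because the missing weight is ignored by SUM; whether that is the desired behavior depends on the analysis.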