Learning Objectives

After completing this assignment, students should be able to:

  • create an SQL database by importing data
  • understand the basic query structure of SQL
  • execute SQL commands to select, sort, group, and aggregate data

Reading

Lecture Notes

  1. Demo Code
  2. Where Students Can Get Help in the Course
  3. Introduction to Databases
  4. Basic Queries
  5. Aggregation

Exercises

  1. -- Importing Data --

    This example will walk you through how to get data that already exists into SQLite.

    1. Download the main table for the Portal LTREB mammal survey database. It’s kind of large so it might take a few seconds. This database is published as a Data Paper on Ecological Archives, which is generally a great place to look for ecology data.
    2. Create a new database by clicking on New Database in the Database drop-down menu. Select a file name, like portal_mammals.sqlite, and a location.
    3. Click on the Import icon.
    4. Click on Select File, navigate to where you saved the data file, and select it.
    5. Select CSV. You’ll notice that you can also import from SQL files or change the characters that fields are separated or enclosed by. Make sure to select First row contains column names.
    6. Click OK when it asks if you want to modify the data.
    7. Name the table that you are importing into surveys.
    8. Identify the type for each field using the Data Type drop-down menus. If it is not obvious whether a field should be an INTEGER or a VARCHAR, check the metadata. Important: specifying the wrong data type can cause some data not to be imported and/or prevent you from doing some kinds of data manipulations.
    9. Select recordID as the Primary Key and click OK.
    10. Click OK when it asks if you are sure you want to import the data.
    11. Now import the plots and species tables in the same way.
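    If you prefer scripting the import (or the GUI misbehaves), the same result can be produced with Python's built-in csv and sqlite3 modules. This is only a sketch: it uses a two-row inline stand-in for the real surveys CSV, and only a few of the actual columns.

```python
import csv
import io
import sqlite3

# Hypothetical two-row stand-in for the real surveys file; in practice
# you would use open("surveys.csv") instead of io.StringIO.
csv_text = "recordID,year,species_id,weight\n1,1977,DS,120\n2,1978,DM,40\n"

con = sqlite3.connect(":memory:")  # or "portal_mammals.sqlite" for a file on disk
con.execute("""CREATE TABLE surveys (
    recordID INTEGER PRIMARY KEY,   -- the Primary Key chosen in step 9
    year INTEGER,
    species_id TEXT,                -- VARCHAR in the import dialog's terms
    weight INTEGER)""")

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # the first row contains column names, so skip it
con.executemany("INSERT INTO surveys VALUES (?, ?, ?, ?)", reader)
con.commit()

row_count = con.execute("SELECT COUNT(*) FROM surveys").fetchone()[0]
```

    Because the columns were declared INTEGER, SQLite's type affinity converts the numeric-looking strings from the CSV into integers on insert, which is exactly why getting the declared types right in step 8 matters.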
  2. -- SELECT --

    For this and many of the following problems you will create queries that retrieve the relevant information from the Portal small mammal survey database. As you begin to familiarize yourself with the database you will need to know some details about what it contains in order to answer the questions. For example, you may need to know what species is associated with a two-character species ID, or the units for an individual’s weight. This type of information associated with data is called metadata, and the metadata for this dataset is available online at Ecological Archives.

    1. Write a query that displays all of the records for all of the fields (*) in the main table. Save it as a view named all_survey_data.
    2. We want to generate data for an analysis of body size differences between males and females of each species. We have decided that we can ignore the information related to when and where the individuals were trapped. Create a query that returns all of the necessary information, but nothing else. Save this as size_differences_among_sexes_data.
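    The two queries above can be sketched against a tiny hypothetical surveys table. The column names follow the Portal metadata, but the rows here are invented for illustration, and the exact set of columns the size analysis needs (species, sex, weight) is one reasonable reading of the exercise.

```python
import sqlite3

# Miniature stand-in for the Portal surveys table with made-up rows.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE surveys (
    recordID INTEGER PRIMARY KEY, year INTEGER, plot_id INTEGER,
    species_id TEXT, sex TEXT, weight INTEGER)""")
con.executemany("INSERT INTO surveys VALUES (?,?,?,?,?,?)",
                [(1, 1977, 2, "DS", "M", 120),
                 (2, 1977, 3, "DM", "F", 41)])

# 1. Every field of every record, saved as a view:
con.execute("CREATE VIEW all_survey_data AS SELECT * FROM surveys")

# 2. Only what a body-size comparison between sexes needs:
con.execute("""CREATE VIEW size_differences_among_sexes_data AS
               SELECT species_id, sex, weight FROM surveys""")

size_rows = con.execute("SELECT * FROM size_differences_among_sexes_data").fetchall()
```

    A view stores the query, not a copy of the data, so both views stay current as the underlying table changes.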
  3. -- WHERE --

    A population biologist (Dr. Undomiel) who studies the population dynamics of Dipodomys spectabilis would like to use some data from Portal, but she doesn’t know how to work with large datasets. Being the kind and benevolent person that you are, write a query to extract the data that she needs. She wants only the data for her species of interest, when each individual was trapped, and what sex it was. She doesn’t care about where it was trapped within the site because she is going to analyze the entire site as a whole and she doesn’t care about the size of the individuals. She doesn’t need the species codes because you’re only providing her with the data for one species, and since she isn’t looking at the database itself the two character abbreviation would probably be confusing. Save this query as a view with the name spectabilis_population_data.

    Scroll through the results of your query. Do you notice anything that might be an issue for the scientist to whom you are providing this data? You should! Think about what you should do in this situation…

    You decide that to avoid invoking her wrath, you’ll send her a short e-mail* requesting clarification regarding what she would like you to do about this complexity. Dr. Undomiel e-mails you back and asks that you create two additional queries so that she can decide what to do about this issue later. She would like one query returning the same data as above, but only for cases where the sex is known to be male, and another with the same data, but only where the sex is known to be female. Save these as views with the names spectabilis_population_data_males and spectabilis_population_data_females.

    *Short for elven-mail

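    A sketch of the three views, again over a tiny invented table. One DS record deliberately has a NULL sex, which is the kind of complication the exercise hints at when it tells you to scroll through your results.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE surveys (
    recordID INTEGER PRIMARY KEY, year INTEGER, month INTEGER, day INTEGER,
    species_id TEXT, sex TEXT)""")
con.executemany("INSERT INTO surveys VALUES (?,?,?,?,?,?)",
                [(1, 1977, 7, 16, "DS", "M"),
                 (2, 1977, 7, 16, "DS", None),   # sex unrecorded
                 (3, 1977, 7, 16, "DM", "F")])

# When each individual was trapped and its sex, for D. spectabilis only;
# species_id itself is filtered on but not returned.
con.execute("""CREATE VIEW spectabilis_population_data AS
               SELECT year, month, day, sex FROM surveys
               WHERE species_id = 'DS'""")

# The two follow-up views restrict sex to a known value.
con.execute("""CREATE VIEW spectabilis_population_data_males AS
               SELECT year, month, day, sex FROM surveys
               WHERE species_id = 'DS' AND sex = 'M'""")
con.execute("""CREATE VIEW spectabilis_population_data_females AS
               SELECT year, month, day, sex FROM surveys
               WHERE species_id = 'DS' AND sex = 'F'""")

all_ds = con.execute("SELECT COUNT(*) FROM spectabilis_population_data").fetchone()[0]
males = con.execute("SELECT COUNT(*) FROM spectabilis_population_data_males").fetchone()[0]
```

    Note that `sex = 'M'` silently excludes NULLs, because a NULL compares as unknown to everything; that is why the record with no recorded sex appears in the first view but in neither of the other two.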
  4. -- ORDER BY --

    The graduate students that work at the Portal site are hanging out late one evening drinking… soda pop… and they decide it would be an epically awesome idea to put together a list of the 100 largest rodents ever sampled at the site. Since you’re the resident computer genius they text you, and since you’re up late working and this sounds like a lot more fun than the homework you’re working on (which isn’t really saying much, if you know what I’m saying) you decide you’ll make the list for them.

    The rules that the Portal students have come up with (and they did spend a sort of disturbingly long time coming up with these rules; I guess you just had to be there) are:

    1. The data should include the species_id, year, and the weight. These columns should be presented in this order.
    2. Individuals should be sorted in descending order with respect to mass.
    3. Since individuals often have the same mass, ties should be settled by sorting next by hindfoot_length and finally by the year.

    Since you need to limit this list to the top 100 largest rodents, you’ll need to add the SQL command LIMIT 100 to the end of the query. Save the final query as 100_largest_individuals.

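    The rules above can be sketched as follows, on a few invented rows. Two details worth noticing: a view name that starts with a digit has to be double-quoted in SQLite, and since the rules don't say which direction the tie-breaking sorts should go, the default (ascending) is assumed here for hindfoot_length and year.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE surveys (
    recordID INTEGER PRIMARY KEY, year INTEGER,
    species_id TEXT, hindfoot_length INTEGER, weight INTEGER)""")
con.executemany("INSERT INTO surveys VALUES (?,?,?,?,?)",
                [(1, 1977, "DS", 50, 120),
                 (2, 1980, "DS", 52, 150),
                 (3, 1978, "NL", 32, 150)])   # ties with record 2 on weight

# Heaviest first; ties broken by hindfoot_length, then year.
con.execute('''CREATE VIEW "100_largest_individuals" AS
               SELECT species_id, year, weight FROM surveys
               ORDER BY weight DESC, hindfoot_length, year
               LIMIT 100''')

largest = con.execute('SELECT * FROM "100_largest_individuals"').fetchall()
```

    With only three sample rows the LIMIT never triggers, but against the real table it caps the list at the top 100.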
  5. -- DISTINCT --

    Write a query that returns a list of the dates that mammal surveys took place at Portal with no duplicates. Save it as dates_sampled.

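    A sketch on invented rows, assuming dates are stored as separate year, month, and day columns as in the Portal data. DISTINCT applies to the whole combination of selected columns, so two records from the same day collapse into one row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE surveys (
    recordID INTEGER PRIMARY KEY, year INTEGER, month INTEGER, day INTEGER)""")
con.executemany("INSERT INTO surveys VALUES (?,?,?,?)",
                [(1, 1977, 7, 16),
                 (2, 1977, 7, 16),   # duplicate date, should collapse
                 (3, 1977, 8, 19)])

# One row per unique (year, month, day) combination.
con.execute("""CREATE VIEW dates_sampled AS
               SELECT DISTINCT year, month, day FROM surveys""")

dates = con.execute("SELECT * FROM dates_sampled").fetchall()
```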
  6. -- Missing Data --

    Write a query that returns the year, month, day, species_id, and weight for every record where there is no missing data in any of these fields. Save it as no_missing_data.

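    Missing values are stored as NULL, and a NULL never equals (or fails to equal) anything, so each field has to be checked with IS NOT NULL rather than with `=` or `!=`. A sketch on invented rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE surveys (
    recordID INTEGER PRIMARY KEY, year INTEGER, month INTEGER, day INTEGER,
    species_id TEXT, weight INTEGER)""")
con.executemany("INSERT INTO surveys VALUES (?,?,?,?,?,?)",
                [(1, 1977, 7, 16, "DS", 120),
                 (2, 1977, 7, 16, "DM", None),   # missing weight
                 (3, 1977, 7, 16, None, 40)])    # missing species

# Keep only records where every requested field is present.
con.execute("""CREATE VIEW no_missing_data AS
               SELECT year, month, day, species_id, weight FROM surveys
               WHERE year IS NOT NULL AND month IS NOT NULL
                 AND day IS NOT NULL AND species_id IS NOT NULL
                 AND weight IS NOT NULL""")

complete = con.execute("SELECT * FROM no_missing_data").fetchall()
```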
  7. -- GROUP BY --

    Using GROUP BY, write a query that returns a list of dates on which individuals of the species Dipodomys spectabilis (indicated by the DS species code) were trapped (with no duplicates). Sort the list in chronological order (from oldest to newest). Save it as dates_with_dipodomys_spectabilis.

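    Grouping on all three date columns collapses duplicate dates, doing the same de-duplication job DISTINCT did in the earlier exercise. A sketch on invented rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE surveys (
    recordID INTEGER PRIMARY KEY, year INTEGER, month INTEGER, day INTEGER,
    species_id TEXT)""")
con.executemany("INSERT INTO surveys VALUES (?,?,?,?,?)",
                [(1, 1978, 3, 11, "DS"),
                 (2, 1978, 3, 11, "DS"),   # same date, should collapse
                 (3, 1977, 7, 16, "DS"),
                 (4, 1978, 3, 11, "DM")])  # wrong species, excluded

# One row per date on which a DS individual was trapped, oldest first.
con.execute("""CREATE VIEW dates_with_dipodomys_spectabilis AS
               SELECT year, month, day FROM surveys
               WHERE species_id = 'DS'
               GROUP BY year, month, day
               ORDER BY year, month, day""")

ds_dates = con.execute("SELECT * FROM dates_with_dipodomys_spectabilis").fetchall()
```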
  8. -- COUNT --

    Write a query that returns the number of individuals of all known species combined (total_abundance) in each year, sorted chronologically. Include the year in the output. Save it as total_abundance_by_year.

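    A sketch on invented rows. "All known species" is read here as records whose species_id is not NULL; check the metadata in case the real data also uses non-species codes that should be excluded.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE surveys (
    recordID INTEGER PRIMARY KEY, year INTEGER, species_id TEXT)""")
con.executemany("INSERT INTO surveys VALUES (?,?,?)",
                [(1, 1977, "DS"), (2, 1977, "DM"),
                 (3, 1978, "DS"), (4, 1978, None)])  # species unknown

# COUNT(*) counts the rows in each year's group.
con.execute("""CREATE VIEW total_abundance_by_year AS
               SELECT year, COUNT(*) AS total_abundance FROM surveys
               WHERE species_id IS NOT NULL
               GROUP BY year
               ORDER BY year""")

abundance = con.execute("SELECT * FROM total_abundance_by_year").fetchall()
```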
  9. -- SUM --

    Write a query that returns the number of individuals of each species captured in each year (total_abundance) and the total_biomass of those individuals (the sum of the weight). The units for biomass should be in kilograms. Include the year and species_id in the output. Sort the result chronologically by year and then alphabetically by species. Save as mass_abundance_data.

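    A sketch on invented rows, assuming (per the Portal metadata) that weights are recorded in grams, so dividing the summed weight by 1000.0 converts the total to kilograms; the .0 forces floating-point rather than integer division.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE surveys (
    recordID INTEGER PRIMARY KEY, year INTEGER, species_id TEXT, weight INTEGER)""")
con.executemany("INSERT INTO surveys VALUES (?,?,?,?)",
                [(1, 1977, "DS", 120), (2, 1977, "DS", 130),
                 (3, 1977, "DM", 40)])

# One row per (year, species) pair: how many individuals, and their
# combined mass in kilograms.
con.execute("""CREATE VIEW mass_abundance_data AS
               SELECT year, species_id,
                      COUNT(*) AS total_abundance,
                      SUM(weight) / 1000.0 AS total_biomass
               FROM surveys
               GROUP BY year, species_id
               ORDER BY year, species_id""")

biomass = con.execute("SELECT * FROM mass_abundance_data").fetchall()
```

    Grouping by both year and species_id is what splits the counts and sums per species within each year, rather than pooling the whole year as the previous exercise did.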