2.5 Introduction to R

R is a free, open-source programming language and environment designed for statistical computing and data visualization. It is one of the most widely used tools in the biological and ecological sciences, and has become a standard for biodiversity data analysis. Unlike spreadsheet software, R allows you to write reproducible analysis workflows, meaning that every step of your analysis is documented, repeatable, and transparent.

In this tutorial you will take your first steps in R: setting up a project, organizing your files, learning how to work with data objects, and performing a small but complete analysis using real plant occurrence data from the Global Biodiversity Information Facility (GBIF). Everything we do here uses base R; no additional packages are needed until the final GBIF exercise.

2.5.1 Learning Objectives

After completing this exercise you will be able to:

Create an R Project and a Quarto document, and explain the advantages of each
Set up a reproducible folder structure from within R
Navigate directories and work with basic R objects and data frames
Import and inspect a CSV file using head(), str(), summary(), names(), and colnames()
Subset and index data frames with brackets and logical conditions
Apply unique() to explore categorical variables
Save processed data as a CSV file
Download GBIF occurrence records using the rgbif package and perform basic exploratory analysis

2.5.2 Recommended readings

Chamberlain S, Barve V, Mcglinn D, Oldoni D, Desmet P, Geffert L & Ram K (2024) rgbif: Interface to the Global Biodiversity Information Facility API. R package. https://docs.ropensci.org/rgbif/
GBIF (2024) What is GBIF? https://www.gbif.org/what-is-gbif
Quarto (2024) Get Started. https://quarto.org/docs/get-started/

2.5.3 Tutorial

2.5.3.1 R Projects

2.5.3.1.1 What is an R Project?

An R Project (.Rproj file) is a container that links a working directory to a set of project-specific settings in RStudio. When you open an R Project, RStudio automatically sets the working directory to the project folder. This keeps file paths short, portable, and consistent across computers, which is a cornerstone of reproducible research.

Without an R Project, you need to manually call setwd("some/very/long/path/on/my/computer") at the top of every script. That path will break as soon as you move the folder or share it with a colleague.

2.5.3.1.2 Working directory and `setwd()`

The working directory is the folder on your computer where R looks for files and saves outputs by default. You can check it at any time with getwd(), and change it manually with setwd():

# Show the current working directory
getwd()

## [1] "C:/Users/ploti/Dropbox/Work/Teaching/02_Biodiversity of plants/R_exercise/bsc-biodiv-der-pflanzen"

# Change it manually (example, no need to run this now)
# setwd("C:/Users/YourName/Documents/my_project")

The problem with setwd() is that the path is unique to your computer. If you send your script to a colleague or move the folder, the path breaks and the script stops working. For this reason, we will not use setwd() in this course. Instead, we use R Projects (see below), which set the working directory automatically and portably.

2.5.3.1.3 Creating an R Project

In RStudio:

Go to File → New Project…
Choose New Directory → New Project
Give it a name (e.g., plants_are_fun) and choose a location
Click Create Project

RStudio will restart, and the working directory is automatically set to the project folder; no setwd() is needed. You can confirm this with getwd():

getwd()

## [1] "C:/Users/ploti/Dropbox/Work/Teaching/02_Biodiversity of plants/R_exercise/bsc-biodiv-der-pflanzen"
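Once the working directory is the project folder, files can be referred to with short relative paths. The following is a minimal sketch using the base R functions file.path() and file.exists(); the file name is the one used later in this tutorial, and the last call simply reports FALSE if the file has not been downloaded yet.

```r
# Build a path relative to the project folder; file.path() joins the pieces with "/"
csv_path <- file.path("data", "my_records.csv")
csv_path
## [1] "data/my_records.csv"

# Check whether the file is already there (TRUE or FALSE, depending on your folder)
file.exists(csv_path)
```

Because the path does not start with a drive letter or a user name, the same line works unchanged on any computer where the project folder is opened as an R Project.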

2.5.3.2 Setting up a project folder structure

Good data management starts with a clear folder structure. We will create three folders inside our project directory directly from R.

2.5.3.2.1 Creating folders with R

First, create an R script file to hold your code. From the menu, choose File → New File → R Script, then name the file and save it in the project folder. All the code in this tutorial goes into this file.

Let’s start with the first function:

A function call looks like a name (one word, or several words joined without spaces) followed by parentheses. The values inside the parentheses are the arguments. To check what a function does and read the explanation of its arguments, run ?function_name to open the help panel.
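As a small illustration of the name-plus-arguments pattern, here is a sketch that uses seq(), a base R function chosen only as an example. It shows that arguments can be passed by name or by position, and that formals() lists the argument names of a function without opening the help page.

```r
# Arguments passed by name
by_name <- seq(from = 1, to = 10, by = 3)
by_name
## [1]  1  4  7 10

# The same call with arguments passed by position
by_position <- seq(1, 10, 3)
identical(by_name, by_position)
## [1] TRUE

# List the argument names of a function
arg_names <- names(formals(dir.create))
arg_names
## [1] "path"         "showWarnings" "recursive"    "mode"
```

Naming the arguments makes a script easier to read, especially for functions with many options.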

The function dir.create() creates a new folder. The argument showWarnings = FALSE silences the warning if the folder already exists.

#?dir.create

# Create project sub-folders
dir.create("data",    showWarnings = FALSE)
dir.create("plots",   showWarnings = FALSE)
dir.create("output",  showWarnings = FALSE)

Your project should now look like this:

plants_are_fun/
├── data/       ← raw input data (CSV files, shapefiles, …)
├── plots/      ← figures and maps you produce
├── output/     ← processed data and result tables
└── my_cool_project.R
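The three dir.create() calls can also be written as a loop and checked from R. This is only a sketch: it works inside a temporary folder (the hypothetical demo_project) so that nothing is written into your real project, and it uses the base R functions dir.exists() and list.files() to verify the result.

```r
# Hypothetical stand-in for the project folder, created in R's temporary directory
project_root <- file.path(tempdir(), "demo_project")
dir.create(project_root, showWarnings = FALSE)

# Create each sub-folder in turn
folders <- c("data", "plots", "output")
for (folder in folders) {
  dir.create(file.path(project_root, folder), showWarnings = FALSE)
}

# dir.exists() returns one TRUE/FALSE per folder
created <- dir.exists(file.path(project_root, folders))
created
## [1] TRUE TRUE TRUE

# list.files() shows what is inside the folder (sorted alphabetically)
list.files(project_root)
## [1] "data"   "output" "plots"
```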

2.5.3.3 Basic R objects

Before loading real data, let us get comfortable with the basic building blocks of R.

2.5.3.3.1 Variables

You assign a value to a variable with the assignment operator <-.

# Assign a number
x <- 2026

# Assign a character string
my_name <- "Maria"

# Print by typing the variable name
x

## [1] 2026

my_name

## [1] "Maria"
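Variables can be used wherever a value is expected. The short sketch below reuses x and my_name from above; paste() is a base R function for joining character strings, and the names years_since and greeting are made up for this illustration.

```r
x <- 2026
my_name <- "Maria"

# Numbers can be used in arithmetic
years_since <- x - 1987
years_since
## [1] 39

# paste() joins character strings (separated by a space by default)
greeting <- paste("Hello,", my_name)
greeting
## [1] "Hello, Maria"

# Assigning again overwrites the old value
x <- x + 1
x
## [1] 2027
```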

2.5.3.3.2 Vectors

A vector is a sequence of values of the same type, created with c() (combine).

# A numeric vector
heights <- c(1.2, 0.8, 1.5, 1.1, 2.0)

# A character vector
species <- c("Viola odorata", "Primula vulgaris", "Ranunculus acris")

# Check the type 
class(heights)

## [1] "numeric"

class(species)

## [1] "character"

length(heights)

## [1] 5
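Two properties of vectors are worth trying out right away: operations apply to all elements at once, and all elements share one type. The sketch below reuses the heights vector; the names tall and mixed are made up for this illustration.

```r
heights <- c(1.2, 0.8, 1.5, 1.1, 2.0)

# Operations apply to every element at once
heights * 100
## [1] 120  80 150 110 200

mean(heights)
## [1] 1.32

# Square brackets pick elements by position or by a logical condition
heights[2]
## [1] 0.8

tall <- heights[heights > 1]
tall
## [1] 1.2 1.5 1.1 2.0

# Mixing types forces every element to one type (here: character)
mixed <- c(1.2, "tall")
class(mixed)
## [1] "character"
```

The last example explains a common surprise: a single stray text entry in a numeric column turns the whole column into character.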

2.5.3.3.3 Data frames

A data frame is a table where each column is a vector. This is the most important data structure for data analysis in R.

# Create a small data frame manually
df <- data.frame(
  species = c("Viola odorata", "Primula vulgaris", "Ranunculus acris"),
  family  = c("Violaceae", "Primulaceae", "Ranunculaceae"),
  year    = c(1987, 2001, 1965)
)

# Print it
df

##            species        family year
## 1    Viola odorata     Violaceae 1987
## 2 Primula vulgaris   Primulaceae 2001
## 3 Ranunculus acris Ranunculaceae 1965
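Because each column is a vector, a new column can be computed from an existing one in a single step. The following sketch rebuilds the small data frame and adds a hypothetical decade column; it uses the $ operator, which is covered in more detail in the section on indexing.

```r
df <- data.frame(
  species = c("Viola odorata", "Primula vulgaris", "Ranunculus acris"),
  family  = c("Violaceae", "Primulaceae", "Ranunculaceae"),
  year    = c(1987, 2001, 1965)
)

# Compute a new column from an existing one (vectorized over all rows)
df$decade <- floor(df$year / 10) * 10
df
##            species        family year decade
## 1    Viola odorata     Violaceae 1987   1980
## 2 Primula vulgaris   Primulaceae 2001   2000
## 3 Ranunculus acris Ranunculaceae 1965   1960

# The data frame now has 3 rows and 4 columns
dim(df)
## [1] 3 4
```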

2.5.3.4 Importing data

Real data usually comes as a CSV file. We load it with read.csv().

Note: read.csv() is base R. It expects comma-separated values and treats the first row as column names by default.

Download the file my_records.csv from ILIAS (practice folder Introduction to R) and save it in the data/ folder of your project.

# Read a CSV file from the data/ folder
records <- read.csv("data/my_records.csv")

# Keep only the columns we need, to make the data easier to handle.
# This uses bracket indexing, which is explained in the section on indexing and subsetting.
records <- records[, c("key", "taxonID", "family", "genus", "species", "scientificName",
                       "decimalLatitude", "decimalLongitude", "taxonRank", "basisOfRecord",
                       "year", "country", "countryCode", "stateProvince", "locality",
                       "iucnRedListCategory", "individualCount", "occurrenceStatus")]
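To see how read.csv() treats the header row and empty cells without needing the course file, here is a self-contained sketch. It writes a tiny made-up CSV to a temporary file (demo_path is a hypothetical name) and reads it back.

```r
# Write a tiny CSV to a temporary file; the second record has no year
demo_path <- tempfile(fileext = ".csv")
writeLines(
  c("species,year",
    "Viola odorata,1987",
    "Primula vulgaris,"),
  demo_path
)

demo <- read.csv(demo_path)

# The first row became the column names
names(demo)
## [1] "species" "year"

# The empty cell in a numeric column became NA
demo$year
## [1] 1987   NA

class(demo$year)
## [1] "integer"
```

Files exported from spreadsheet software in some locales use semicolons as separators; in that case base R's read.csv2() or the sep argument of read.csv() is the place to look.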

2.5.3.4.1 Inspecting the data

Once loaded, always start by looking at the data before doing anything else.

# First 6 rows
head(records)

##          key      taxonID         family      genus               species
## 1  182368733         <NA>     Rhamnaceae   Frangula        Frangula alnus
## 2 1291505203         <NA>     Vireonidae Pteruthius Pteruthius aenobarbus
## 3 1826325281         <NA>    Geometridae   Agriopis   Agriopis marginaria
## 4 2238275141        15086 Strophariaceae  Hypholoma Hypholoma subericaeum
## 5 2973785847         <NA>      Bufonidae   Rhinella                  <NA>
## 6 2979435480 BOLD:ACR5517      Oonopidae       <NA>                  <NA>
##                           scientificName decimalLatitude decimalLongitude
## 1        Frangula dodonei subsp. dodonei        47.66690          2.26047
## 2 Pteruthius aenobarbus (Temminck, 1836)              NA               NA
## 3        Agriopis marginaria (Fabricius)        52.93230         -6.23110
## 4     Hypholoma subericaeum (Fr.) Kühner        56.38486          9.89984
## 5               Rhinella Fitzinger, 1826       -20.44553        -41.87311
## 6                           BOLD:ACR5517        10.76060        -85.33500
##    taxonRank      basisOfRecord year                          country
## 1 SUBSPECIES        OBSERVATION 2026                           France
## 2    SPECIES PRESERVED_SPECIMEN 2026 Lao People’s Democratic Republic
## 3    SPECIES PRESERVED_SPECIMEN 2026                          Ireland
## 4    SPECIES PRESERVED_SPECIMEN 2026                          Denmark
## 5      GENUS PRESERVED_SPECIMEN 2026                           Brazil
## 6   UNRANKED    MATERIAL_SAMPLE 2026                       Costa Rica
##   countryCode       stateProvince                   locality
## 1          FR                <NA>                      ISDES
## 2          LA                <NA>              Xieng-Khowang
## 3          IE             Wicklow         Township: Rathdrum
## 4          DK                <NA>              Langå Egeskov
## 5          BR        Minas Gerais Parque Nacional do Caparaó
## 6          CR Guanacaste Province                     PL12-8
##   iucnRedListCategory individualCount occurrenceStatus
## 1                <NA>              NA          PRESENT
## 2                  LC              NA          PRESENT
## 3                <NA>               1          PRESENT
## 4                <NA>              NA          PRESENT
## 5                <NA>               1          PRESENT
## 6                <NA>              NA          PRESENT

# Last 6 rows
tail(records)

##             key      taxonID        family        genus species scientificName
## 995  3442797880 BOLD:ACD6520 Ichneumonidae  Stenomacrus    <NA>   BOLD:ACD6520
## 996  3442797881 BOLD:AEJ2062  Chironomidae         <NA>    <NA>   BOLD:AEJ2062
## 997  3442797882 BOLD:ACT2121   Psychodidae         <NA>    <NA>   BOLD:ACT2121
## 998  3442797883 BOLD:ACT2121   Psychodidae         <NA>    <NA>   BOLD:ACT2121
## 999  3442797884 BOLD:ACE1029     Sciaridae Pseudosciara    <NA>   BOLD:ACE1029
## 1000 3442798206 BOLD:ACR5891  Bostrichidae         <NA>    <NA>   BOLD:ACR5891
##      decimalLatitude decimalLongitude taxonRank   basisOfRecord year    country
## 995          10.7606         -85.3350  UNRANKED MATERIAL_SAMPLE 2026 Costa Rica
## 996          10.7606         -85.3350  UNRANKED MATERIAL_SAMPLE 2026 Costa Rica
## 997          10.7606         -85.3350  UNRANKED MATERIAL_SAMPLE 2026 Costa Rica
## 998          10.7606         -85.3350  UNRANKED MATERIAL_SAMPLE 2026 Costa Rica
## 999          10.7606         -85.3350  UNRANKED MATERIAL_SAMPLE 2026 Costa Rica
## 1000         10.7634         -85.3347  UNRANKED MATERIAL_SAMPLE 2026 Costa Rica
##      countryCode       stateProvince locality iucnRedListCategory
## 995           CR Guanacaste Province   PL12-8                <NA>
## 996           CR Guanacaste Province   PL12-8                <NA>
## 997           CR Guanacaste Province   PL12-8                <NA>
## 998           CR Guanacaste Province   PL12-8                <NA>
## 999           CR Guanacaste Province   PL12-8                <NA>
## 1000          CR Guanacaste Province   PL12-2                <NA>
##      individualCount occurrenceStatus
## 995               NA          PRESENT
## 996               NA          PRESENT
## 997               NA          PRESENT
## 998               NA          PRESENT
## 999               NA          PRESENT
## 1000              NA          PRESENT

# Structure: column names, types, and first values
str(records)

## 'data.frame':    1000 obs. of  18 variables:
##  $ key                : num  1.82e+08 1.29e+09 1.83e+09 2.24e+09 2.97e+09 ...
##  $ taxonID            : chr  NA NA NA "15086" ...
##  $ family             : chr  "Rhamnaceae" "Vireonidae" "Geometridae" "Strophariaceae" ...
##  $ genus              : chr  "Frangula" "Pteruthius" "Agriopis" "Hypholoma" ...
##  $ species            : chr  "Frangula alnus" "Pteruthius aenobarbus" "Agriopis marginaria" "Hypholoma subericaeum" ...
##  $ scientificName     : chr  "Frangula dodonei subsp. dodonei" "Pteruthius aenobarbus (Temminck, 1836)" "Agriopis marginaria (Fabricius)" "Hypholoma subericaeum (Fr.) Kühner" ...
##  $ decimalLatitude    : num  47.7 NA 52.9 56.4 -20.4 ...
##  $ decimalLongitude   : num  2.26 NA -6.23 9.9 -41.87 ...
##  $ taxonRank          : chr  "SUBSPECIES" "SPECIES" "SPECIES" "SPECIES" ...
##  $ basisOfRecord      : chr  "OBSERVATION" "PRESERVED_SPECIMEN" "PRESERVED_SPECIMEN" "PRESERVED_SPECIMEN" ...
##  $ year               : int  2026 2026 2026 2026 2026 2026 2026 2026 2026 2026 ...
##  $ country            : chr  "France" "Lao People’s Democratic Republic" "Ireland" "Denmark" ...
##  $ countryCode        : chr  "FR" "LA" "IE" "DK" ...
##  $ stateProvince      : chr  NA NA "Wicklow" NA ...
##  $ locality           : chr  "ISDES" "Xieng-Khowang" "Township: Rathdrum" "Langå Egeskov" ...
##  $ iucnRedListCategory: chr  NA "LC" NA NA ...
##  $ individualCount    : int  NA NA 1 NA 1 NA NA NA NA NA ...
##  $ occurrenceStatus   : chr  "PRESENT" "PRESENT" "PRESENT" "PRESENT" ...

# Statistical summary of each column
summary(records)

##       key              taxonID             family             genus          
##  Min.   :1.824e+08   Length:1000        Length:1000        Length:1000       
##  1st Qu.:3.442e+09   Class :character   Class :character   Class :character  
##  Median :3.442e+09   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :3.422e+09                                                           
##  3rd Qu.:3.442e+09                                                           
##  Max.   :3.443e+09                                                           
##                                                                              
##    species          scientificName     decimalLatitude  decimalLongitude
##  Length:1000        Length:1000        Min.   :-22.89   Min.   :-85.34  
##  Class :character   Class :character   1st Qu.: 10.76   1st Qu.:-85.33  
##  Mode  :character   Mode  :character   Median : 10.76   Median :-85.33  
##                                        Mean   : 10.83   Mean   :-84.83  
##                                        3rd Qu.: 10.76   3rd Qu.:-85.33  
##                                        Max.   : 56.38   Max.   : 73.75  
##                                        NA's   :1        NA's   :1       
##   taxonRank         basisOfRecord           year        country         
##  Length:1000        Length:1000        Min.   :2026   Length:1000       
##  Class :character   Class :character   1st Qu.:2026   Class :character  
##  Mode  :character   Mode  :character   Median :2026   Mode  :character  
##                                        Mean   :2026                     
##                                        3rd Qu.:2026                     
##                                        Max.   :2026                     
##                                                                         
##  countryCode        stateProvince        locality         iucnRedListCategory
##  Length:1000        Length:1000        Length:1000        Length:1000        
##  Class :character   Class :character   Class :character   Class :character   
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  individualCount occurrenceStatus  
##  Min.   :1       Length:1000       
##  1st Qu.:1       Class :character  
##  Median :1       Mode  :character  
##  Mean   :1                         
##  3rd Qu.:1                         
##  Max.   :1                         
##  NA's   :998

str() is particularly useful: it tells you at a glance how many rows and columns you have, what type each column is (numeric, character, factor), and what the first few values look like.
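summary() reports the number of NA values only for numeric columns. One way to count missing values in every column is to combine is.na() with colSums(), both base R. The sketch below uses a small made-up data frame (toy) rather than the course file.

```r
toy <- data.frame(
  species         = c("Frangula alnus", NA, "Agriopis marginaria"),
  year            = c(2026, 2026, 2026),
  individualCount = c(NA, NA, 1)
)

# is.na() gives TRUE/FALSE per cell; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))

# Which columns contain at least one missing value?
names(na_counts)[na_counts > 0]
## [1] "species"         "individualCount"

# How many values are missing in one particular column?
na_counts[["individualCount"]]
## [1] 2
```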

2.5.3.4.2 Column names

# Get column names (both are equivalent)
names(records)

##  [1] "key"                 "taxonID"             "family"             
##  [4] "genus"               "species"             "scientificName"     
##  [7] "decimalLatitude"     "decimalLongitude"    "taxonRank"          
## [10] "basisOfRecord"       "year"                "country"            
## [13] "countryCode"         "stateProvince"       "locality"           
## [16] "iucnRedListCategory" "individualCount"     "occurrenceStatus"

colnames(records)

##  [1] "key"                 "taxonID"             "family"             
##  [4] "genus"               "species"             "scientificName"     
##  [7] "decimalLatitude"     "decimalLongitude"    "taxonRank"          
## [10] "basisOfRecord"       "year"                "country"            
## [13] "countryCode"         "stateProvince"       "locality"           
## [16] "iucnRedListCategory" "individualCount"     "occurrenceStatus"

# Number of rows and columns
nrow(records)

## [1] 1000

ncol(records)

## [1] 18

# Both at once
dim(records)

## [1] 1000   18

2.5.3.5 Indexing and subsetting

2.5.3.5.1 Bracket notation: `[row, column]`

Data frames are indexed with [rows, columns]. Leaving a position blank means “all”.

# First row, all columns
records[1, ]

##         key taxonID     family    genus        species
## 1 182368733    <NA> Rhamnaceae Frangula Frangula alnus
##                    scientificName decimalLatitude decimalLongitude  taxonRank
## 1 Frangula dodonei subsp. dodonei         47.6669          2.26047 SUBSPECIES
##   basisOfRecord year country countryCode stateProvince locality
## 1   OBSERVATION 2026  France          FR          <NA>    ISDES
##   iucnRedListCategory individualCount occurrenceStatus
## 1                <NA>              NA          PRESENT

# First 3 rows, all columns
records[1:3, ]

##          key taxonID      family      genus               species
## 1  182368733    <NA>  Rhamnaceae   Frangula        Frangula alnus
## 2 1291505203    <NA>  Vireonidae Pteruthius Pteruthius aenobarbus
## 3 1826325281    <NA> Geometridae   Agriopis   Agriopis marginaria
##                           scientificName decimalLatitude decimalLongitude
## 1        Frangula dodonei subsp. dodonei         47.6669          2.26047
## 2 Pteruthius aenobarbus (Temminck, 1836)              NA               NA
## 3        Agriopis marginaria (Fabricius)         52.9323         -6.23110
##    taxonRank      basisOfRecord year                          country
## 1 SUBSPECIES        OBSERVATION 2026                           France
## 2    SPECIES PRESERVED_SPECIMEN 2026 Lao People’s Democratic Republic
## 3    SPECIES PRESERVED_SPECIMEN 2026                          Ireland
##   countryCode stateProvince           locality iucnRedListCategory
## 1          FR          <NA>              ISDES                <NA>
## 2          LA          <NA>      Xieng-Khowang                  LC
## 3          IE       Wicklow Township: Rathdrum                <NA>
##   individualCount occurrenceStatus
## 1              NA          PRESENT
## 2              NA          PRESENT
## 3               1          PRESENT

# Row 3, column 2 (a single value)
records[3, 2]

## [1] NA

# Row 1, column 3
records[1, 3]

## [1] "Rhamnaceae"
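Brackets also accept column names and negative positions. The sketch below demonstrates this on a small made-up data frame (toy), so that it runs without the course file; family_vec and family_df are names chosen for this illustration.

```r
toy <- data.frame(
  species = c("Viola odorata", "Primula vulgaris", "Ranunculus acris"),
  family  = c("Violaceae", "Primulaceae", "Ranunculaceae"),
  year    = c(1987, 2001, 1965)
)

# Select columns by name instead of by position
toy[, c("species", "year")]
##            species year
## 1    Viola odorata 1987
## 2 Primula vulgaris 2001
## 3 Ranunculus acris 1965

# A negative index drops rows (or columns)
toy[-1, ]
##            species        family year
## 2 Primula vulgaris   Primulaceae 2001
## 3 Ranunculus acris Ranunculaceae 1965

# A single column comes back as a vector unless drop = FALSE is set
family_vec <- toy[, "family"]
family_df  <- toy[, "family", drop = FALSE]
class(family_vec)
## [1] "character"
class(family_df)
## [1] "data.frame"
```

Selecting by name is usually safer than selecting by position, because it keeps working if the column order changes.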

2.5.3.5.2 The `$` operator

Use $ to access a single column by name. The result is a vector.

# Access the scientific name column (the Darwin Core term 'scientificName'), first 10 values
records$scientificName[1:10]

##  [1] "Frangula dodonei subsp. dodonei"       
##  [2] "Pteruthius aenobarbus (Temminck, 1836)"
##  [3] "Agriopis marginaria (Fabricius)"       
##  [4] "Hypholoma subericaeum (Fr.) Kühner"    
##  [5] "Rhinella Fitzinger, 1826"              
##  [6] "BOLD:ACR5517"                          
##  [7] "BOLD:ACD1886"                          
##  [8] "BOLD:ACR5517"                          
##  [9] "BOLD:AEE2514"                          
## [10] "BOLD:ACR5517"

# Access the 'year' column
records$year[1:10]

##  [1] 2026 2026 2026 2026 2026 2026 2026 2026 2026 2026

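This is a messy dataset: some records carry a full scientific name with authorship, while others only have a BOLD identifier. This is why we must always clean data before analyzing it.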

2.5.3.5.3 Unique values

unique() returns the distinct values in a vector (useful for exploring categorical columns).

# Which countries appear in the data?
unique(records$country)

## [1] "France"                           "Lao People’s Democratic Republic"
## [3] "Ireland"                          "Denmark"                         
## [5] "Brazil"                           "Costa Rica"                      
## [7] "India"

# Which basis-of-record types are present?
unique(records$basisOfRecord)

## [1] "OBSERVATION"        "PRESERVED_SPECIMEN" "MATERIAL_SAMPLE"   
## [4] "HUMAN_OBSERVATION"
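To count how many distinct values there are, wrap unique() in length(). Note that unique() treats NA as a value of its own. The sketch below uses a made-up vector (countries) to show the difference.

```r
countries <- c("France", "Brazil", "Brazil", NA, "India", "France")

# NA appears in the result like any other value
unique(countries)
## [1] "France" "Brazil" NA       "India"

length(unique(countries))
## [1] 4

# Drop missing values first to count only real countries
n_countries <- length(unique(countries[!is.na(countries)]))
n_countries
## [1] 3
```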

2.5.3.5.4 Logical subsetting (filtering)

You can filter rows by putting a logical condition in the row position.

# Keep only rows where country is "Brazil"
brazil <- records[records$country == "Brazil", ]

# Keep only rows from 2000 onwards
recent <- records[records$year >= 2000, ]

# Combine conditions with & (AND) or | (OR)
recent_brazil <- records[records$year >= 2000 & records$country == "Brazil", ]
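One pitfall with logical subsetting is missing values: a comparison with NA yields NA, and a data frame indexed with NA returns an extra row filled with NA. The sketch below shows the effect on a small made-up data frame (toy) and two base R ways around it, which() and %in%.

```r
toy <- data.frame(
  country = c("Brazil", NA, "France", "Brazil"),
  year    = c(1999, 2005, 2010, 2020)
)

# The comparison is NA for the record without a country ...
toy$country == "Brazil"
## [1]  TRUE    NA FALSE  TRUE

# ... so the naive filter returns an extra all-NA row
naive <- toy[toy$country == "Brazil", ]
nrow(naive)
## [1] 3

# which() keeps only the positions that are TRUE
safe <- toy[which(toy$country == "Brazil"), ]
nrow(safe)
## [1] 2

# %in% never returns NA and can match several values at once
several <- toy[toy$country %in% c("Brazil", "France"), ]
nrow(several)
## [1] 3
```

If a filtered data set has more rows than expected, or rows in which every value is NA, this is the first thing to check.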

2.5.3.6 Saving data

2.5.3.6.1 Save a data frame as CSV

Use write.csv() to write a data frame to disk. Setting row.names = FALSE prevents R from adding a redundant row-number column.

write.csv(recent_brazil, "output/Brazil_recent_records.csv", row.names = FALSE)
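The effect of row.names = FALSE is easiest to see by writing the same data both ways and reading it back. This sketch uses a made-up data frame and temporary files (path_clean and path_rownames are hypothetical names), so nothing is written into your output/ folder.

```r
toy <- data.frame(
  species = c("Viola odorata", "Primula vulgaris"),
  year    = c(1987, 2001)
)

path_clean    <- tempfile(fileext = ".csv")
path_rownames <- tempfile(fileext = ".csv")

write.csv(toy, path_clean, row.names = FALSE)
write.csv(toy, path_rownames)   # default: row.names = TRUE

# Without row names, the columns come back unchanged
names(read.csv(path_clean))
## [1] "species" "year"

# With row names, an extra column (named "X" on import) appears
names(read.csv(path_rownames))
## [1] "X"       "species" "year"
```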

2.5.3.6.2 Save and load R objects

To save R objects (e.g., a processed data frame) in R’s native binary format, use save(). This preserves all data types exactly and is faster to reload than CSV.

# Save one or more objects
save(brazil, file = "output/Brazil_records.RData")

# Reload them in a future session
load("output/Brazil_records.RData")
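Unlike read.csv(), load() does not return the data for you to assign: it restores each object under the name it had when it was saved. The following sketch illustrates this with a made-up object (brazil_demo) and a temporary file.

```r
brazil_demo <- data.frame(
  species = c("Epidendrum fimbriatum", "Epidendrum rigidum"),
  year    = c(2026, 2026)
)

rdata_path <- tempfile(fileext = ".RData")
save(brazil_demo, file = rdata_path)

# Remove the object to simulate a fresh session
rm(brazil_demo)
exists("brazil_demo")
## [1] FALSE

# load() restores the object and (invisibly) returns the names it loaded
loaded <- load(rdata_path)
loaded
## [1] "brazil_demo"

exists("brazil_demo")
## [1] TRUE
```

Keep in mind that load() silently overwrites any object in your session that has the same name as one stored in the file.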

2.5.3.7 The use of packages and libraries: downloading GBIF data with `rgbif`

In the last class we learned what GBIF is and what type of data it gathers. We can download the data directly from the website, but the R package rgbif provides direct access to the GBIF API from within R.

We first need to install and load the package:

# Install once (remove the # to run)
# install.packages("rgbif")

library(rgbif)

## Warning: package 'rgbif' was built under R version 4.5.2

Find the taxon key: GBIF identifies taxa with numeric keys. Use name_backbone() to look up the key for a name of interest.

# Look up the taxon
taxon_info <- name_backbone(name = "Epidendrum", rank = "genus")
taxon_info

## # A tibble: 1 × 26
##   usageKey scientificName canonicalName authorship rank  status   type      
## * <chr>    <chr>          <chr>         <chr>      <chr> <chr>    <chr>     
## 1 5297341  Epidendrum L.  Epidendrum    L.         GENUS ACCEPTED SCIENTIFIC
## # ℹ 19 more variables: formattedName <chr>, matchType <chr>, confidence <int>,
## #   timeTaken <int>, kingdom <chr>, phylum <chr>, class <chr>, order <chr>,
## #   family <chr>, genus <chr>, kingdomKey <chr>, phylumKey <chr>,
## #   classKey <chr>, orderKey <chr>, familyKey <chr>, genusKey <chr>,
## #   value <lgl>, verbatim_name <chr>, verbatim_rank <chr>

# usageKey is the key of the matched taxon (here the genus Epidendrum, see scientificName);
# the result also contains the keys of the higher ranks (familyKey, orderKey, etc.)

# Extract the key
genusKey <- taxon_info$genusKey
genusKey

## [1] "5297341"

Download occurrence records: occ_search() queries the GBIF occurrence API. We set limit to cap the number of records returned (the maximum is 100,000 per call; for larger datasets, use occ_download()).

# Download up to 1000 records
gbif_raw <- occ_search(
  taxonKey = genusKey,
  limit    = 1000
)

# The actual data frame is stored in the $data element
records <- gbif_raw$data

First look at the data

head(records)

## # A tibble: 6 × 127
##   key        scientificName   decimalLatitude decimalLongitude issues datasetKey
##   <chr>      <chr>                      <dbl>            <dbl> <chr>  <chr>     
## 1 5938085880 Epidendrum radi…            9.84            -83.9 cdc,c… 50c9509d-…
## 2 5938101363 Epidendrum cent…           10.3             -84.8 cdc,c… 50c9509d-…
## 3 5938125049 Epidendrum radi…            8.78            -82.4 cdc,c… 50c9509d-…
## 4 5938130699 Epidendrum fimb…            5.08            -75.4 cdc,c… 50c9509d-…
## 5 5938145759 Epidendrum fimb…            6.29            -75.5 cdc,c… 50c9509d-…
## 6 5938212569 Epidendrum park…           16.0             -96.5 cdc,c… 50c9509d-…
## # ℹ 121 more variables: publishingOrgKey <chr>, installationKey <chr>,
## #   hostingOrganizationKey <chr>, publishingCountry <chr>, protocol <chr>,
## #   lastCrawled <chr>, lastParsed <chr>, crawlId <int>, basisOfRecord <chr>,
## #   occurrenceStatus <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
## #   classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
## #   speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>,
## #   kingdom <chr>, phylum <chr>, order <chr>, family <chr>, genus <chr>, …

# The download has many columns; you can keep only the ones you need with bracket indexing, as shown above
dim(records)

## [1] 1000  127
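Selecting columns by name fails with an error if one of the requested columns is absent. A defensive pattern, sketched below under the assumption that the set of columns can differ from one download to the next, is to compare the wanted names with names() first. The data frame toy and the wanted column elevation are made up for this illustration; setdiff() and intersect() are base R.

```r
toy <- data.frame(
  key            = c("1", "2"),
  scientificName = c("Epidendrum radicans", "Epidendrum ciliare"),
  year           = c(2026, 2026),
  issues         = c("cdc", "cdc")
)

# 'elevation' is deliberately not a column of toy
wanted <- c("key", "scientificName", "year", "elevation")

# Which wanted columns are absent?
missing_cols <- setdiff(wanted, names(toy))
missing_cols
## [1] "elevation"

# Keep only the wanted columns that exist
keep <- intersect(wanted, names(toy))
toy_small <- toy[, keep]
dim(toy_small)
## [1] 2 3
```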

Filter only herbarium specimens: occurrence records come from different sources (basisOfRecord). For botanical collection data, we are often most interested in preserved specimens (physical herbarium sheets).

# Check which record types are present
unique(records$basisOfRecord)

## [1] "HUMAN_OBSERVATION"  "PRESERVED_SPECIMEN"

# Keep only preserved specimens, since they tend to have the most reliable identifications
preserved <- records[records$basisOfRecord == "PRESERVED_SPECIMEN", ]
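A logical condition can also be counted directly: sum() gives the number of TRUE values and mean() gives their proportion. The sketch below uses a made-up vector (basis) instead of the downloaded records, so the numbers are illustrative only.

```r
basis <- c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN", "PRESERVED_SPECIMEN",
           "HUMAN_OBSERVATION", "HUMAN_OBSERVATION")

is_preserved <- basis == "PRESERVED_SPECIMEN"

# Number of preserved specimens (TRUE counts as 1, FALSE as 0)
sum(is_preserved)
## [1] 2

# Percentage of all records that are preserved specimens
pct_preserved <- 100 * mean(is_preserved)
pct_preserved
## [1] 40
```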

Which species has the most records?

table() counts occurrences of each unique value. sort() with decreasing = TRUE puts the most common first.

species_counts <- table(records$scientificName)
sort(species_counts, decreasing = TRUE)[1:10]

## 
##         Epidendrum radicans Pav. ex Lindl. 
##                                        120 
##                 Epidendrum conopseum R.Br. 
##                                         80 
##                Epidendrum fimbriatum Kunth 
##                                         65 
##             Epidendrum amphistomum A.Rich. 
##                                         55 
##                   Epidendrum rigidum Jacq. 
##                                         52 
##                 Epidendrum nocturnum Jacq. 
##                                         46 
##           Epidendrum centropetalum Rchb.f. 
##                                         36 
## Epidendrum arachnoglossum Rchb.f. ex André 
##                                         32 
##                      Epidendrum ciliare L. 
##                                         31 
##           Epidendrum strobiliferum Rchb.f. 
##                                         30
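The same counting pattern works on any character vector. The sketch below uses a short made-up vector (obs) and takes the top entries with head(), which returns at most the requested number of entries and is therefore convenient when there are fewer than 10 distinct values. prop.table() is a base R function that converts counts to proportions.

```r
obs <- c("Epidendrum radicans", "Epidendrum ciliare", "Epidendrum radicans",
         "Epidendrum rigidum", "Epidendrum radicans", "Epidendrum ciliare")

counts <- sort(table(obs), decreasing = TRUE)

# The most frequent name and its count
names(counts)[1]
## [1] "Epidendrum radicans"
counts[[1]]
## [1] 3

# head() returns at most 10 entries; here there are only 3
top <- head(counts, 10)
length(top)
## [1] 3

# Counts as percentages of all records
as.numeric(round(100 * prop.table(counts), 1))
## [1] 50.0 33.3 16.7
```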

What is the earliest and latest collection year?

min(records$year, na.rm = TRUE)

## [1] 2026

max(records$year, na.rm = TRUE)

## [1] 2026

These are all recent records, but if your download spans more years, you now know how to filter for records from specific years.

Notice the na.rm = TRUE argument. It tells R to ignore missing values (NA). Without it, min() would return NA if any year is missing.

How many records are missing a year?

sum(is.na(records$year))

## [1] 0
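Since the downloaded records happen to have no missing years, the effect of na.rm is not visible above. The sketch below uses a made-up vector (years) that does contain an NA.

```r
years <- c(1987, NA, 2001, 1965)

# Without na.rm, a single NA makes the result NA
min(years)
## [1] NA

min(years, na.rm = TRUE)
## [1] 1965

# range() returns the minimum and the maximum in one call
range(years, na.rm = TRUE)
## [1] 1965 2001

# Number and proportion of missing values
sum(is.na(years))
## [1] 1
mean(is.na(years))
## [1] 0.25
```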

Filter records for a specific country

Country codes in GBIF follow the ISO 3166-1 alpha-2 standard (e.g., "DE" for Germany, "CO" for Colombia, "US" for the United States).

# Which countries are represented?
unique(records$countryCode)

##  [1] "CR" "PA" "CO" "MX" "PE" "US" "BR" "AU" "EC" "GP" "HN" "GF" "VI" "GT" NA  
## [16] "BQ" "PR" "GD" "BO" "DM" "KE" "ES" "BZ" "NZ" "SH" "NI" "MS" "MQ" "TZ" "KN"
## [31] "VC" "LC" "VE"

# Filter to Colombia
co_records <- records[records$countryCode == "CO", ]
nrow(co_records)

## [1] 190

head(co_records)

## # A tibble: 6 × 127
##   key        scientificName   decimalLatitude decimalLongitude issues datasetKey
##   <chr>      <chr>                      <dbl>            <dbl> <chr>  <chr>     
## 1 5938130699 Epidendrum fimb…            5.08            -75.4 cdc,c… 50c9509d-…
## 2 5938145759 Epidendrum fimb…            6.29            -75.5 cdc,c… 50c9509d-…
## 3 5938279268 Epidendrum fimb…            5.08            -75.4 cdc,c… 50c9509d-…
## 4 6129828395 Epidendrum fimb…            4.69            -75.6 cdc,c… 50c9509d-…
## 5 6129829570 Epidendrum fimb…            6.73            -75.7 cdc,c… 50c9509d-…
## 6 6129854781 Epidendrum fimb…            6.33            -75.6 cdc,c… 50c9509d-…
## # ℹ 121 more variables: publishingOrgKey <chr>, installationKey <chr>,
## #   hostingOrganizationKey <chr>, publishingCountry <chr>, protocol <chr>,
## #   lastCrawled <chr>, lastParsed <chr>, crawlId <int>, basisOfRecord <chr>,
## #   occurrenceStatus <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
## #   classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
## #   speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>,
## #   kingdom <chr>, phylum <chr>, order <chr>, family <chr>, genus <chr>, …

2.5.4 Tasks

Apply what you have learned today. Save your answers (code + short written explanations) in the R document and submit it on ILIAS at the specified date:

1. Download occurrence records from GBIF for a plant taxon of your choice (a family, genus, or species you find interesting). Use name_backbone() to find the taxon key and occ_search() to download at least 500 records. Save the raw data as a CSV in your data/ folder.

2. Inspect your dataset: how many rows and columns does it have? What are the column names? Which columns contain missing values, and how many?

3. Filter the dataset to preserved specimens only. How many records remain? What percentage of the original download are preserved specimens?

4. Answer the following questions using base R functions:

Which family (or genus, if you searched at genus level) has the most records?
What is the earliest and latest collection year in the preserved specimen subset?
How many records come from each country? Show the top 5.

5. Filter the preserved specimens to a single country of your choice and save the result as a CSV in your output/ folder.

6. Reflection (3–5 sentences): What was the most surprising thing you found in your GBIF data? Was there anything unexpected about the number of missing values, the time range of records, or the geographic distribution?

2.5.5 Extra 1: using GBIF API

Proper citation of GBIF data is important: it gives credit to the institutions and collectors who digitized the specimens, and it makes your analysis reproducible by pointing other researchers to the exact dataset you used.

Two ways to download, two ways to cite:

occ_search() is convenient for exploration, but it does not register a download on GBIF’s servers and therefore has no DOI. For publications and assignments where reproducibility matters, you should use occ_download() instead. This submits a formal request to GBIF, which processes the data, assigns a permanent DOI, and keeps the file available for download indefinitely.

occ_download() requires a free GBIF account. Register at gbif.org and then provide your credentials once per session:

The code below is commented out. Fill in your username, password, and email, then uncomment the lines and run them.

# Set your GBIF account credentials (do this once per session)
# options(
#   gbif_user  = "your_username",
#   gbif_pwd   = "your_password",
#   gbif_email = "your_email@example.com"
# )

Then submit the download request using predicates, which are filters that define exactly which records you want:

# Request a download for preserved specimens of Epidendrum
# (genusKey is the taxon key we stored in an object earlier)
# dl <- occ_download(
#   pred("taxonKey", genusKey),
#   pred("basisOfRecord", "PRESERVED_SPECIMEN")
# )

# See the other arguments in ?occ_download

GBIF processes the request on its servers (usually takes 1–10 minutes). Use occ_download_wait() to pause R until it is ready:

# Wait until the download is ready
# occ_download_wait(dl)

# Load the downloaded data into R
# records_dl <- occ_download_get(dl) |> occ_download_import()

Finding the DOI

Once the download is complete, retrieve its metadata to find the DOI:

# Get metadata for the download
# meta <- occ_download_meta(dl)

# Print the DOI
# meta$doi

The DOI looks something like 10.15468/dl.xxxxxx and links to a permanent page on GBIF where anyone can access the exact same dataset. The suggested citation is shown in your GBIF account under Downloads. It looks something like this:

GBIF.org (2024) GBIF Occurrence Download. https://doi.org/10.15468/dl.xxxxxx

For any formal submission (report, thesis, paper), always use occ_download() so your data has a citable DOI.
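If you want to keep the citation alongside your analysis, a small helper can assemble it from the DOI. This is a minimal sketch: format_gbif_citation() is a hypothetical function written for this tutorial, not part of rgbif, and it simply reproduces the citation format shown above. Always compare the result with the citation GBIF displays on the download page.

```r
# Hypothetical helper: builds a citation string in the format shown above
format_gbif_citation <- function(doi, year = format(Sys.Date(), "%Y")) {
  paste0("GBIF.org (", year, ") GBIF Occurrence Download. https://doi.org/", doi)
}

# Example with the placeholder DOI used in the text
citation_text <- format_gbif_citation("10.15468/dl.xxxxxx", year = 2024)
citation_text
```

In a real workflow you would pass the DOI returned in meta$doi instead of the placeholder.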

2.5.6 Extra 2: filtering native range (rWCVP package)

Plants of the World Online (POWO) provides native distribution data for all vascular plants. The rWCVP package gives direct access to this data in R, using the World Checklist of Vascular Plants (WCVP) as its backbone. We can use it to remove GBIF records that fall outside a taxon’s native range.

The key idea: wcvp_distribution() returns a taxon's range as spatial polygons (an sf object), and its arguments let us restrict that range to native regions. We convert our occurrence records to spatial points and keep only those that fall inside the native-range polygons.

# install.packages("remotes")
# library(remotes)
# remotes::install_github("matildabrown/rWCVP")
# library(rWCVP)
# remotes::install_github('matildabrown/rWCVPdata')
# library(rWCVPdata)
# install.packages("sf")
# library(sf)
# 
# # Explore distribution polygon for our genus from WCVP/POWO.
# distribution <- wcvp_distribution("Epidendrum", taxon_rank = "genus")
# 
# # global map
# wcvp_distribution_map(distribution)
# 
# # zoomed-in map
# wcvp_distribution_map(distribution, crop_map = TRUE)
# 
# # Get the native distribution only. Setting introduced = FALSE and extinct = FALSE keeps only the native range.
# native_range <- wcvp_distribution("Epidendrum", taxon_rank = "genus",
#                                   introduced = FALSE, extinct = FALSE,
#                                   location_doubtful = FALSE)
# 
# # Quick visual check of the native range
# wcvp_distribution_map(native_range)
# 
# # Convert GBIF records to spatial points (requires coordinates to be present)
# records_sf <- st_as_sf(
#   records[!is.na(records$decimalLongitude) & !is.na(records$decimalLatitude), ],
#   coords = c("decimalLongitude", "decimalLatitude"),
#   crs    = st_crs(4326)
# )
# 
# # Check which points fall inside the native range polygon.
# # st_union() merges all range polygons into one before intersecting.
# # The result is a logical vector: TRUE = point is inside the native range.
# records_sf$native <- st_intersects(records_sf, st_union(native_range),
#                                    sparse = FALSE)[, 1]
# 
# # Filter to native records only using base R subsetting
# native_records <- records_sf[records_sf$native, ]
# nrow(records)
# nrow(native_records)
# 
# # After this, we check the class of our dataset
# class(native_records)
# 
# # Your dataset now stores latitude and longitude in the geometry column.
# # To convert it back to a regular data frame, use st_drop_geometry()
# library("tidyverse") # this package will be explained next class
# 
# native_records <- native_records %>%
#   mutate(
#     decimalLongitude = st_coordinates(.)[, 1],
#     decimalLatitude = st_coordinates(.)[, 2]
#   ) %>%
#   st_drop_geometry()

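Note: The WCVP polygons use TDWG Level 3 regions (coarse political units), so coastal or border records may be flagged as outside the range even if they are biologically native. You can add a small buffer to be less strict. In recent sf versions (1.0 and later), which use the s2 engine for longitude/latitude data by default, the dist argument of st_buffer() is given in metres, so a buffer of roughly 1 km is st_buffer(st_union(native_range), dist = 1000). If s2 is switched off with sf_use_s2(FALSE), dist is in degrees, and 0.009 corresponds to about 1 km at the equator.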
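The point-in-polygon filter can be tried on a toy example before running it on real data. The sketch below assumes the sf package is installed. The square "range" and the two points are made up for illustration and have nothing to do with WCVP. It also shows a base R way to get the coordinates back into ordinary columns without the tidyverse.

```r
library(sf)

# Hypothetical range: a square from 0 to 10 degrees in longitude and latitude
range_poly <- st_sfc(
  st_polygon(list(rbind(c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0)))),
  crs = 4326
)

# Two made-up records: one inside the square, one outside
toy <- data.frame(
  id               = c("a", "b"),
  decimalLongitude = c(5, 20),
  decimalLatitude  = c(5, 20)
)
toy_sf <- st_as_sf(toy, coords = c("decimalLongitude", "decimalLatitude"),
                   crs = 4326)

# TRUE = point falls inside the polygon
toy_sf$native <- st_intersects(toy_sf, range_poly, sparse = FALSE)[, 1]
toy_sf$native

# Keep the inside points and convert back to a plain data frame
inside    <- toy_sf[toy_sf$native, ]
coords    <- st_coordinates(inside)        # matrix with columns X and Y
inside_df <- st_drop_geometry(inside)
inside_df$decimalLongitude <- coords[, "X"]
inside_df$decimalLatitude  <- coords[, "Y"]
class(inside_df)
```

Only record "a" is kept, and inside_df is an ordinary data frame again, so it can be passed to functions that expect longitude and latitude columns.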

2.5.7 Extra 3: perform geographic cleaning (CoordinateCleaner package)

Even after filtering to the native range, GBIF records often contain coordinate errors: points placed at country centroids, at biodiversity institution locations, at (0, 0), or with swapped latitude/longitude. The CoordinateCleaner package runs a battery of automated tests to flag these issues.

# # install.packages("CoordinateCleaner")
# library(CoordinateCleaner)
# 
# # Run coordinate cleaning tests on the records data frame.
# # Each test returns TRUE (clean) or FALSE (flagged).
# # .summary is TRUE only if ALL tests pass for that record.
# clean_flags <- clean_coordinates(
#   x       = native_records,
#   lon     = "decimalLongitude",
#   lat     = "decimalLatitude",
#   species = "species",
#   tests   = c("capitals",     # point at a national capital
#               "centroids",    # point at a country or province centroid
#               "equal",        # latitude == longitude
#               "gbif",         # point at GBIF headquarters
#               "institutions", # point at a herbarium or museum
#               "zeros")        # coordinates are exactly 0, 0
# )
# 
# # Inspect how many records were flagged by each test
# summary(clean_flags)
# 
# # Keep only records that passed all tests (.summary == TRUE)
# records_clean <- native_records[clean_flags$.summary, ]
# nrow(records_clean)
# nrow(native_records) - nrow(records_clean)  # number of records removed by cleaning
# 
# # Save the cleaned dataset as a CSV in the output folder
# write.csv(records_clean, "output/records_clean.csv", row.names = FALSE)

Tip: clean_coordinates() is conservative by default. It is a good idea to inspect flagged records before discarding them. You can plot the flagged vs. clean points to visually verify the results, or relax specific tests by removing them from the tests vector.
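To see what "inspect before discarding" looks like in practice, here is a base R sketch on made-up data. The two checks are simplified stand-ins written for this example, not CoordinateCleaner's implementation. The flag columns are named .zer, .equ, and .summary on the assumption that this matches the output of clean_coordinates(), so check names(clean_flags) on your own result.

```r
# Made-up records; column names follow the GBIF conventions used above
toy <- data.frame(
  species          = c("sp_a", "sp_b", "sp_c", "sp_d"),
  decimalLongitude = c(-78.5, 0, -75.2, 10),
  decimalLatitude  = c(-0.2, 0, 4.6, 10)
)

# Simplified stand-ins for two tests (TRUE = passes, FALSE = flagged)
flags <- data.frame(
  .zer = !(toy$decimalLongitude == 0 & toy$decimalLatitude == 0),
  .equ = abs(toy$decimalLongitude) != abs(toy$decimalLatitude)
)
flags$.summary <- flags$.zer & flags$.equ

# Number of records flagged by each test
colSums(!flags)

# Look at the flagged records before deciding whether to drop them
toy[!flags$.summary, ]

# Keep only records that passed every test
toy_clean <- toy[flags$.summary, ]
nrow(toy_clean)
```

Here the record at (0, 0) fails both checks and the record at (10, 10) fails only the equal-coordinates check, which is the kind of case worth looking at by hand before removing it.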