2.5 Introduction to R
R is a free, open-source programming language and environment designed for statistical computing and data visualization. It is one of the most widely used tools in the biological and ecological sciences, and has become a standard for biodiversity data analysis. Unlike spreadsheet software, R allows you to write reproducible analysis workflows, meaning that every step of your analysis is documented, repeatable, and transparent.
In this tutorial you will take your first steps in R: setting up a project, organizing your files, learning how to work with data objects, and performing a complete small analysis using real plant occurrence data from the Global Biodiversity Information Facility (GBIF). Everything we do here uses base R; no additional packages are needed until the final GBIF exercise.
2.5.1 Learning Objectives
After completing this exercise you will be able to:
- Create an R Project and a Quarto document, and explain the advantages of each
- Set up a reproducible folder structure from within R
- Navigate directories and work with basic R objects and data frames
- Import and inspect a CSV file using head(), str(), summary(), names(), and colnames()
- Subset and index data frames with brackets and logical conditions
- Apply unique() to explore categorical variables
- Save processed data as a CSV file
- Download GBIF occurrence records using the rgbif package and perform basic exploratory analysis
2.5.2 Recommended readings
- Chamberlain S, Barve V, Mcglinn D, Oldoni D, Desmet P, Geffert L & Ram K (2024) rgbif: Interface to the Global Biodiversity Information Facility API. R package. https://docs.ropensci.org/rgbif/
- GBIF (2024) What is GBIF? https://www.gbif.org/what-is-gbif
- Quarto (2024) Get Started. https://quarto.org/docs/get-started/
2.5.3 Tutorial
2.5.3.1 R Projects
2.5.3.1.1 What is an R Project?
An R Project (.Rproj file) is a container that links a working directory to a set of project-specific settings in RStudio. When you open an R Project, R automatically sets the working directory to the project folder. This makes file paths short, portable, and consistent across computers, which is a cornerstone of reproducible research.
Without an R Project, you need to manually call setwd("some/very/long/path/on/my/computer") at the top of every script. That path will break as soon as you move the folder or share it with a colleague.
2.5.3.1.2 Working directory and setwd()
The working directory is the folder on your computer where R looks for files and saves outputs by default. You can check it at any time with getwd(), and change it manually with setwd():
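A minimal sketch of the check described above. The object name current_dir is only an illustration, and the printed path will differ on every computer.

```r
# Ask R which folder it currently uses to look for files and save outputs
current_dir <- getwd()

# Typing the object name prints the path
current_dir
```

On the author's computer this call printed the path shown in the output below.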
## [1] "C:/Users/ploti/Dropbox/Work/Teaching/02_Biodiversity of plants/R_exercise/bsc-biodiv-der-pflanzen"
# Change it manually (example, no need to run this now)
# setwd("C:/Users/YourName/Documents/my_project")
The problem with setwd() is that the path is unique to your computer. If you send your script to a colleague or move the folder, the path breaks and the script stops working. For this reason, we will not use setwd() in this course. Instead, we use R Projects (see below), which set the working directory automatically and portably.
2.5.3.1.3 Creating an R Project
In RStudio:
- Go to File → New Project…
- Choose New Directory → New Project
- Give it a name (e.g., plants_are_fun) and choose a location
- Click Create Project
RStudio will restart with the working directory automatically set to the project folder; no setwd() is needed. You can confirm this with getwd():
## [1] "C:/Users/ploti/Dropbox/Work/Teaching/02_Biodiversity of plants/R_exercise/bsc-biodiv-der-pflanzen"
2.5.3.2 Setting up a project folder structure
Good data management starts with a clear folder structure. We will create three folders inside our project directory directly from R.
2.5.3.2.1 Creating folders with R
We will first create an R script file for our code. You can create it from the menu via File → New File → R Script, then name the file and save it. We will write our code in this file.
Let’s start with the first function:
A function call consists of a name (one word, or several words joined without spaces) followed by parentheses. The values inside the parentheses are the arguments. To check what a function does and read the explanation of its arguments, type ?function_name to open the help panel.
The function dir.create() creates a new folder. The argument showWarnings = FALSE silences the warning if the folder already exists.
#?dir.create
# Create project sub-folders
dir.create("data", showWarnings = FALSE)
dir.create("plots", showWarnings = FALSE)
dir.create("output", showWarnings = FALSE)
Your project should now look like this:
my_project/
├── data/ ← raw input data (CSV files, shapefiles, …)
├── plots/ ← figures and maps you produce
├── output/ ← processed data and result tables
└── my_cool_project.R
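The folder setup above can also be written as a loop and then verified. The sketch below is one possible variant, not the course's required approach: it works inside a temporary folder (tempdir()) so it does not touch your real project, and the names project_dir and sub_folders are made up for illustration.

```r
# Work in a temporary folder so the sketch leaves your project untouched
project_dir <- file.path(tempdir(), "demo_project")
dir.create(project_dir, showWarnings = FALSE)

# Create the three sub-folders in a loop
sub_folders <- c("data", "plots", "output")
for (folder in sub_folders) {
  dir.create(file.path(project_dir, folder), showWarnings = FALSE)
}

# Confirm that every folder exists (one TRUE per folder)
dir.exists(file.path(project_dir, sub_folders))

# List the contents of the project folder
list.files(project_dir)
```

Inside a real R Project you would drop project_dir and pass the folder names directly, as in the code above the folder tree.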
2.5.3.3 Basic R objects
Before loading real data, let us get comfortable with the basic building blocks of R.
2.5.3.3.1 Variables
You assign a value to a variable with the assignment operator <-.
# Assign a number
x <- 2026
# Assign a character string
my_name <- "Maria"
# Print by typing the variable name
x
## [1] 2026
my_name
## [1] "Maria"
2.5.3.3.2 Vectors
A vector is a sequence of values of the same type, created with c() (combine).
# A numeric vector
heights <- c(1.2, 0.8, 1.5, 1.1, 2.0)
# A character vector
species <- c("Viola odorata", "Primula vulgaris", "Ranunculus acris")
# Check the type
class(heights)
## [1] "numeric"
class(species)
## [1] "character"
# Check the length
length(heights)
## [1] 5
2.5.3.3.3 Data frames
A data frame is a table where each column is a vector. This is the most important data structure for data analysis in R.
# Create a small data frame manually
df <- data.frame(
species = c("Viola odorata", "Primula vulgaris", "Ranunculus acris"),
family = c("Violaceae", "Primulaceae", "Ranunculaceae"),
year = c(1987, 2001, 1965)
)
# Print it
df
## species family year
## 1 Viola odorata Violaceae 1987
## 2 Primula vulgaris Primulaceae 2001
## 3 Ranunculus acris Ranunculaceae 1965
2.5.3.4 Importing data
Real data usually comes as a CSV file. We load it with read.csv().
Note: read.csv() is base R. It expects comma-separated values and treats the first row as column names by default.
Download the file my_records.csv from ILIAS under the practice folder Introduction to R.
# Read a CSV file from the data/ folder
records <- read.csv("data/my_records.csv")
# We can remove unnecessary columns to make the data easier to handle
# This line uses indexing, which is explained in the section below
records <- records[, c("key", "taxonID", "family", "genus", "species", "scientificName",
"decimalLatitude", "decimalLongitude", "taxonRank", "basisOfRecord", "year", "country",
"countryCode", "stateProvince", "locality", "iucnRedListCategory", "individualCount",
"occurrenceStatus")]
2.5.3.4.1 Inspecting the data
Once loaded, always start by looking at the data before doing anything else.
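The four outputs that follow (first rows, last rows, structure, summary) match what head(), tail(), str(), and summary() print. The sketch below shows these calls on a tiny stand-in data frame so that it runs without the CSV file. The object name demo and the missing year are assumptions for illustration; in the tutorial you would pass records instead.

```r
# Tiny stand-in for the tutorial's `records` object (illustrative values only)
demo <- data.frame(
  species = c("Viola odorata", "Primula vulgaris", "Ranunculus acris"),
  family  = c("Violaceae", "Primulaceae", "Ranunculaceae"),
  year    = c(1987, 2001, NA)
)

head(demo, 2)   # first rows (the default is 6)
tail(demo, 2)   # last rows (the default is 6)
str(demo)       # number of rows and columns, column types, first values
summary(demo)   # per-column summaries, including the number of NAs
```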
## key taxonID family genus species
## 1 182368733 <NA> Rhamnaceae Frangula Frangula alnus
## 2 1291505203 <NA> Vireonidae Pteruthius Pteruthius aenobarbus
## 3 1826325281 <NA> Geometridae Agriopis Agriopis marginaria
## 4 2238275141 15086 Strophariaceae Hypholoma Hypholoma subericaeum
## 5 2973785847 <NA> Bufonidae Rhinella <NA>
## 6 2979435480 BOLD:ACR5517 Oonopidae <NA> <NA>
## scientificName decimalLatitude decimalLongitude
## 1 Frangula dodonei subsp. dodonei 47.66690 2.26047
## 2 Pteruthius aenobarbus (Temminck, 1836) NA NA
## 3 Agriopis marginaria (Fabricius) 52.93230 -6.23110
## 4 Hypholoma subericaeum (Fr.) Kühner 56.38486 9.89984
## 5 Rhinella Fitzinger, 1826 -20.44553 -41.87311
## 6 BOLD:ACR5517 10.76060 -85.33500
## taxonRank basisOfRecord year country
## 1 SUBSPECIES OBSERVATION 2026 France
## 2 SPECIES PRESERVED_SPECIMEN 2026 Lao People’s Democratic Republic
## 3 SPECIES PRESERVED_SPECIMEN 2026 Ireland
## 4 SPECIES PRESERVED_SPECIMEN 2026 Denmark
## 5 GENUS PRESERVED_SPECIMEN 2026 Brazil
## 6 UNRANKED MATERIAL_SAMPLE 2026 Costa Rica
## countryCode stateProvince locality
## 1 FR <NA> ISDES
## 2 LA <NA> Xieng-Khowang
## 3 IE Wicklow Township: Rathdrum
## 4 DK <NA> Langå Egeskov
## 5 BR Minas Gerais Parque Nacional do Caparaó
## 6 CR Guanacaste Province PL12-8
## iucnRedListCategory individualCount occurrenceStatus
## 1 <NA> NA PRESENT
## 2 LC NA PRESENT
## 3 <NA> 1 PRESENT
## 4 <NA> NA PRESENT
## 5 <NA> 1 PRESENT
## 6 <NA> NA PRESENT
## key taxonID family genus species scientificName
## 995 3442797880 BOLD:ACD6520 Ichneumonidae Stenomacrus <NA> BOLD:ACD6520
## 996 3442797881 BOLD:AEJ2062 Chironomidae <NA> <NA> BOLD:AEJ2062
## 997 3442797882 BOLD:ACT2121 Psychodidae <NA> <NA> BOLD:ACT2121
## 998 3442797883 BOLD:ACT2121 Psychodidae <NA> <NA> BOLD:ACT2121
## 999 3442797884 BOLD:ACE1029 Sciaridae Pseudosciara <NA> BOLD:ACE1029
## 1000 3442798206 BOLD:ACR5891 Bostrichidae <NA> <NA> BOLD:ACR5891
## decimalLatitude decimalLongitude taxonRank basisOfRecord year country
## 995 10.7606 -85.3350 UNRANKED MATERIAL_SAMPLE 2026 Costa Rica
## 996 10.7606 -85.3350 UNRANKED MATERIAL_SAMPLE 2026 Costa Rica
## 997 10.7606 -85.3350 UNRANKED MATERIAL_SAMPLE 2026 Costa Rica
## 998 10.7606 -85.3350 UNRANKED MATERIAL_SAMPLE 2026 Costa Rica
## 999 10.7606 -85.3350 UNRANKED MATERIAL_SAMPLE 2026 Costa Rica
## 1000 10.7634 -85.3347 UNRANKED MATERIAL_SAMPLE 2026 Costa Rica
## countryCode stateProvince locality iucnRedListCategory
## 995 CR Guanacaste Province PL12-8 <NA>
## 996 CR Guanacaste Province PL12-8 <NA>
## 997 CR Guanacaste Province PL12-8 <NA>
## 998 CR Guanacaste Province PL12-8 <NA>
## 999 CR Guanacaste Province PL12-8 <NA>
## 1000 CR Guanacaste Province PL12-2 <NA>
## individualCount occurrenceStatus
## 995 NA PRESENT
## 996 NA PRESENT
## 997 NA PRESENT
## 998 NA PRESENT
## 999 NA PRESENT
## 1000 NA PRESENT
## 'data.frame': 1000 obs. of 18 variables:
## $ key : num 1.82e+08 1.29e+09 1.83e+09 2.24e+09 2.97e+09 ...
## $ taxonID : chr NA NA NA "15086" ...
## $ family : chr "Rhamnaceae" "Vireonidae" "Geometridae" "Strophariaceae" ...
## $ genus : chr "Frangula" "Pteruthius" "Agriopis" "Hypholoma" ...
## $ species : chr "Frangula alnus" "Pteruthius aenobarbus" "Agriopis marginaria" "Hypholoma subericaeum" ...
## $ scientificName : chr "Frangula dodonei subsp. dodonei" "Pteruthius aenobarbus (Temminck, 1836)" "Agriopis marginaria (Fabricius)" "Hypholoma subericaeum (Fr.) Kühner" ...
## $ decimalLatitude : num 47.7 NA 52.9 56.4 -20.4 ...
## $ decimalLongitude : num 2.26 NA -6.23 9.9 -41.87 ...
## $ taxonRank : chr "SUBSPECIES" "SPECIES" "SPECIES" "SPECIES" ...
## $ basisOfRecord : chr "OBSERVATION" "PRESERVED_SPECIMEN" "PRESERVED_SPECIMEN" "PRESERVED_SPECIMEN" ...
## $ year : int 2026 2026 2026 2026 2026 2026 2026 2026 2026 2026 ...
## $ country : chr "France" "Lao People’s Democratic Republic" "Ireland" "Denmark" ...
## $ countryCode : chr "FR" "LA" "IE" "DK" ...
## $ stateProvince : chr NA NA "Wicklow" NA ...
## $ locality : chr "ISDES" "Xieng-Khowang" "Township: Rathdrum" "Langå Egeskov" ...
## $ iucnRedListCategory: chr NA "LC" NA NA ...
## $ individualCount : int NA NA 1 NA 1 NA NA NA NA NA ...
## $ occurrenceStatus : chr "PRESENT" "PRESENT" "PRESENT" "PRESENT" ...
## key taxonID family genus
## Min. :1.824e+08 Length:1000 Length:1000 Length:1000
## 1st Qu.:3.442e+09 Class :character Class :character Class :character
## Median :3.442e+09 Mode :character Mode :character Mode :character
## Mean :3.422e+09
## 3rd Qu.:3.442e+09
## Max. :3.443e+09
##
## species scientificName decimalLatitude decimalLongitude
## Length:1000 Length:1000 Min. :-22.89 Min. :-85.34
## Class :character Class :character 1st Qu.: 10.76 1st Qu.:-85.33
## Mode :character Mode :character Median : 10.76 Median :-85.33
## Mean : 10.83 Mean :-84.83
## 3rd Qu.: 10.76 3rd Qu.:-85.33
## Max. : 56.38 Max. : 73.75
## NA's :1 NA's :1
## taxonRank basisOfRecord year country
## Length:1000 Length:1000 Min. :2026 Length:1000
## Class :character Class :character 1st Qu.:2026 Class :character
## Mode :character Mode :character Median :2026 Mode :character
## Mean :2026
## 3rd Qu.:2026
## Max. :2026
##
## countryCode stateProvince locality iucnRedListCategory
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## individualCount occurrenceStatus
## Min. :1 Length:1000
## 1st Qu.:1 Class :character
## Median :1 Mode :character
## Mean :1
## 3rd Qu.:1
## Max. :1
## NA's :998
str() is particularly useful: it tells you at a glance how many rows and columns you have, what type each column is (numeric, character, factor), and what the first few values look like.
2.5.3.4.2 Column names
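The output below lists the column names twice, followed by the number of rows, the number of columns, and both dimensions together. These are presumably the results of names(), colnames(), nrow(), ncol(), and dim(). A minimal sketch on a stand-in data frame (demo is a made-up name; use records in the tutorial):

```r
# Stand-in data frame (illustrative values only)
demo <- data.frame(
  species = c("Viola odorata", "Primula vulgaris", "Ranunculus acris"),
  family  = c("Violaceae", "Primulaceae", "Ranunculaceae"),
  year    = c(1987, 2001, 1965)
)

names(demo)      # column names
colnames(demo)   # same result for a data frame
nrow(demo)       # number of rows
ncol(demo)       # number of columns
dim(demo)        # rows and columns together
```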
## [1] "key" "taxonID" "family"
## [4] "genus" "species" "scientificName"
## [7] "decimalLatitude" "decimalLongitude" "taxonRank"
## [10] "basisOfRecord" "year" "country"
## [13] "countryCode" "stateProvince" "locality"
## [16] "iucnRedListCategory" "individualCount" "occurrenceStatus"
## [1] "key" "taxonID" "family"
## [4] "genus" "species" "scientificName"
## [7] "decimalLatitude" "decimalLongitude" "taxonRank"
## [10] "basisOfRecord" "year" "country"
## [13] "countryCode" "stateProvince" "locality"
## [16] "iucnRedListCategory" "individualCount" "occurrenceStatus"
## [1] 1000
## [1] 18
## [1] 1000 18
2.5.3.5 Indexing and subsetting
2.5.3.5.1 Bracket notation: [row, column]
Data frames are indexed with [rows, columns]. Leaving a position blank means “all”.
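The outputs below show, in order, the first row, the first three rows, a single cell selected by position, and a single cell selected by column name. A sketch of the same indexing patterns on a stand-in data frame (demo is a made-up name; the tutorial applies them to records):

```r
# Stand-in data frame (illustrative values only)
demo <- data.frame(
  species = c("Viola odorata", "Primula vulgaris", "Ranunculus acris"),
  family  = c("Violaceae", "Primulaceae", "Ranunculaceae"),
  year    = c(1987, 2001, 1965)
)

demo[1, ]                      # row 1, all columns
demo[1:2, ]                    # rows 1 to 2, all columns
demo[1, 2]                     # row 1, column 2
demo[1, "family"]              # row 1, column selected by name
demo[, c("species", "year")]   # all rows, two columns selected by name
```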
## key taxonID family genus species
## 1 182368733 <NA> Rhamnaceae Frangula Frangula alnus
## scientificName decimalLatitude decimalLongitude taxonRank
## 1 Frangula dodonei subsp. dodonei 47.6669 2.26047 SUBSPECIES
## basisOfRecord year country countryCode stateProvince locality
## 1 OBSERVATION 2026 France FR <NA> ISDES
## iucnRedListCategory individualCount occurrenceStatus
## 1 <NA> NA PRESENT
## key taxonID family genus species
## 1 182368733 <NA> Rhamnaceae Frangula Frangula alnus
## 2 1291505203 <NA> Vireonidae Pteruthius Pteruthius aenobarbus
## 3 1826325281 <NA> Geometridae Agriopis Agriopis marginaria
## scientificName decimalLatitude decimalLongitude
## 1 Frangula dodonei subsp. dodonei 47.6669 2.26047
## 2 Pteruthius aenobarbus (Temminck, 1836) NA NA
## 3 Agriopis marginaria (Fabricius) 52.9323 -6.23110
## taxonRank basisOfRecord year country
## 1 SUBSPECIES OBSERVATION 2026 France
## 2 SPECIES PRESERVED_SPECIMEN 2026 Lao People’s Democratic Republic
## 3 SPECIES PRESERVED_SPECIMEN 2026 Ireland
## countryCode stateProvince locality iucnRedListCategory
## 1 FR <NA> ISDES <NA>
## 2 LA <NA> Xieng-Khowang LC
## 3 IE Wicklow Township: Rathdrum <NA>
## individualCount occurrenceStatus
## 1 NA PRESENT
## 2 NA PRESENT
## 3 1 PRESENT
## [1] NA
## [1] "Rhamnaceae"
2.5.3.5.2 The $ operator
Use $ to access a single column by name. The result is a vector.
# Access the species column (defined in DarwinCore terminology as 'scientificName'), first 10 rows
records$scientificName[1:10]
## [1] "Frangula dodonei subsp. dodonei"
## [2] "Pteruthius aenobarbus (Temminck, 1836)"
## [3] "Agriopis marginaria (Fabricius)"
## [4] "Hypholoma subericaeum (Fr.) Kühner"
## [5] "Rhinella Fitzinger, 1826"
## [6] "BOLD:ACR5517"
## [7] "BOLD:ACD1886"
## [8] "BOLD:ACR5517"
## [9] "BOLD:AEE2514"
## [10] "BOLD:ACR5517"
# Access the year column, first 10 rows
records$year[1:10]
## [1] 2026 2026 2026 2026 2026 2026 2026 2026 2026 2026
This is a messy dataset, which is why we must always clean the data before analysis.
2.5.3.5.3 Unique values
unique() returns the distinct values in a vector (useful for exploring categorical columns).
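A small sketch of unique() on a made-up vector; the outputs below come from applying the same idea to columns of records such as country and basisOfRecord. The vector countries is hypothetical.

```r
# A vector with repeated values (illustrative)
countries <- c("Brazil", "Costa Rica", "Brazil", "France", "Costa Rica")

# Distinct values, in order of first appearance
unique(countries)

# Number of distinct values
length(unique(countries))
```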
## [1] "France" "Lao People’s Democratic Republic"
## [3] "Ireland" "Denmark"
## [5] "Brazil" "Costa Rica"
## [7] "India"
## [1] "OBSERVATION" "PRESERVED_SPECIMEN" "MATERIAL_SAMPLE"
## [4] "HUMAN_OBSERVATION"
2.5.3.5.4 Logical subsetting (filtering)
You can filter rows by putting a logical condition in the row position.
# Keep only rows where country is "Brazil"
brazil <- records[records$country == "Brazil", ]
# Keep only rows from 2000 onwards
recent <- records[records$year >= 2000, ]
# Combine conditions with & (AND) or | (OR)
recent_brazil <- records[records$year >= 2000 & records$country == "Brazil", ]
2.5.3.6 Saving data
2.5.3.6.1 Save a data frame as CSV
Use write.csv() to write a data frame to disk. Setting row.names = FALSE prevents R from adding a redundant row-number column.
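A minimal sketch of writing and re-reading a CSV file. It uses a temporary file and a stand-in data frame so that it runs anywhere; demo and out_file are made-up names.

```r
# Stand-in data frame (illustrative values only)
demo <- data.frame(
  species = c("Viola odorata", "Primula vulgaris", "Ranunculus acris"),
  year    = c(1987, 2001, 1965)
)

# Write to a temporary file; row.names = FALSE avoids an extra index column
out_file <- file.path(tempdir(), "demo_records.csv")
write.csv(demo, out_file, row.names = FALSE)

# Read it back to check that the columns survived the round trip
reloaded <- read.csv(out_file)
names(reloaded)
```

In the project you would write into the output/ folder instead, for example write.csv(brazil, "output/Brazil_records.csv", row.names = FALSE), where the file name is an assumption.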
2.5.3.6.2 Save and load R objects
To save R objects (e.g., a processed data frame) in R’s native binary format, use save(). This preserves all data types exactly and is faster to reload than CSV.
# Save one or more objects
save(brazil, file = "output/Brazil_records.RData")
# Reload them in a future session
load("output/Brazil_records.RData")
2.5.3.7 The use of packages and libraries: downloading GBIF data with rgbif
Last class we learned what GBIF is and what type of data it gathers. We can download the data directly from the website, but the R package rgbif provides direct access to the GBIF API.
We first need to install the package (once, with install.packages("rgbif")) and then load it in every session with library(rgbif):
## Warning: package 'rgbif' was built under R version 4.5.2
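Find the taxon key: GBIF identifies taxa with numeric keys. Use name_backbone() to look up the key for a name of interest. Here we look up the orchid genus Epidendrum and store the result in taxon_info, for example with taxon_info <- name_backbone(name = "Epidendrum", rank = "genus").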
## # A tibble: 1 × 26
## usageKey scientificName canonicalName authorship rank status type
## * <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 5297341 Epidendrum L. Epidendrum L. GENUS ACCEPTED SCIENTIFIC
## # ℹ 19 more variables: formattedName <chr>, matchType <chr>, confidence <int>,
## # timeTaken <int>, kingdom <chr>, phylum <chr>, class <chr>, order <chr>,
## # family <chr>, genus <chr>, kingdomKey <chr>, phylumKey <chr>,
## # classKey <chr>, orderKey <chr>, familyKey <chr>, genusKey <chr>,
## # value <lgl>, verbatim_name <chr>, verbatim_rank <chr>
# usageKey is the key of the matched taxon (here the genus); the result also holds the keys of the higher ranks (familyKey, orderKey, etc.)
# Extract the key
genusKey <- taxon_info$genusKey
genusKey
## [1] "5297341"
Download occurrence records: occ_search() queries the GBIF occurrence API. We set a limit to cap the number of records returned (the maximum is 100,000 per call; for larger datasets use occ_download()).
# Download up to 1000 records
gbif_raw <- occ_search(
taxonKey = genusKey,
limit = 1000
)
# The actual data frame is stored in the $data element
records <- gbif_raw$data
First look at the data:
head(records)
## # A tibble: 6 × 127
## key scientificName decimalLatitude decimalLongitude issues datasetKey
## <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 5938085880 Epidendrum radi… 9.84 -83.9 cdc,c… 50c9509d-…
## 2 5938101363 Epidendrum cent… 10.3 -84.8 cdc,c… 50c9509d-…
## 3 5938125049 Epidendrum radi… 8.78 -82.4 cdc,c… 50c9509d-…
## 4 5938130699 Epidendrum fimb… 5.08 -75.4 cdc,c… 50c9509d-…
## 5 5938145759 Epidendrum fimb… 6.29 -75.5 cdc,c… 50c9509d-…
## 6 5938212569 Epidendrum park… 16.0 -96.5 cdc,c… 50c9509d-…
## # ℹ 121 more variables: publishingOrgKey <chr>, installationKey <chr>,
## # hostingOrganizationKey <chr>, publishingCountry <chr>, protocol <chr>,
## # lastCrawled <chr>, lastParsed <chr>, crawlId <int>, basisOfRecord <chr>,
## # occurrenceStatus <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
## # classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
## # speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>,
## # kingdom <chr>, phylum <chr>, order <chr>, family <chr>, genus <chr>, …
# The data has many columns; use the indexing brackets shown above to keep only the ones you need
dim(records)
## [1] 1000 127
Filter only herbarium specimens: occurrence records come from different sources (basisOfRecord). For botanical collection data, we are often most interested in preserved specimens (physical herbarium sheets).
unique(records$basisOfRecord)
## [1] "HUMAN_OBSERVATION" "PRESERVED_SPECIMEN"
# We should keep only preserved specimens since they have the most reliable identification
preserved <- records[records$basisOfRecord == "PRESERVED_SPECIMEN", ]
Which species has the most records?
table() counts occurrences of each unique value. sort() with decreasing = TRUE puts the most common first.
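A sketch of the counting pattern on a made-up vector of names. The output below is presumably the result of the same pattern applied to the scientificName column of preserved and limited to the top ten. The object names are assumptions.

```r
# A vector of names with repeats (illustrative)
sci_names <- c("Epidendrum radicans", "Epidendrum fimbriatum",
               "Epidendrum radicans", "Epidendrum rigidum",
               "Epidendrum radicans", "Epidendrum fimbriatum")

# Count how often each name occurs
counts <- table(sci_names)

# Most frequent first
sorted_counts <- sort(counts, decreasing = TRUE)

# Show only the top entries
head(sorted_counts, 2)

# Name of the most frequent entry
names(sorted_counts)[1]
```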
##
## Epidendrum radicans Pav. ex Lindl.
## 120
## Epidendrum conopseum R.Br.
## 80
## Epidendrum fimbriatum Kunth
## 65
## Epidendrum amphistomum A.Rich.
## 55
## Epidendrum rigidum Jacq.
## 52
## Epidendrum nocturnum Jacq.
## 46
## Epidendrum centropetalum Rchb.f.
## 36
## Epidendrum arachnoglossum Rchb.f. ex André
## 32
## Epidendrum ciliare L.
## 31
## Epidendrum strobiliferum Rchb.f.
## 30
What is the earliest and latest collection year?
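A sketch of how min() and max() behave with missing values, using a made-up vector of years. The two values printed below are presumably the result of the same calls on the year column of preserved.

```r
# Years with one missing value (illustrative)
years <- c(1987, 2001, NA, 1965)

# Without na.rm the result is NA as soon as one value is missing
min(years)

# With na.rm = TRUE the missing value is ignored
min(years, na.rm = TRUE)
max(years, na.rm = TRUE)

# Both at once
range(years, na.rm = TRUE)
```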
## [1] 2026
## [1] 2026
These are all recent records, but if your dataset spans more years, you now know how to filter for records from specific years.
Notice the na.rm = TRUE argument. It tells R to ignore missing values (NA). Without it, min() would return NA if any year is missing.
How many records are missing a year?
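A sketch of counting missing values. is.na() returns TRUE for each missing entry, and sum() counts the TRUEs. The stand-in data frame demo is made up; in the tutorial the column would be preserved$year.

```r
# Stand-in data frame with one missing year (illustrative values only)
demo <- data.frame(
  species = c("Viola odorata", "Primula vulgaris", "Ranunculus acris"),
  year    = c(1987, NA, 1965)
)

# Number of missing values in one column
missing_years <- sum(is.na(demo$year))
missing_years

# Number of missing values in every column at once
missing_per_column <- colSums(is.na(demo))
missing_per_column
```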
## [1] 0
Filter records for a specific country
Country codes in GBIF follow the ISO 3166-1 alpha-2 standard (e.g., "DE" for Germany, "CO" for Colombia, "US" for the United States).
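A sketch of filtering by country code on a stand-in data frame. The pairing of species and codes is made up, and demo and colombia are illustrative names. Because the code column can contain NA (as the output below shows), the sketch also demonstrates one way to keep NA rows out of the result; this is a suggestion, not necessarily what the original code did.

```r
# Stand-in data frame; the last record has no country code (illustrative values)
demo <- data.frame(
  species     = c("Epidendrum fimbriatum", "Epidendrum radicans",
                  "Epidendrum fimbriatum", "Epidendrum rigidum"),
  countryCode = c("CO", "CR", "CO", NA)
)

# Which country codes occur?
unique(demo$countryCode)

# A plain logical filter keeps an all-NA row for every NA in the condition
with_na_rows <- demo[demo$countryCode == "CO", ]
nrow(with_na_rows)

# which() drops the NAs and keeps only the true matches
colombia <- demo[which(demo$countryCode == "CO"), ]
nrow(colombia)
head(colombia)
```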
## [1] "CR" "PA" "CO" "MX" "PE" "US" "BR" "AU" "EC" "GP" "HN" "GF" "VI" "GT" NA
## [16] "BQ" "PR" "GD" "BO" "DM" "KE" "ES" "BZ" "NZ" "SH" "NI" "MS" "MQ" "TZ" "KN"
## [31] "VC" "LC" "VE"
## [1] 190
## # A tibble: 6 × 127
## key scientificName decimalLatitude decimalLongitude issues datasetKey
## <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 5938130699 Epidendrum fimb… 5.08 -75.4 cdc,c… 50c9509d-…
## 2 5938145759 Epidendrum fimb… 6.29 -75.5 cdc,c… 50c9509d-…
## 3 5938279268 Epidendrum fimb… 5.08 -75.4 cdc,c… 50c9509d-…
## 4 6129828395 Epidendrum fimb… 4.69 -75.6 cdc,c… 50c9509d-…
## 5 6129829570 Epidendrum fimb… 6.73 -75.7 cdc,c… 50c9509d-…
## 6 6129854781 Epidendrum fimb… 6.33 -75.6 cdc,c… 50c9509d-…
## # ℹ 121 more variables: publishingOrgKey <chr>, installationKey <chr>,
## # hostingOrganizationKey <chr>, publishingCountry <chr>, protocol <chr>,
## # lastCrawled <chr>, lastParsed <chr>, crawlId <int>, basisOfRecord <chr>,
## # occurrenceStatus <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
## # classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
## # speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>,
## # kingdom <chr>, phylum <chr>, order <chr>, family <chr>, genus <chr>, …
2.5.4 Tasks
Apply what you have learned today. Save your answers (code + short written explanations) in the R document and submit it on ILIAS at the specified date:
1. Download occurrence records from GBIF for a plant taxon of your choice (a family, genus, or species you find interesting). Use name_backbone() to find the taxon key and occ_search() to download at least 500 records. Save the raw data as a CSV in your data/ folder.
2. Inspect your dataset: how many rows and columns does it have? What are the column names? Which columns contain missing values, and how many?
3. Filter the dataset to preserved specimens only. How many records remain? What percentage of the original download are preserved specimens?
4. Answer the following questions using base R functions:
   - Which family (or genus, if you searched at genus level) has the most records?
   - What is the earliest and latest collection year in the preserved specimen subset?
   - How many records come from each country? Show the top 5.
5. Filter the preserved specimens to a single country of your choice and save the result as a CSV in your output/ folder.
6. Reflection (3–5 sentences): What was the most surprising thing you found in your GBIF data? Was there anything unexpected about the number of missing values, the time range of records, or the geographic distribution?
2.5.5 Extra 1: using GBIF API
Proper citation of GBIF data is important: it gives credit to the institutions and collectors who digitized the specimens, and it makes your analysis reproducible by pointing other researchers to the exact dataset you used.
Two ways to download, two ways to cite:
occ_search() is convenient for exploration, but it does not register a download on GBIF’s servers and therefore has no DOI. For publications and assignments where reproducibility matters, you should use occ_download() instead. This submits a formal request to GBIF, which processes the data, assigns a permanent DOI, and keeps the file available for download indefinitely.
occ_download() requires a free GBIF account. Register at gbif.org and then provide your credentials once per session:
The code below is commented out. Add your username, password, and email, then uncomment the lines and run them.
# Set your GBIF account credentials (do this once per session)
# options(
# gbif_user = "your_username",
# gbif_pwd = "your_password",
# gbif_email = "your_email@example.com"
# )
Then submit the download request using predicates, filters that define exactly which records you want:
# Request a download for preserved specimens of Epidendrum (genusKey is the taxon key we saved earlier)
# dl <- occ_download(
#   pred("taxonKey", genusKey),
#   pred("basisOfRecord", "PRESERVED_SPECIMEN")
# )
# See the other arguments in ?occ_download
GBIF processes the request on its servers (this usually takes 1–10 minutes). Use occ_download_wait() to pause R until it is ready:
# Wait until the download is ready
# occ_download_wait(dl)
# Load the downloaded data into R
# records_dl <- occ_download_get(dl) |> occ_download_import()
Finding the DOI
Once the download is complete, retrieve its metadata, for example with occ_download_meta(dl), to find the DOI.
The DOI looks something like 10.15468/dl.xxxxxx and links to a permanent page on GBIF where anyone can access the exact same dataset. You can also find the citation in your GBIF account under Downloads. It looks something like this:
GBIF.org (2024) GBIF Occurrence Download. https://doi.org/10.15468/dl.xxxxxx
For any formal submission (report, thesis, paper), always use occ_download() so your data has a citable DOI.
2.5.6 Extra 2: filtering native range (rWCVP package)
Plants of the World Online (POWO) provides native distribution data for all vascular plants. The rWCVP package gives direct access to this data in R, using the World Checklist of Vascular Plants (WCVP) as its backbone. We can use it to remove GBIF records that fall outside a taxon’s native range.
The key idea: wcvp_distribution() returns the native range as a spatial polygon (sf object). We convert our occurrence records to spatial points and keep only those that fall inside the polygon.
# install.packages("remotes")
# library(remotes)
# remotes::install_github("matildabrown/rWCVP")
# library(rWCVP)
# remotes::install_github('matildabrown/rWCVPdata')
# library(rWCVPdata)
# install.packages("sf")
# library(sf)
#
# # Explore distribution polygon for our genus from WCVP/POWO.
# distribution <- wcvp_distribution("Epidendrum", taxon_rank = "genus")
#
# # global map
# wcvp_distribution_map(distribution)
#
# # zoomed-in map
# wcvp_distribution_map(distribution, crop_map = TRUE)
#
# # Get the native distribution only. Setting introduced = FALSE and extinct = FALSE keeps only the native range.
# native_range <- wcvp_distribution("Epidendrum", taxon_rank = "genus",
# introduced = FALSE, extinct = FALSE,
# location_doubtful = FALSE)
#
# # Quick visual check of the native range
# wcvp_distribution_map(native_range)
#
# # Convert GBIF records to spatial points (requires coordinates to be present)
# records_sf <- st_as_sf(
# records[!is.na(records$decimalLongitude) & !is.na(records$decimalLatitude), ],
# coords = c("decimalLongitude", "decimalLatitude"),
# crs = st_crs(4326)
# )
#
# # Check which points fall inside the native range polygon.
# # st_union() merges all range polygons into one before intersecting.
# # The result is a logical vector: TRUE = point is inside the native range.
# records_sf$native <- st_intersects(records_sf, st_union(native_range),
# sparse = FALSE)[, 1]
#
# # Filter to native records only using base R subsetting
# native_records <- records_sf[records_sf$native, ]
# nrow(records)
# nrow(native_records)
#
# # After this, we check the class of our dataset
# class(native_records)
#
# # Your dataset now stores the latitude and longitude in the geometry column
# # To convert it back to a regular data frame, use st_drop_geometry()
# library("tidyverse") # this package is going to be explained next class
#
# native_records <- native_records %>%
# mutate(
# decimalLongitude = st_coordinates(.)[, 1],
# decimalLatitude = st_coordinates(.)[, 2]
# ) %>%
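#   st_drop_geometry()
Note: The WCVP polygons use TDWG Level 3 regions (coarse political units), so coastal or border records may be flagged as outside the range even if they are biologically native. To be less strict, you can add a small buffer with st_buffer(st_union(native_range), 0.009) (≈ 1 km at the equator when the distance is read in degrees). Check ?st_buffer for the units that apply to your setup: recent sf versions with s2 enabled interpret the distance for longitude/latitude data in metres.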
2.5.7 Extra 3: perform geographic cleaning (CoordinateCleaner package)
Even after filtering to the native range, GBIF records often contain coordinate errors: points placed at country centroids, at biodiversity institution locations, at (0, 0), or with swapped latitude/longitude. The CoordinateCleaner package runs a battery of automated tests to flag these issues.
# # install.packages("CoordinateCleaner")
# library(CoordinateCleaner)
#
# # Run coordinate cleaning tests on the records data frame.
# # Each test returns TRUE (clean) or FALSE (flagged).
# # .summary is TRUE only if ALL tests pass for that record.
# clean_flags <- clean_coordinates(
# x = native_records,
# lon = "decimalLongitude",
# lat = "decimalLatitude",
# species = "species",
# tests = c("capitals", # point at a national capital
# "centroids", # point at a country or province centroid
# "equal", # latitude == longitude
# "gbif", # point at GBIF headquarters
# "institutions", # point at a herbarium or museum
# "zeros") # coordinates are exactly 0, 0
# )
#
# # Inspect how many records were flagged by each test
# summary(clean_flags)
#
# # Keep only records that passed all tests (.summary == TRUE)
# records_clean <- native_records[clean_flags$.summary, ]
# nrow(records_clean)
# nrow(records) - nrow(records_clean) # number of records removed
#
# # Save the cleaned dataset as a CSV in the output folder
# write.csv(records_clean, "output/records_clean.csv", row.names = FALSE)
Tip: clean_coordinates() is conservative by default. It is a good idea to inspect flagged records before discarding them. You can plot the flagged vs. clean points to visually verify the results, or relax specific tests by removing them from the tests vector.