2.4 Big data in biodiversity research

2.4.1 Background

Advances in data collection, storage, and digital transfer have transformed biodiversity research into a big data science. What once required decades of fieldwork and manual cataloguing can now be explored in seconds through open-access databases containing billions of records. These resources compile occurrence data, taxonomic knowledge, functional traits, and genetic sequences — and they are freely available to anyone with an internet connection.

Biodiversity databases are not just repositories: they are living scientific infrastructure, continuously updated by researchers, institutions, herbaria, natural history museums, and citizen scientists around the world. Understanding what each database contains, what it was built for, and where its limitations lie is an essential skill for any modern biologist.

2.4.2 Learning objectives

By the end of this session, you will be able to:

  1. Describe how biodiversity research has evolved into a “big data” science.
  2. Identify the major types of open-access biodiversity databases and what kind of data each holds.
  3. Independently navigate four major databases: GBIF, POWO, TRY, and GenBank.
  4. Critically evaluate the availability and quality of data for a plant family of your choice.

2.4.3 Required preparation

None, but bring curiosity! Before the session starts, choose a plant genus or family that interests you. It must contain between 20 and 300 species.
(Not sure? Try: Orchidaceae, Cactaceae, Bromeliaceae, Poaceae, Fabaceae, or Araceae)

2.4.4 Part I: Occurrence & distribution repositories

What are they?

Occurrence repositories compile records of where and when a species was observed or collected. Each record typically includes a species name, geographic coordinates, date, and information about who recorded it and how (field observation, herbarium specimen, camera trap, etc.).
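
To make the structure of an occurrence record concrete, here is a minimal Python sketch. The class name, field names, and example values are illustrative assumptions, not the schema of any particular database.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class OccurrenceRecord:
    """One observation or collection event (hypothetical field names)."""
    species: str
    latitude: float       # decimal degrees, north positive
    longitude: float      # decimal degrees, east positive
    observed_on: date
    recorded_by: str
    basis: str            # e.g. "field observation", "herbarium specimen"

    def has_valid_coordinates(self) -> bool:
        """Check that the coordinates fall inside the possible range."""
        return -90 <= self.latitude <= 90 and -180 <= self.longitude <= 180


# A made-up record, for illustration only.
example = OccurrenceRecord(
    species="Vanilla planifolia",
    latitude=4.6,
    longitude=-74.1,
    observed_on=date(2021, 3, 14),
    recorded_by="A. Collector",
    basis="herbarium specimen",
)
print(example.has_valid_coordinates())
```

A range check like this is only the most basic quality test: a record can have plausible coordinates and still be in the wrong place.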

These databases are the backbone of macroecological research: they are used to model species distributions, track range shifts under climate change, identify biodiversity hotspots, and guide conservation planning.

Major databases of this type are:

| Database | Focus | Records | URL |
|---|---|---|---|
| GBIF | All life on Earth | ~3.7 billion | gbif.org |
| iNaturalist | Citizen science observations | ~200 million | inaturalist.org |
| BIEN | Plants of the Americas | ~200 million | biendata.org |
| Paleobiology DB | Fossil occurrences | ~1.5 million | paleobiodb.org |

Note: iNaturalist publishes its research-grade observations to GBIF, so when you explore GBIF you are already seeing iNaturalist observations alongside museum specimens and research surveys.

🌍 GBIF — Global Biodiversity Information Facility

What is it?
GBIF is the world’s largest open-access biodiversity data infrastructure. It was established in 2001 under an OECD agreement and is now supported by governments and institutions from over 100 countries. GBIF does not collect data itself — it aggregates data from thousands of contributing institutions (natural history museums, herbaria, universities, citizen science platforms) into a single searchable portal.

What kind of data?

  • Species occurrence records (observations, herbarium specimens, fossil records, environmental DNA)
  • Each record includes: species name, coordinates, date, collector/observer, data source, and quality flags
  • Covers all life: plants, fungi, animals, bacteria

By the numbers:

  • 📦 ~3.7 billion occurrence records
  • 🏛 67,000+ contributing datasets
  • 🌱 >1.8 million species with at least one record
  • 📅 Records dating back to the 1600s (digitized herbarium specimens)

What is it used for?
Species distribution modeling, biodiversity assessments, conservation planning, climate change impact studies, invasive species monitoring.

Known limitations:
Records are not evenly distributed — Europe, North America, and Australia are heavily overrepresented. Tropical regions with the highest diversity often have the fewest records. Older records may have coordinate errors or outdated taxonomy.

Hands-on

Step 1 — Create your account

  1. Go to www.gbif.org
  2. Click Login (top right) → Register
  3. Fill in your details and confirm your email
  4. Log in — you are now part of the global GBIF community

Step 2 — Explore the interface

  1. On the homepage, take a moment to look at the numbers on the screen: How many occurrences are there? How many species?
  2. Click on Data in the top menu → Occurrences to see the global map of all records

Step 3 — Search for your plant family

  1. In the search bar at the top, type the name of your chosen plant family (e.g., Orchidaceae)
  2. GBIF will suggest matching options — click on the family name
  3. You land on the taxon page. Read the summary: how many species are accepted in this family? How many occurrences are in GBIF?
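
The same name lookup can be done programmatically. The sketch below builds a request URL for the name-matching service of the public GBIF API (`/v1/species/match`) and summarizes a response. The helper names are our own, and the sample response is a hand-written, abridged stand-in with a placeholder key, not real output.

```python
from urllib.parse import urlencode

GBIF_API = "https://api.gbif.org/v1"


def match_url(name: str) -> str:
    """URL asking GBIF's name-matching service to resolve a name."""
    return f"{GBIF_API}/species/match?" + urlencode({"name": name})


def summarize_match(response: dict) -> str:
    """One-line summary of a (parsed JSON) match response."""
    key = response.get("usageKey")
    if key is None:  # no usageKey means GBIF found no match
        return "no match"
    return f"{response.get('scientificName')} ({response.get('rank')}), key {key}"


# Hand-written stand-in for a response; 1234 is a placeholder key.
sample_response = {"usageKey": 1234, "scientificName": "Orchidaceae", "rank": "FAMILY"}

print(match_url("Orchidaceae"))
print(summarize_match(sample_response))
```

You can paste the printed URL into a browser to see the actual JSON that GBIF returns for your family.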

Step 4 — Explore the occurrence map

  1. Click on the Occurrences tab
  2. Look at the map — where are most records concentrated? Are there large gaps?
  3. Zoom into a region that interests you
  4. Click on a cluster of dots to see the individual records

Step 5 — Filter by country

  1. On the left panel, find the Country or area filter
  2. Select a country of your choice (e.g., Colombia, Germany, Brazil)
  3. How many records are there now? How does that compare to the global count?
  4. Try filtering by Basis of record: what is the difference between a Human observation and a Preserved specimen?
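
The filters in the left panel correspond to query parameters of the GBIF occurrence search API. As a rough sketch, the helper below (our own naming) builds a URL that asks only for the number of matching records. The taxon key is a placeholder that you would replace with the key of your own family.

```python
from typing import Optional
from urllib.parse import urlencode

GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def occurrence_count_url(taxon_key: int,
                         country: Optional[str] = None,
                         basis_of_record: Optional[str] = None) -> str:
    """Build a search URL; limit=0 returns the count without any records."""
    params = {"taxonKey": taxon_key, "limit": 0}
    if country:
        params["country"] = country  # two-letter ISO code, e.g. "CO"
    if basis_of_record:
        params["basisOfRecord"] = basis_of_record  # e.g. "HUMAN_OBSERVATION"
    return GBIF_OCCURRENCE_SEARCH + "?" + urlencode(params)


def share(part: int, total: int) -> float:
    """Percentage of the global count that falls inside the filter."""
    return 100.0 * part / total if total else 0.0


TAXON_KEY = 1234  # placeholder, not a real family key
print(occurrence_count_url(TAXON_KEY))
print(occurrence_count_url(TAXON_KEY, country="CO"))
print(occurrence_count_url(TAXON_KEY, country="CO", basis_of_record="PRESERVED_SPECIMEN"))
```

The `count` field in each JSON response gives the number you would otherwise read off the web interface, and `share` turns two such counts into the comparison asked for in point 3.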

Step 6 — Inspect a single record

  1. Click on any single occurrence point on the map
  2. Open the full record. What information does it contain?
  3. Look for the Issues and flags section — are there any data quality warnings?
  4. Can you find the original data source (which institution or project published this record)?
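
If you download a record as JSON, the same inspection can be scripted. The sketch below assumes the field names used by the GBIF occurrence API (`scientificName`, `basisOfRecord`, `decimalLatitude`, `decimalLongitude`, `issues`, `datasetKey`). The sample record is hand-written for illustration, and its dataset key is a placeholder.

```python
def describe_record(record: dict) -> dict:
    """Pull out the fields you would look for on the record page."""
    has_coordinates = (record.get("decimalLatitude") is not None
                       and record.get("decimalLongitude") is not None)
    return {
        "name": record.get("scientificName"),
        "basis": record.get("basisOfRecord"),
        "has_coordinates": has_coordinates,
        "issues": sorted(record.get("issues", [])),  # data quality flags
        "dataset": record.get("datasetKey"),         # identifies the publishing dataset
    }


# Hand-written example; values are not taken from a real record.
sample_record = {
    "scientificName": "Vanilla planifolia Andrews",
    "basisOfRecord": "PRESERVED_SPECIMEN",
    "decimalLatitude": 4.6,
    "decimalLongitude": -74.1,
    "issues": ["GEODETIC_DATUM_ASSUMED_WGS84", "COORDINATE_ROUNDED"],
    "datasetKey": "00000000-0000-0000-0000-000000000000",
}

summary = describe_record(sample_record)
print(summary["issues"])
```

An empty `issues` list does not prove that a record is correct; it only means that none of GBIF's automated checks raised a flag.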

2.4.5 Part II: Taxonomic & floristic databases

What are they?

Taxonomic databases provide authoritative information on species names: which names are accepted, which are synonyms, who described the species and when, and what is known about its distribution at a regional or global level. Unlike occurrence databases, they do not track individual sightings — they tell you what a species is and where it belongs.

Major databases of this type

| Database | Focus | URL |
|---|---|---|
| POWO | Vascular plants & bryophytes | powo.science.kew.org |
| World Flora Online | All plant groups | worldfloraonline.org |
| Catalogue of Life | All life | catalogueoflife.org |
| Flora of the World | Detailed regional floras | floraoftheworld.org |

🌿 POWO — Plants of the World Online

What is it?
POWO is maintained by the Royal Botanic Gardens, Kew (UK), one of the world’s leading plant science institutions. It is the global reference for plant taxonomy — the definitive answer to “is this name valid?” for vascular plants, mosses, and liverworts.

What kind of data?

  • Accepted species names and all their synonyms
  • Original publication (who described the species and when)
  • Native and introduced distribution by country and botanical region
  • Images and links to related resources

By the numbers:

  • 🌿 ~350,000 accepted plant species
  • 📚 Covers all vascular plants (flowering plants, ferns, conifers) and bryophytes
  • 🗺 Distribution data for ~200,000 species

What is it used for?
Checking whether a species name is valid, finding the accepted name when a synonym is used in old literature, understanding native vs. introduced ranges, and taxonomic research.

Known limitations:
Distribution data reflects botanical knowledge (based on floras and monographs), not individual observations — it shows whether a species is known to occur in a country, not where specifically.

Hands-on

(No account needed — just browse)

Step 1 — Search for your plant family

  1. Go to powo.science.kew.org
  2. In the search bar, type the name of your family (e.g., Orchidaceae)
  3. Click on the family name in the results

Step 2 — Explore the family page

  1. How many genera does the family contain?
  2. Choose one genus. How many species does it contain?

Step 3 — Explore a species page

  1. Click on any species within your genus of interest
  2. Look at the Distribution tab — where is this species native? Has it been introduced anywhere?
  3. Compare the POWO native range map with the GBIF occurrence map for the same species — do they match? What differences do you notice?
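
The comparison between POWO and GBIF can be expressed as simple set operations. This sketch assumes you have noted both ranges by hand as sets of country names. The function name and the example sets are made up for illustration.

```python
def compare_ranges(powo_native: set, gbif_countries: set) -> dict:
    """Split countries into agreement and the two kinds of mismatch."""
    return {
        "in_both": powo_native & gbif_countries,
        "native_without_records": powo_native - gbif_countries,
        "records_outside_native": gbif_countries - powo_native,
    }


# Made-up example sets; replace them with what you see for your species.
powo_native = {"Colombia", "Ecuador", "Peru"}
gbif_countries = {"Colombia", "Ecuador", "Germany"}

result = compare_ranges(powo_native, gbif_countries)
for label, countries in result.items():
    print(label, sorted(countries))
```

Records outside the native range may be introductions, cultivated plants in botanical gardens, or misidentifications. Native countries without records usually point to a sampling gap rather than a true absence.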

Step 4 — Investigate synonymy

  1. On the species page, look for the Synonyms section
  2. How many synonyms does this species have?
  3. Why might a single species have many different names? Discuss.
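
Synonymy matters in practice because records filed under different names have to be merged before they can be counted. A minimal sketch, using the well-known case of the tomato (Lycopersicon esculentum is a synonym of Solanum lycopersicum); the record counts are invented.

```python
# Synonym -> accepted name. In practice this table would come from POWO.
SYNONYMS = {
    "Lycopersicon esculentum": "Solanum lycopersicum",
}


def accepted_name(name: str, synonyms: dict) -> str:
    """Return the accepted name; names that are not synonyms pass through."""
    return synonyms.get(name, name)


def merge_counts(counts: dict, synonyms: dict) -> dict:
    """Add up record counts under the accepted name."""
    merged = {}
    for name, n in counts.items():
        key = accepted_name(name, synonyms)
        merged[key] = merged.get(key, 0) + n
    return merged


# Invented counts, for illustration only.
raw_counts = {"Lycopersicon esculentum": 3, "Solanum lycopersicum": 5}
print(merge_counts(raw_counts, SYNONYMS))
```

Without this step, the three records filed under the old name would look like a separate species.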

2.4.6 Part III: Functional trait databases

What are they?

Functional trait databases record measurable characteristics of organisms — leaf size, plant height, seed mass, root depth, wood density, and hundreds of other traits that influence how a plant grows, reproduces, and interacts with its environment. These traits connect species identity to ecological function.

Major databases of this type

| Database | Focus | URL |
|---|---|---|
| TRY | Global plant traits | try-db.org |
| LEDA | NW European plant traits | leda-traitbase.org |
| AusTraits | Australian plant traits | austraits.org |
| GIFT | Global floristic traits | gift.uni-goettingen.de |

🌱 TRY Plant Trait Database

What is it?
TRY is a collaborative research network and database that compiles plant trait measurements contributed by research groups from around the world. It was launched in 2007 and has grown into the most comprehensive repository of plant functional traits available.

What kind of data?

  • Measurements of >700 different plant traits
  • Data contributed directly by researchers (not aggregated from citizen science)
  • Each record includes: species, trait, measured value, unit, location, and the contributing dataset

By the numbers:

  • 📊 >15 million trait records (2024)
  • 🌿 >280,000 plant taxa with at least one trait record
  • 📏 >700 traits (from leaf nitrogen content to stem hydraulics)
  • 👥 Data from >200 contributing research groups

What is it used for?
Functional ecology, trait-based community assembly studies, global biogeographic analyses, and Earth System Models that need plant functional parameters.

Known limitations:
Coverage is very uneven: well-studied groups (e.g., temperate grasses, European trees) have thousands of records, while many tropical families are barely represented. Some traits are far better measured than others.
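
Uneven coverage is easy to quantify once trait records are in tabular form. The sketch below uses a simplified record layout of (species, trait, value) with invented values. It is not the actual TRY export format, and the function names are our own.

```python
from collections import defaultdict
from statistics import mean

# Invented records: (species, trait, value). Plant height in m, seed mass in mg.
records = [
    ("Species A", "plant height", 1.0),
    ("Species A", "plant height", 3.0),
    ("Species B", "seed mass", 2.0),
]


def species_means(records, trait):
    """Mean value of one trait per species."""
    values = defaultdict(list)
    for species, trait_name, value in records:
        if trait_name == trait:
            values[species].append(value)
    return {species: mean(vals) for species, vals in values.items()}


def coverage(records, trait, all_species):
    """Fraction of the listed species with at least one record of the trait."""
    covered = {species for species, trait_name, _ in records if trait_name == trait}
    return len(covered & set(all_species)) / len(all_species)


family = ["Species A", "Species B", "Species C"]
print(species_means(records, "plant height"))
print(coverage(records, "plant height", family))
```

Here only one of three species has a height measurement, so any family-level statement about plant height would rest on a third of the species.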

Note: Downloading data from TRY requires submitting a formal data request (reviewed within a few days).

2.4.7 Part IV: Genetic & molecular databases

What are they?

Genetic databases store DNA and protein sequences, along with metadata about the organism they came from and the study that produced them. They are the foundation of molecular phylogenetics, DNA barcoding, and genomic research.

Major databases of this type

| Database | Focus | URL |
|---|---|---|
| GenBank (NCBI) | All DNA/RNA sequences | ncbi.nlm.nih.gov/genbank |
| BOLD Systems | DNA barcodes | boldsystems.org |
| ENA | European sequence archive | ebi.ac.uk/ena |
| DDBJ | Japanese sequence archive | ddbj.nig.ac.jp |

Note: GenBank, ENA, and DDBJ are part of the International Nucleotide Sequence Database Collaboration (INSDC) — they synchronize daily, so sequences submitted to one are available in all three.

🧬 GenBank — NCBI Nucleotide Database

What is it?
GenBank is maintained by the National Center for Biotechnology Information (NCBI), part of the US National Institutes of Health. It is the world’s largest publicly accessible nucleotide sequence database. Most journals require that sequences reported in a paper be deposited in GenBank (or ENA/DDBJ), so the database closely mirrors published molecular biology research.

What kind of data?

  • DNA and RNA sequences from all organisms
  • Complete genomes, individual genes, environmental DNA (eDNA)
  • For plants: commonly includes rbcL, matK, ITS (used for DNA barcoding and phylogenetics)
  • Each record includes: organism, gene/region, sequence, publication, and submitter

By the numbers:

  • 🧬 >250 million sequences (2024)
  • 📈 Doubles in size approximately every 18 months
  • 🌱 Sequences from >500,000 species
  • 📅 Data going back to the early 1980s
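
To get a feeling for what a doubling time of 18 months means, the arithmetic can be written out as size(t) = size(0) × 2^(t / 18), with t in months. The short sketch below applies it to the figure given above. It is an extrapolation that only holds if the growth rate stays constant, not a forecast.

```python
def projected_size(current: float, months: float, doubling_months: float = 18.0) -> float:
    """Exponential growth: the size doubles once every `doubling_months`."""
    return current * 2 ** (months / doubling_months)


today = 250e6  # ~250 million sequences, the figure quoted above

for years in (1.5, 3, 6):
    size = projected_size(today, months=years * 12)
    print(f"after {years} years: ~{size / 1e6:.0f} million sequences")
```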

What is it used for?
Molecular phylogenetics (reconstructing evolutionary trees), DNA barcoding (identifying species from a DNA fragment), population genetics, and identifying unknown specimens.

Hands-on

(No account needed — just browse)

Step 1 — Navigate to the Nucleotide database

  1. Go to www.ncbi.nlm.nih.gov
  2. In the database dropdown (top left), select Nucleotide
  3. In the search bar, type the name of your plant family (e.g., Orchidaceae)
  4. How many sequences are available for your family?

Step 2 — Filter your search

  1. In the search bar, add a gene name after the family name. Try rbcL (a chloroplast gene commonly used for plant barcoding), e.g., Orchidaceae rbcL
  2. How many rbcL sequences exist for your family?
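
The same search can be sent to NCBI's E-utilities service (ESearch) instead of the web page. The sketch below builds a query with the Entrez field tags `[Organism]` and `[Gene]` and reads the hit count from a response. The helper names are our own, and the sample response is a hand-written stand-in with an invented count.

```python
from typing import Optional
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_term(taxon: str, gene: Optional[str] = None) -> str:
    """Restrict the search to an organism group and, optionally, a gene."""
    term = f"{taxon}[Organism]"
    if gene:
        term += f" AND {gene}[Gene]"
    return term


def esearch_url(term: str) -> str:
    """URL for a count-only search of the Nucleotide database (nuccore)."""
    params = {"db": "nuccore", "term": term, "retmode": "json", "retmax": 0}
    return ESEARCH + "?" + urlencode(params)


def hit_count(response: dict) -> int:
    """ESearch reports the count as a string inside 'esearchresult'."""
    return int(response["esearchresult"]["count"])


term = entrez_term("Orchidaceae", "rbcL")
print(term)
print(esearch_url(term))

# Hand-written stand-in for a parsed JSON response; the count is invented.
sample_response = {"esearchresult": {"count": "42", "retmax": "0"}}
print(hit_count(sample_response))
```

Using field tags makes the search stricter than typing two words into the search bar, so the counts may differ from what you saw on the web page.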

Step 3 — Inspect a sequence record

  1. Click on any result to open the full record
  2. Find the following information:
    • Which species is this from?
    • Which gene or genomic region was sequenced?
    • Who submitted this sequence, and in which year?
    • Is it linked to a published paper? Can you find the paper?
  3. Scroll down to the actual sequence — what does it look like?
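
In the GenBank record view, the sequence appears at the bottom under the keyword ORIGIN, in numbered lines of lower-case bases grouped in blocks of ten, and the record ends with `//`. As a small sketch, the function below extracts the sequence from such text and computes its GC content. The sample record is hand-written and heavily abridged, not a real accession.

```python
def sequence_from_genbank(text: str) -> str:
    """Collect the bases listed between the ORIGIN line and the closing //."""
    blocks = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break
        if in_origin:
            parts = line.split()
            blocks.extend(parts[1:])  # parts[0] is the position number
    return "".join(blocks).upper()


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C."""
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Hand-written, abridged example of the record layout.
sample_record = (
    "LOCUS       EXAMPLE\n"
    "ORIGIN\n"
    "        1 atgtcaccac aaacagagac taaa\n"
    "//\n"
)

seq = sequence_from_genbank(sample_record)
print(seq, len(seq), gc_content(seq))
```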

2.4.8 Task

Now that you have explored each database, bring it all together. Choose one plant family (it can be the same one you used above, or a new one) and investigate it across all four databases. Answer the following questions:

  1. Where does your family occur according to GBIF?
  2. Do the GBIF records fall within the native distribution described for the taxa in POWO?
  3. How well is your family represented in the genetic record? (GenBank)

“You have now searched for the same plant family in different databases. Could any single one of them have answered all your questions? What would you lose if you only had access to one?”

2.4.9 Literature

  • Chamberlain, S. et al. (2021). rgbif: Interface to the Global Biodiversity Information Facility API. R package.
  • Kattge, J. et al. (2020). TRY plant trait database – enhanced coverage and open access. Global Change Biology, 26(1), 119–188.
  • Turland, N.J. et al. (2018). International Code of Nomenclature for algae, fungi, and plants. Regnum Vegetabile 159. Koeltz Botanical Books.
  • GBIF Secretariat (2023). GBIF — The Global Biodiversity Information Facility. www.gbif.org