2.4 Big data in biodiversity research
2.4.1 Background
Advances in data collection, storage, and digital transfer have transformed biodiversity research into a big data science. What once required decades of fieldwork and manual cataloguing can now be explored in seconds through open-access databases containing billions of records. These resources compile occurrence data, taxonomic knowledge, functional traits, and genetic sequences — and they are freely available to anyone with an internet connection.
Biodiversity databases are not just repositories: they are living scientific infrastructure, continuously updated by researchers, institutions, herbaria, natural history museums, and citizen scientists around the world. Understanding what each database contains, what it was built for, and where its limitations lie is an essential skill for any modern biologist.
2.4.2 Learning objectives
By the end of this session, you will be able to:
- Describe how biodiversity research has evolved into a “big data” science.
- Identify the major types of open-access biodiversity databases and what kind of data each holds.
- Independently navigate four major databases: GBIF, POWO, TRY, and GenBank.
- Critically evaluate the availability and quality of data for a plant family of your choice.
2.4.3 Required Preparation
None, but bring curiosity! Before the session starts, choose a plant genus or family that interests you. It must contain between 20 and 300 species.
(Not sure? Try: Orchidaceae, Cactaceae, Bromeliaceae, Poaceae, Fabaceae, or Araceae)
2.4.4 Part I: Occurrence & distribution repositories
What are they?
Occurrence repositories compile records of where and when a species was observed or collected. Each record typically includes a species name, geographic coordinates, date, and information about who recorded it and how (field observation, herbarium specimen, camera trap, etc.).
These databases are the backbone of macroecological research: they are used to model species distributions, track range shifts under climate change, identify biodiversity hotspots, and guide conservation planning.
Major databases of this type are:
| Database | Focus | Records | URL |
|---|---|---|---|
| GBIF | All life on Earth | ~3.7 billion | gbif.org |
| iNaturalist | Citizen science observations | ~200 million | inaturalist.org |
| BIEN | Plants of the Americas | ~200 million | biendata.org |
| Paleobiology DB | Fossil occurrences | ~1.5 million | paleobiodb.org |
Note: iNaturalist shares its research-grade observations with GBIF, so when you explore GBIF you are already seeing iNaturalist observations alongside museum specimens and research surveys.
🌍 GBIF — Global Biodiversity Information Facility
What is it?
GBIF is the world’s largest open-access biodiversity data infrastructure. It was established in 2001 under an OECD agreement and is now supported by governments and institutions from over 100 countries. GBIF does not collect data itself — it aggregates data from thousands of contributing institutions (natural history museums, herbaria, universities, citizen science platforms) into a single searchable portal.
What kind of data?
- Species occurrence records (observations, herbarium specimens, fossil records, environmental DNA)
- Each record includes: species name, coordinates, date, collector/observer, data source, and quality flags
- Covers all life: plants, fungi, animals, bacteria
By the numbers:
- 📦 ~3.7 billion occurrence records
- 🏛 67,000+ contributing datasets
- 🌱 >1.8 million species with at least one record
- 📅 Records dating back to the 1600s (digitized herbarium specimens)
What is it used for?
Species distribution modeling, biodiversity assessments, conservation planning, climate change impact studies, invasive species monitoring.
Known limitations:
Records are not evenly distributed — Europe, North America, and Australia are heavily overrepresented. Tropical regions with the highest diversity often have the fewest records. Older records may have coordinate errors or outdated taxonomy.
Hands-on
Step 1 — Create your account
- Go to www.gbif.org
- Click Login (top right) → Register
- Fill in your details and confirm your email
- Log in — you are now part of the global GBIF community
Step 2 — Explore the interface
- On the homepage, take a moment to look at the numbers on the screen: How many occurrences are there? How many species?
- Click on Data in the top menu → Occurrences to see the global map of all records
Step 3 — Search for your plant family
- In the search bar at the top, type the name of your chosen plant family (e.g., Orchidaceae)
- GBIF will suggest matching options — click on the family name
- You land on the taxon page. Read the summary: how many species are accepted in this family? How many occurrences are in GBIF?
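The name lookup in this step can also be scripted. The sketch below assumes GBIF's public REST API at api.gbif.org and its `species/match` endpoint. The helper names (`build_match_url`, `summarize_match`, `fetch_json`) are illustrative, and the sample response uses a placeholder key, not a real taxon key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_API = "https://api.gbif.org/v1"  # assumed base URL of the public GBIF API


def build_match_url(name, rank="FAMILY"):
    """Build a URL asking GBIF to match a name against its taxonomic backbone."""
    return f"{GBIF_API}/species/match?" + urlencode({"name": name, "rank": rank})


def summarize_match(response):
    """Keep only the fields needed for later occurrence searches."""
    return {
        "key": response.get("usageKey"),
        "name": response.get("scientificName"),
        "rank": response.get("rank"),
        "match_type": response.get("matchType"),
    }


def fetch_json(url):
    """Download and decode a JSON response (needs an internet connection)."""
    with urlopen(url, timeout=30) as handle:
        return json.load(handle)


# Offline illustration with a made-up response; the key 1234 is a placeholder.
sample = {
    "usageKey": 1234,
    "scientificName": "Orchidaceae",
    "rank": "FAMILY",
    "matchType": "EXACT",
}
print(build_match_url("Orchidaceae"))
print(summarize_match(sample))
```

With an internet connection, `summarize_match(fetch_json(build_match_url("Orchidaceae")))` should return the key that GBIF uses for the family in occurrence searches. Check the `match_type` before trusting the result, since a fuzzy match may point to a different taxon than the one you meant.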
Step 4 — Explore the occurrence map
- Click on the Occurrences tab
- Look at the map — where are most records concentrated? Are there large gaps?
- Zoom into a region that interests you
- Click on a cluster of dots to see the individual records
Step 5 — Filter by country
- On the left panel, find the Country or area filter
- Select a country of your choice (e.g., Colombia, Germany, Brazil)
- How many records are there now? How does that compare to the global count?
- Try filtering by Basis of record: what is the difference between a Human observation and a Preserved specimen?
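The filters used in this step correspond to query parameters in GBIF's public API. As a minimal sketch, assuming the `occurrence/search` endpoint with its `taxonKey`, `country` (two-letter ISO code), and `basisOfRecord` parameters, the helper below builds the query URL. The function names and the taxon key 1234 are placeholders for illustration.

```python
from urllib.parse import urlencode

GBIF_API = "https://api.gbif.org/v1"  # assumed base URL of the public GBIF API


def build_occurrence_url(taxon_key, country=None, basis_of_record=None, limit=0):
    """Build an occurrence search URL; limit=0 asks for the count without records."""
    params = {"taxonKey": taxon_key, "limit": limit}
    if country:
        params["country"] = country  # e.g. "CO" for Colombia
    if basis_of_record:
        params["basisOfRecord"] = basis_of_record  # e.g. "PRESERVED_SPECIMEN"
    return f"{GBIF_API}/occurrence/search?" + urlencode(params)


def share_of_global(filtered_count, global_count):
    """Return the filtered record count as a percentage of the global count."""
    if global_count == 0:
        return 0.0
    return 100.0 * filtered_count / global_count


# 1234 is a placeholder taxon key, not the key of a real family.
print(build_occurrence_url(1234))
print(build_occurrence_url(1234, country="CO", basis_of_record="HUMAN_OBSERVATION"))
print(share_of_global(25, 1000))
```

Opening one of these URLs in a browser returns JSON whose `count` field holds the number of matching records, which you can compare with the number shown in the portal.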
Step 6 — Inspect a single record
- Click on any single occurrence point on the map
- Open the full record. What information does it contain?
- Look for the Issues and flags section — are there any data quality warnings?
- Can you find the original data source (which institution or project published this record)?
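Inspecting flags record by record does not scale, so it helps to tally them. The sketch below assumes each record is a dictionary with an `issues` list and `decimalLatitude`/`decimalLongitude` fields, as in GBIF's JSON output. The three sample records are invented for illustration, and the usability rule is one simple choice among many.

```python
from collections import Counter

# Invented records shaped like GBIF occurrence JSON; values are not real data.
records = [
    {"decimalLatitude": 4.6, "decimalLongitude": -74.1,
     "issues": ["COORDINATE_ROUNDED"]},
    {"decimalLatitude": 0.0, "decimalLongitude": 0.0,
     "issues": ["ZERO_COORDINATE", "COORDINATE_ROUNDED"]},
    {"issues": []},  # a record published without coordinates
]


def count_issues(records):
    """Tally how often each quality flag appears across all records."""
    counts = Counter()
    for record in records:
        counts.update(record.get("issues", []))
    return counts


def has_usable_coordinates(record):
    """Accept a record only if it has coordinates and is not flagged as (0, 0)."""
    has_coords = (
        record.get("decimalLatitude") is not None
        and record.get("decimalLongitude") is not None
    )
    return has_coords and "ZERO_COORDINATE" not in record.get("issues", [])


print(count_issues(records).most_common())
print([has_usable_coordinates(r) for r in records])
```

Which flags matter depends on the analysis: a rounded coordinate is harmless for a country-level summary but may be a problem for fine-scale distribution modeling.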
2.4.5 Part II: Taxonomic & floristic databases
What are they?
Taxonomic databases provide authoritative information on species names: which names are accepted, which are synonyms, who described the species and when, and what is known about its distribution at a regional or global level. Unlike occurrence databases, they do not track individual sightings — they tell you what a species is and where it belongs.
Major databases of this type
| Database | Focus | URL |
|---|---|---|
| POWO | Vascular plants & bryophytes | powo.science.kew.org |
| World Flora Online | All plant groups | worldfloraonline.org |
| Catalogue of Life | All life | catalogueoflife.org |
| Flora of the World | Detailed regional floras | floraoftheworld.org |
🌿 POWO — Plants of the World Online
What is it?
POWO is maintained by the Royal Botanic Gardens, Kew (UK), one of the world’s leading plant science institutions. It is the global reference for plant taxonomy — the definitive answer to “is this name valid?” for vascular plants, mosses, and liverworts.
What kind of data?
- Accepted species names and all their synonyms
- Original publication (who described the species and when)
- Native and introduced distribution by country and botanical region
- Images and links to related resources
By the numbers:
- 🌿 ~350,000 accepted plant species
- 📚 Covers all vascular plants (flowering plants, ferns, conifers) and bryophytes
- 🗺 Distribution data for ~200,000 species
What is it used for?
Checking whether a species name is valid, finding the accepted name when a synonym is used in old literature, understanding native vs. introduced ranges, and taxonomic research.
Known limitations:
Distribution data reflects botanical knowledge (based on floras and monographs), not individual observations — it shows whether a species is known to occur in a country, not where specifically.
Hands-on
(No account needed — just browse)
Step 1 — Search for your plant family
- Go to powo.science.kew.org
- In the search bar, type the name of your family (e.g., Orchidaceae)
- Click on the family name in the results
Step 2 — Explore the family page
- How many genera does the family contain?
- Choose one genus. How many species does it contain?
Step 3 — Explore a species page
- Click on any species within your genus of interest
- Look at the Distribution tab — where is this species native? Has it been introduced anywhere?
- Compare the POWO native range map with the GBIF occurrence map for the same species — do they match? What differences do you notice?
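The comparison between the two maps boils down to set operations. The sketch below assumes you have already written down both lists in the same geographic units (here, two-letter country codes). POWO reports botanical regions, so some manual translation is needed first. The function name and the example codes are hypothetical.

```python
def compare_ranges(powo_native, gbif_countries):
    """Compare a native range list with the countries that hold GBIF records."""
    native = set(powo_native)
    observed = set(gbif_countries)
    return {
        "confirmed": sorted(native & observed),
        "native_without_records": sorted(native - observed),
        "records_outside_native": sorted(observed - native),
    }


# Hypothetical example: native to three countries, GBIF records from three.
result = compare_ranges(["CO", "EC", "PE"], ["CO", "EC", "DE"])
for category, countries in result.items():
    print(category, countries)
```

Records outside the native range are not necessarily errors. They may reflect introductions, cultivated plants in botanical gardens, or misidentifications, and each case is worth a closer look.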
Step 4 — Investigate synonymy
- On the species page, look for the Synonyms section
- How many synonyms does this species have?
- Why might a single species have many different names? Discuss.
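Synonymy matters in practice because the same species can appear under different names in different datasets. A minimal sketch of name harmonization follows, assuming you have compiled a synonym table by hand from POWO. The table here holds a single well-known example (the tomato), and the function names are illustrative.

```python
# Synonym table compiled by hand; maps each synonym to its accepted name.
SYNONYMS = {
    "Lycopersicon esculentum": "Solanum lycopersicum",
}


def resolve_name(name, synonyms=SYNONYMS):
    """Return the accepted name for a synonym, or the cleaned name itself."""
    cleaned = " ".join(name.split())  # collapse stray whitespace
    return synonyms.get(cleaned, cleaned)


def harmonize(names, synonyms=SYNONYMS):
    """Reduce a list of names to the sorted set of accepted names."""
    return sorted({resolve_name(name, synonyms) for name in names})


names = ["Lycopersicon esculentum", "Solanum lycopersicum", "Solanum  lycopersicum "]
print(harmonize(names))
```

Without this step, the three entries above would be counted as two or three species instead of one, which inflates species counts when datasets are merged.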
2.4.6 Part III: Functional trait databases
What are they?
Functional trait databases record measurable characteristics of organisms — leaf size, plant height, seed mass, root depth, wood density, and hundreds of other traits that influence how a plant grows, reproduces, and interacts with its environment. These traits connect species identity to ecological function.
Major databases of this type
| Database | Focus | URL |
|---|---|---|
| TRY | Global plant traits | try-db.org |
| LEDA | NW European plant traits | leda-traitbase.org |
| AusTraits | Australian plant traits | austraits.org |
| GIFT | Global floristic traits | gift.uni-goettingen.de |
🌱 TRY Plant Trait Database
What is it?
TRY is a collaborative research network and database that compiles plant trait measurements contributed by research groups from around the world. It was launched in 2007 and has grown into the most comprehensive repository of plant functional traits available.
What kind of data?
- Measurements of >700 different plant traits
- Data contributed directly by researchers (not aggregated from citizen science)
- Each record includes: species, trait, measured value, unit, location, and the contributing dataset
By the numbers:
- 📊 >15 million trait records (2024)
- 🌿 >280,000 plant taxa with at least one trait record
- 📏 >700 traits (from leaf nitrogen content to stem hydraulics)
- 👥 Data from >200 contributing research groups
What is it used for?
Functional ecology, trait-based community assembly studies, global biogeographic analyses, and Earth System Models that need plant functional parameters.
Known limitations:
Coverage is very uneven: well-studied groups (e.g., temperate grasses, European trees) have thousands of records, while many tropical families are barely represented. Some traits are far better measured than others.
Note: Downloading data from TRY requires submitting a formal data request (reviewed within a few days).
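Once a data request has been granted, a first question is usually how well each trait is covered. The sketch below assumes trait records have been loaded as dictionaries with the fields listed above (species, trait, value, unit). The records, species names, and values are invented for illustration and are not TRY data.

```python
from collections import defaultdict
from statistics import mean

# Invented records; species names and values are placeholders, not TRY data.
records = [
    {"species": "Species A", "trait": "Plant height", "value": 0.5, "unit": "m"},
    {"species": "Species A", "trait": "Plant height", "value": 1.5, "unit": "m"},
    {"species": "Species B", "trait": "Plant height", "value": 12.0, "unit": "m"},
    {"species": "Species A", "trait": "Seed mass", "value": 2.5, "unit": "mg"},
]


def species_per_trait(records):
    """Count how many distinct species have at least one value for each trait."""
    species_by_trait = defaultdict(set)
    for record in records:
        species_by_trait[record["trait"]].add(record["species"])
    return {trait: len(species) for trait, species in species_by_trait.items()}


def mean_by_species(records, trait):
    """Average repeated measurements of one trait within each species."""
    values = defaultdict(list)
    for record in records:
        if record["trait"] == trait:
            values[record["species"]].append(record["value"])
    return {species: mean(vals) for species, vals in values.items()}


print(species_per_trait(records))
print(mean_by_species(records, "Plant height"))
```

Averaging within species before comparing across species keeps heavily measured species from dominating a summary. Check that all values of a trait share the same unit before averaging.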
2.4.7 Part IV: Genetic & molecular databases
What are they?
Genetic databases store DNA and protein sequences, along with metadata about the organism they came from and the study that produced them. They are the foundation of molecular phylogenetics, DNA barcoding, and genomic research.
Major databases of this type
| Database | Focus | URL |
|---|---|---|
| GenBank (NCBI) | All DNA/RNA sequences | ncbi.nlm.nih.gov/genbank |
| BOLD Systems | DNA barcodes | boldsystems.org |
| ENA | European sequence archive | ebi.ac.uk/ena |
| DDBJ | Japanese sequence archive | ddbj.nig.ac.jp |
Note: GenBank, ENA, and DDBJ are part of the International Nucleotide Sequence Database Collaboration (INSDC) — they synchronize daily, so sequences submitted to one are available in all three.
🧬 GenBank — NCBI Nucleotide Database
What is it?
GenBank is maintained by the National Center for Biotechnology Information (NCBI), part of the US National Institutes of Health. It is the world’s largest publicly accessible nucleotide sequence database. Most journals require that sequences reported in a scientific paper be deposited in GenBank (or ENA/DDBJ), which makes it a close mirror of published molecular biology research.
What kind of data?
- DNA and RNA sequences from all organisms
- Complete genomes, individual genes, environmental DNA (eDNA)
- For plants: commonly includes rbcL, matK, ITS (used for DNA barcoding and phylogenetics)
- Each record includes: organism, gene/region, sequence, publication, and submitter
By the numbers:
- 🧬 >250 million sequences (2024)
- 📈 Doubles in size approximately every 18 months
- 🌱 Sequences from >500,000 species
- 📅 Data going back to the early 1980s
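To get a feel for what a doubling time of 18 months implies, the figures above can be plugged into a simple exponential growth formula. This is only a back-of-the-envelope sketch: it assumes the doubling time stays constant, which real databases do not guarantee, and the function name is illustrative.

```python
def projected_records(current, years, doubling_time_years=1.5):
    """Project a record count forward, assuming steady exponential doubling."""
    return current * 2 ** (years / doubling_time_years)


# Start from roughly 250 million sequences and look a few years ahead.
for years in (0, 1.5, 3, 6):
    print(years, f"{projected_records(250e6, years):,.0f}")
```

Under this assumption the database quadruples in three years, which is why analyses based on GenBank should always state the date on which the data were retrieved.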
What is it used for?
Molecular phylogenetics (reconstructing evolutionary trees), DNA barcoding (identifying species from a DNA fragment), population genetics, and identifying unknown specimens.
Hands-on
(No account needed — just browse)
Step 1 — Navigate to the Nucleotide database
- Go to www.ncbi.nlm.nih.gov
- In the database dropdown (top left), select Nucleotide
- In the search bar, type the name of your plant family (e.g., Orchidaceae)
- How many sequences are available for your family?
Step 2 — Filter your search
- In the search bar, next to the family name, add a gene name. Try searching for rbcL (a chloroplast gene commonly used for plant barcoding)
- How many rbcL sequences exist for your family?
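The same search can be expressed as a query against NCBI's E-utilities. The sketch below assumes the public `esearch.fcgi` endpoint, the `nuccore` database name for Nucleotide, and the `[Organism]` and `[Gene]` field tags. The helper names are illustrative, and the sample response is made up.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"  # assumed base URL


def build_term(organism, gene=None):
    """Combine an organism name and an optional gene name into a search term."""
    term = f"{organism}[Organism]"
    if gene:
        term += f" AND {gene}[Gene]"
    return term


def build_esearch_url(organism, gene=None):
    """Build an ESearch URL; retmax=0 asks for the hit count without IDs."""
    params = {
        "db": "nuccore",
        "term": build_term(organism, gene),
        "retmode": "json",
        "retmax": 0,
    }
    return f"{EUTILS}/esearch.fcgi?" + urlencode(params)


def parse_count(response):
    """Extract the number of hits from a decoded ESearch JSON response."""
    return int(response["esearchresult"]["count"])


print(build_term("Orchidaceae", "rbcL"))
print(build_esearch_url("Orchidaceae", "rbcL"))
print(parse_count({"esearchresult": {"count": "42"}}))  # made-up response
```

The term printed by `build_term` can also be pasted directly into the search bar on the website. Field tags make the search more precise than free text, because a plain "rbcL" may also match sequences that only mention the gene in their description.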
Step 3 — Inspect a sequence record
- Click on any result to open the full record
- Find the following information:
- Which species is this from?
- Which gene or genomic region was sequenced?
- Who submitted this sequence, and in which year?
- Is it linked to a published paper? Can you find the paper?
- Scroll down to the actual sequence — what does it look like?
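If you switch a record to its FASTA view, the sequence appears as a header line starting with ">" followed by lines of bases. The sketch below parses that format and reports sequence length and GC content. The function names are illustrative, and the sample sequence is made up, not taken from a real GenBank record.

```python
def parse_fasta(text):
    """Parse FASTA text into a dictionary mapping each header to its sequence."""
    chunks = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            chunks[header] = []
        elif header is not None:
            chunks[header].append(line.upper())
    return {name: "".join(parts) for name, parts in chunks.items()}


def gc_content(sequence):
    """Return the percentage of G and C bases in a sequence."""
    if not sequence:
        return 0.0
    return 100.0 * (sequence.count("G") + sequence.count("C")) / len(sequence)


# Made-up example; not a real GenBank sequence.
fasta_text = """>example_1 invented sequence
ATGTCACCAC
AAACAGAGAC
"""

for name, sequence in parse_fasta(fasta_text).items():
    print(name, len(sequence), gc_content(sequence))
```

Length and GC content are quick sanity checks: a barcode sequence that is far shorter than others for the same gene may be a partial sequence.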
2.4.8 Task
Now that you have explored each database, bring it all together. Choose one plant family (it can be the same one you used above, or a new one) and investigate it across all four databases. Answer the following questions:
- Where does your family occur according to GBIF?
- Do the GBIF records fall within the native distribution described for your family in POWO?
- How well is your family represented in the genetic record? (GenBank)
“You have now searched for the same plant family in different databases. Could any single one of them have answered all your questions? What would you lose if you only had access to one?”
2.4.9 Literature
- Chamberlain, S. et al. (2021). rgbif: Interface to the Global Biodiversity Information Facility API. R package.
- Kattge, J. et al. (2020). TRY plant trait database – enhanced coverage and open access. Global Change Biology, 26(1), 119–188.
- Turland, N.J. et al. (2018). International Code of Nomenclature for algae, fungi, and plants. Regnum Vegetabile 159. Koeltz Botanical Books.
- GBIF Secretariat (2023). GBIF — The Global Biodiversity Information Facility. www.gbif.org