library(ggplot2)
library(dplyr)
library(arrow)
library(dbplyr, warn.conflicts = FALSE)
library(duckdb)
library(terra)
library(mapview) # for interactive maps
# set ggplot theme of choice
theme_set(ggpubr::theme_pubr())
Using the iNaturalist Canada parquet file
1 What is a parquet file?
For this working group, iNaturalist Canada provided us with their data as a parquet file. Parquet is an open source file format that compresses and encodes large datasets for more efficient data storage and, importantly, more efficient retrieval of information within the dataset. This format makes the data lighter to handle because it is column-oriented, rather than row-oriented like a CSV file (which quickly gets overwhelming when you have many rows!).
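To see the difference for yourself, here is a minimal sketch (using the diamonds dataset that ships with ggplot2, not the iNaturalist file) that writes the same table to parquet and to CSV and compares file sizes:
# small, self-contained comparison using ggplot2's built-in diamonds data
tmp_parquet <- tempfile(fileext = ".parquet")
tmp_csv <- tempfile(fileext = ".csv")
arrow::write_parquet(ggplot2::diamonds, tmp_parquet)
write.csv(ggplot2::diamonds, tmp_csv, row.names = FALSE)
file.size(c(tmp_parquet, tmp_csv)) # the parquet file should be considerably smaller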
This little tutorial will show you how to open the parquet file if you would like to explore it yourself. As a reminder, we have already made density maps from this file, so you may not need to explore it (depending on the challenge you’d like to build)!
First, load the R packages you will need to access the data and build the density map layers (these are the library() calls shown at the top of this page):
Open the parquet file
Next, we will open the parquet file using the arrow package:
inat_pq <- arrow::open_dataset("~/McGill University/Laura's Lab_Group - BioBlitz/data/raw/biodiversity-data/inat-canada/iNat_non_sensitive_data_Jan2025.parquet")
Note that this isn't the same as loading iNaturalist Canada as a CSV into your R session. Here, we're only establishing a connection to the dataset, so we can then query just the data we really want.
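If you want to convince yourself that nothing heavy has been pulled into memory yet, here is a quick (optional) check:
class(inat_pq) # an Arrow Dataset object, not a data.frame
nrow(inat_pq)  # the row count comes from the file's metadata, not from reading all the rows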
Take a look at the columns that are available in the file. We'll use these columns to query the data with dplyr later:
inat_pq$schema
Schema
OBJECTID: int64
id: int32
observed_on_string: string
user_login: string
quality_grade: string
license: string
url: string
tag_list: string
description: string
num_identification_agreements: int32
num_identification_disagreements: int32
captive_cultivated: string
latitude: double
longitude: double
positional_accuracy: int32
public_positional_accuracy: int32
geoprivacy: string
taxon_geoprivacy: string
coordinates_obscured: string
positioning_method: string
positioning_device: string
place_town_name: string
place_county_name: string
place_state_name: string
place_country_name: string
place_admin1_name: string
place_admin2_name: string
species_guess: string
scientific_name: string
common_name: string
iconic_taxon_name: string
geometry: binary
See $metadata for additional Schema metadata
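If you'd like to peek at a few rows before writing a bigger query, something like this works (the columns chosen here are just an example):
inat_pq |>
  select(scientific_name, iconic_taxon_name, latitude, longitude) |>
  head(10) |>
  collect()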
Query the data
Next, we’ll query the data we want, and we will finish the query with the collect()
command to load the query’s result into this R session.
Here, we’ll summarise the number of observations per species, per group using dplyr
(but you can use other dplyr
functions like filter()
for example!)
query <- inat_pq |>
  # E.g. you could filter by species group:
  # filter(iconic_taxon_name == "Aves") |>
  # Summarise number of obs per species, per species group
  group_by(iconic_taxon_name, scientific_name) |>
  summarize(total_obs = n()) |>
  # load the query into our R session
  collect()
Let’s have a look at the distribution of the number of observations per species, per group. We’ll put a vertical line to show the 100 observations/species mark, which we’ll dive into soon.
ggplot(data = query) +
geom_density(aes(x = total_obs,
fill = iconic_taxon_name),
alpha = .5, lwd = .1) +
geom_vline(xintercept = 100) +
scale_x_log10() +
labs(x = "Number of observations per species",
y = "Density (of species)",
fill = "Species group") +
theme(legend.position = "right")
Let’s summarise our initial query to the species group level, to see which groups have more samples:
summary_query = query |>
  group_by(iconic_taxon_name) |>
  # compute the median of the per-species counts first, before total_obs is
  # overwritten with the group-level sum
  summarise(median_obs = median(total_obs),
            total_obs = sum(total_obs))
This is just a few lines of code to reorder the species group names according to the number of observations, for prettier plots:
query$iconic_taxon_name = factor(query$iconic_taxon_name,
                                 levels = summary_query$iconic_taxon_name[order(summary_query$median_obs)])
summary_query$iconic_taxon_name = factor(summary_query$iconic_taxon_name,
                                         levels = summary_query$iconic_taxon_name[order(summary_query$total_obs)])
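An equivalent option, if you prefer, is forcats::fct_reorder() (forcats is not loaded above, so this is just a sketch of the alternative):
# reorder the levels of iconic_taxon_name by the median of total_obs per group
query$iconic_taxon_name <- forcats::fct_reorder(query$iconic_taxon_name,
                                                query$total_obs, .fun = median)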
Which groups have more total samples?
ggplot(data = summary_query) +
geom_bar(aes(y = iconic_taxon_name,
x = total_obs,
fill = iconic_taxon_name),
stat = "identity") +
theme(legend.position = "none") +
scale_x_log10()
Which species have < 100 observations?
Species with under 100 observations are of special interest to iNaturalist, so let’s have a look at which species have fewer than 100 observations.
First, let’s filter our initial query to only retain species with < 100 observations. Here, I removed species with < 5 observations to get rid of some species with mistakes in their names, or that were just higher-level classes, etc. (But you can remove this condition too to make sure you’ve got everything).
query_under100 = query |>
  filter(total_obs < 100 & total_obs > 5)
How many species in each group have fewer than 100 observations?
query_under100 |>
  group_by(iconic_taxon_name) |>
  summarise(nsp = n_distinct(scientific_name))
# A tibble: 14 × 2
iconic_taxon_name nsp
<fct> <int>
1 Protozoa 159
2 Chromista 240
3 Actinopterygii 350
4 Reptilia 68
5 Mollusca 577
6 Amphibia 37
7 Animalia 1246
8 Arachnida 559
9 Mammalia 284
10 Fungi 3045
11 Aves 577
12 Insecta 9484
13 Plantae 5781
14 <NA> 91
Let’s have a look at the number of observations per species for these under-sampled species.
ggplot(data = query_under100) +
geom_histogram(aes(x = total_obs,
fill = iconic_taxon_name),
bins = 10) +
theme(legend.position = "none") +
facet_wrap(~iconic_taxon_name, ncol = 2) +
labs(x = "Number of observations per species",
y = "Number of species")
We can see that for groups like fungi, insects and plants, there are many species with fewer than 30 observations!
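If you want to put numbers on that, here is a quick follow-up query (just a sketch, reusing the objects created above):
query_under100 |>
  filter(total_obs < 30) |>
  group_by(iconic_taxon_name) |>
  summarise(n_species_under30 = n_distinct(scientific_name)) |>
  arrange(desc(n_species_under30))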
Convert the parquet query into a vector and a raster layer
Let’s make the iNaturalist query into a spatial object. First, we’ll make a vector layer, which we’ll then transform into a raster layer.
For the example, we’ll only do this for plants:
obs <- inat_pq |>
  # filter to plant species with < 100 observations:
  filter(scientific_name %in% query_under100$scientific_name,
         iconic_taxon_name %in% "Plantae") |>
  # select columns we want to keep
  select(c(longitude, latitude,
           iconic_taxon_name,
           scientific_name,
           coordinates_obscured)) |>
  collect()
Next, we need the polygon of Canada and a base grid that we’ll use to make our raster.
# Load Canada polygon
canada = sf::read_sf("~/McGill University/Laura's Lab_Group - BioBlitz/data/raw/base-layers/canada-polygon/canada.outline.shp")

# Read in base grid layer to make the raster
base.50k = terra::rast("~/McGill University/Laura's Lab_Group - BioBlitz/data/raw/base-layers/canada.base.5k.tiff") |>
  # convert to a coarser resolution for this example
  terra::aggregate(fact = 10)
Next, we convert the iNaturalist data to a vector layer and make sure to project it and to crop out any points that are not within Canada (just in case):
# convert the iNaturalist observations to a vector layer
inat.v = terra::vect(obs, geom = c("longitude", "latitude"),
                     crs = "+proj=longlat +ellps=WGS84")
# project to match the base grid
inat.v = terra::project(inat.v, crs(base.50k))
# crop observations outside of Canada
inat.v = terra::crop(inat.v, canada)
Now we can convert this vector into a raster of the density of observations of plant species that have < 100 observations in Canada:
# make iNat density layer
rareplant.density = rasterize(inat.v, base.50k, fun = "length")

# plot it!
pal = viridis::turbo(5)
mapview(rareplant.density,
        col.regions = pal,
        layer.name = "Density",
        na.color = "transparent")
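If you would like to keep this density layer for later use (for example, to build a challenge on top of it), you can write it to disk with terra; the file name below is just a placeholder:
# save the density raster as a GeoTIFF (hypothetical output path)
terra::writeRaster(rareplant.density, "rareplant_density_50km.tif", overwrite = TRUE)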