'correct' but misleading name-matching results #280

@mjwestgate

Some taxa are of conservation importance but are not taxonomically recognised. For example, if we look up the Victorian conservation list:

library(galah)
library(dplyr)

show_all(lists) |>
  filter(isAuthoritative == TRUE,
         region == "Victoria")
# A tibble: 1 × 22
  species_list_uid listName         description listType dateCreated lastUpdated lastUploaded
  <chr>            <chr>            <chr>       <chr>    <chr>       <chr>       <chr>       
1 dr655            Victoria : Cons… ""          CONSERV… 2015-04-04… 2025-07-08… 2025-07-08T…
# ℹ 15 more variables: lastMatched <chr>, username <chr>, itemCount <int>, region <chr>,
#   isAuthoritative <lgl>, isInvasive <lgl>, isThreatened <lgl>, isBIE <lgl>, isSDS <lgl>,
#   wkt <chr>, category <chr>, generalisation <chr>, authority <chr>, sdsType <chr>,
#   looseSearch <lgl>

Then look up which species are on that list, and filter to those whose matched scientificName is a single word:

species_list <- request_metadata() |>
    filter(list == "dr655") |>
    unnest() |>
    collect()

species_list |>
    filter(grepl("^[[:alpha:]]+$", scientificName))

# A tibble: 3 × 6
       id name                      commonName       scientificName lsid      dataResourceUid
    <int> <chr>                     <chr>            <chr>          <chr>     <chr>          
1 6793854 Chiastocaulon biseriale   NA               Chiastocaulon  NZOR-6-7… dr655          
2 6794205 Eucalyptus X oxypoma      Studley Park Gum Eucalyptus     https://… dr655          
3 6795458 Eucalyptus X studleyensis Studley Park Gum Eucalyptus     https://… dr655  

Each of these entries is supplied as a species (the name column), but is matched to a genus (the scientificName column). We can confirm this by running the same query through search_taxa(), e.g.

search_taxa("Eucalyptus X studleyensis")
# A tibble: 1 × 14
  search_term        scientific_name scientific_name_auth…¹ taxon_concept_id rank  match_type
  <chr>              <chr>           <chr>                  <chr>            <chr> <chr>     
1 Eucalyptus X stud… Eucalyptus      L'Hér.                 https://id.biod… genus exactMatch
# ℹ abbreviated name: ¹​scientific_name_authorship
# ℹ 8 more variables: kingdom <chr>, phylum <chr>, class <chr>, order <chr>, family <chr>,
#   genus <chr>, vernacular_name <chr>, issues <chr>

Again, this links the taxon concept to "Eucalyptus" and reports match_type as exactMatch, so we wouldn't normally flag it as an error. The problem, therefore, is that passing this taxon name to identify() in a pipe returns all Eucalyptus observations, which is almost certainly not what the user wants:

galah_call() |>
    identify("Eucalyptus X studleyensis") |>
    group_by(scientificName) |>
    count() |>
    collect()
# A tibble: 1,193 × 2
   scientificName           count
   <chr>                    <int>
 1 Eucalyptus               44131
 2 Eucalyptus obliqua       43001
 3 Eucalyptus camaldulensis 41418
 4 Eucalyptus sieberi       25192
 5 Eucalyptus melliodora    24164
 6 Eucalyptus crebra        22957
 7 Eucalyptus globoidea     22700
 8 Eucalyptus macrorhyncha  21812
 9 Eucalyptus tereticornis  21447
10 Eucalyptus muelleriana   18830
# ℹ 1,183 more rows
# ℹ Use `print(n = ...)` to see more rows

So, in summary, the ALA sometimes returns poorly targeted information that is technically correct but not useful, and none of the flags we would usually check (such as match_type) identify the undesirable behaviour.
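
In the meantime, this kind of mismatch can be detected after the fact by comparing the number of words in the search term with the number of words in the matched name. The sketch below is only an illustration: flag_coarse_matches() is a hypothetical helper, not a galah function, and the input is a hand-built stand-in for search_taxa() output (the second row is an illustrative control), so it runs offline using base R only.

```r
# Hypothetical helper (not part of galah): flag matches whose returned name
# has fewer words than the search term, e.g. a binomial matched to a genus.
flag_coarse_matches <- function(matches) {
  # Count whitespace-separated words in each element of a character vector
  n_words <- function(x) lengths(strsplit(trimws(x), "[[:space:]]+"))
  matches$coarser_than_query <-
    n_words(matches$scientific_name) < n_words(matches$search_term)
  matches
}

# Hand-built stand-in for a search_taxa() result; only the three columns
# used here are included. The second row is an illustrative control.
example <- data.frame(
  search_term = c("Eucalyptus X studleyensis", "Eucalyptus obliqua"),
  scientific_name = c("Eucalyptus", "Eucalyptus obliqua"),
  rank = c("genus", "species"),
  stringsAsFactors = FALSE
)

flagged <- flag_coarse_matches(example)

# Show only the rows that look coarser than the query
flagged[flagged$coarser_than_query, c("search_term", "scientific_name", "rank")]
```

A word count is a crude heuristic (it would miss a three-word name matched to a binomial that is still wrong, for instance), but it catches the genus-for-species case described here without relying on match_type.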

One solution might be to show the user what taxon rank is being returned by search_taxa(), for example by grouping by the rank column:

search_taxa("Eucalyptus X studleyensis") |>
    group_by(rank) |>
    summarize(count = n())
# A tibble: 1 × 2
  rank  count
  <chr> <int>
1 genus     1

This wouldn't help much in piped queries, but for taxon queries it might highlight unexpected behaviour. It would probably need to be controlled by the verbose argument of galah_config().
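
As a rough sketch of what a verbose-controlled rank summary could look like: the functions below are hypothetical (format_rank_summary() and report_ranks() do not exist in galah), and the verbose flag is passed as a plain argument rather than read from galah_config(), since how that setting would be wired in is an open question. The input is again a hand-built stand-in for search_taxa() output.

```r
# Hypothetical helper: collapse the rank column into a short summary string,
# e.g. "genus (1), species (3)".
format_rank_summary <- function(matches) {
  counts <- table(matches$rank)
  paste0(names(counts), " (", as.integer(counts), ")", collapse = ", ")
}

# Hypothetical helper: emit the summary as a message when `verbose` is TRUE,
# and return the input unchanged so it can sit in the middle of a pipe.
report_ranks <- function(matches, verbose = TRUE) {
  if (verbose) {
    message("Matched taxon ranks: ", format_rank_summary(matches))
  }
  invisible(matches)
}

# Hand-built stand-in for the search_taxa() result shown earlier
matches <- data.frame(
  search_term = "Eucalyptus X studleyensis",
  scientific_name = "Eucalyptus",
  rank = "genus",
  stringsAsFactors = FALSE
)

result <- report_ranks(matches, verbose = TRUE)
```

Using message() rather than print() means the summary can be silenced with suppressMessages() and does not interfere with the returned tibble.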

Labels: discussion (Unclear what is the best way forward)