How To Manage Taxonomy Hierarchically

Background

Hierarchical data

Hierarchical data are those in which each term is unique and each term has zero or one parent terms.

Structure of taxonomy

Taxonomy (as a body of literature) is not hierarchical. The “parent” of Echidna (a homonym) varies according to usage, and includes “Muraeninae” (eels) and “Viperidae” (snakes). The “parent” of “Neotoma” has changed with our understanding of evolutionary history, and has at times and according to various authors included “Cricetidae” and “Muridae.”

Structure of Arctos

The core taxonomy structure of Arctos is built to hold all of the above data and anything else which might have been considered “taxonomy” at some point. Arctos taxonomy consists of grouped, sometimes named, sometimes ordered terms. There is no enforced consistency across or within taxon names; the structure which allows varying taxon concepts also allows inconsistent data (which are programmatically indistinguishable from taxon concept assertions).

Hierarchical data in Arctos

Arctos provides a hierarchical taxonomy management tool suitable for consistently managing large groups of taxa as hierarchical data. Hierarchical data are those in which each term is unique and has zero or one parent terms. That is, the term “Neotoma” can exist precisely one time in a hierarchy. A term can have any number of children (“Neotoma albigula”, “Neotoma cinerea”, etc.), and those terms may have children of their own (“Neotoma albigula albigula”). “Neotoma” must have zero or one parent terms: its parent may be “Cricetidae” or “Muridae,” but not both. There is no possibility of Neotoma having two parent terms within the same hierarchy.
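The constraint described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Arctos stores or validates its data; the `Hierarchy` class and its method names are hypothetical.

```python
class Hierarchy:
    """Illustrative only: each term appears once and has zero or one parent."""

    def __init__(self):
        # Maps term -> parent term (None for a term with no parent).
        self.parent = {}

    def add(self, term, parent=None):
        # A term can exist precisely one time in a hierarchy.
        if term in self.parent:
            raise ValueError(f"{term!r} already exists in this hierarchy")
        # In this sketch a parent must be added before its children.
        if parent is not None and parent not in self.parent:
            raise KeyError(f"parent {parent!r} must be added first")
        self.parent[term] = parent

    def children(self, term):
        # A term may have any number of children.
        return sorted(t for t, p in self.parent.items() if p == term)


h = Hierarchy()
h.add("Cricetidae")
h.add("Neotoma", "Cricetidae")
h.add("Neotoma albigula", "Neotoma")
h.add("Neotoma cinerea", "Neotoma")
h.add("Neotoma albigula albigula", "Neotoma albigula")

# A second "Neotoma" under a different parent is rejected.
try:
    h.add("Neotoma", "Muridae")
    second_parent_accepted = True
except ValueError:
    second_parent_accepted = False
```

Because the parent is stored as a single value per term, a second parent cannot be represented at all; the check in `add` only makes the rejection explicit.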

Finding consistency

Data in Arctos are seldom consistent even at very small scales. For example, the Arctos classifications for Neotoma devia (http://arctos.database.museum/name/Neotoma%20devia):

screen shot 2018-09-14 at 9 33 06 am

and Neotoma devia monstrabilis (http://arctos.database.museum/name/Neotoma%20devia%20monstrabilis):

screen shot 2018-09-14 at 9 33 52 am

diverge at the level of subfamily.

This inconsistency is not possible in hierarchical data, so the import process will attempt reconciliation. This will very likely result in orphaned ranks.

screen shot 2018-09-14 at 9 35 28 am

In this example, the processor encountered subfamily “Neotominae” first and so “Sigmodontinae” has been ignored. It is necessary to reconcile these discrepancies before exporting the data back to Arctos.
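A rough sketch of a "first parent wins" reconciliation follows. It is an assumption about the general approach, not the actual import code: the function name, the lineage lists, and the family used in the sample data are all illustrative.

```python
def build_parent_map(classifications):
    """Flatten root-to-tip lineages into term -> parent, keeping the first parent seen.

    Returns the parent map plus a list of (term, kept_parent, ignored_parent)
    tuples describing every disagreement that was skipped.
    """
    parent = {}
    conflicts = []
    for lineage in classifications:
        prev = None
        for term in lineage:
            if term not in parent:
                parent[term] = prev
            elif parent[term] != prev:
                # The term already has a parent; the later assertion is ignored.
                conflicts.append((term, parent[term], prev))
            prev = term
    return parent, conflicts


# Illustrative lineages; the family shown here is an assumption.
classifications = [
    ["Cricetidae", "Neotominae", "Neotoma", "Neotoma devia"],
    ["Cricetidae", "Sigmodontinae", "Neotoma", "Neotoma devia",
     "Neotoma devia monstrabilis"],
]

parent, conflicts = build_parent_map(classifications)

# Terms that ended up with no children and were never a tip of any lineage
# are candidates for "orphaned ranks" needing manual review.
tips = {lineage[-1] for lineage in classifications}
has_children = {p for p in parent.values() if p is not None}
orphan_candidates = sorted(t for t in parent if t not in has_children and t not in tips)
```

In this sketch "Sigmodontinae" is left in the map with nothing beneath it, which is one way an orphaned rank can arise; the conflict list records what was ignored so it can be reconciled by hand.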

Dealing with these inconsistencies is a very large proportion of the work involved, and the source of all data lost in the hierarchical editor. For collections which can manage specimen data under a hierarchical taxonomy, we highly recommend avoiding sources which are edited with tools that allow the existence of inconsistent data.

Impacts

Changing the subfamily of Neotoma in the Arctos single-record editor is a hit-or-miss prospect. A user would have to find all records which are their idea of Neotoma (e.g., exclude those which are homonyms or hemihomonyms), update each individually, and hope that the process of editing 145 records has somehow not introduced other inconsistencies.

A request for DBA update would involve a non-taxonomist attempting to match strings in a system with tens of thousands of known homonyms. This has not worked well in the past.

The classification bulkloader can work, but runs the same risk of encountering [hemi]homonyms as the DBA update. Finding the records to update becomes increasingly difficult when the names do not share strings - finding Neotoma is straightforward, but finding all genera which should be under Neotominae is effectively impossible due to variations in the data.

Importing data to the hierarchical editor comes with all of the above difficulties in finding data, but unlike other tools the editor provides reports for missed or inconsistent data. Orphaned terms provide another strong indication of inconsistency. For example, finding “Arachnida” somewhere in the Neotoma data might indicate that a homonym has been used for identifications in unexpected ways. (The classification bulkloader would simply over-write the spider classifications, potentially altering specimens in unrelated collections.)

Importing clean data - that which has not been edited by non-hierarchical tools - to the hierarchical editor involves only providing a “seed” (e.g., “Neotoma” or “Cricetidae”) and clicking a button.

Changing the family of Neotoma in the hierarchical editor involves only dragging the term “Neotoma” to a new parent.
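To see why a single drag is enough, consider a simplified sketch in which each term stores only its parent. The data and the `lineage` and `move` helpers are hypothetical, and intermediate ranks such as subfamily are left out for brevity.

```python
# Simplified parent map; real hierarchies would include more ranks.
parent = {
    "Cricetidae": None,
    "Muridae": None,
    "Neotoma": "Cricetidae",
    "Neotoma devia": "Neotoma",
    "Neotoma devia monstrabilis": "Neotoma devia",
}


def lineage(parent, term):
    """Return the chain of terms from the root down to `term`."""
    chain = []
    while term is not None:
        chain.append(term)
        term = parent[term]
    return chain[::-1]


def move(parent, term, new_parent):
    """Re-parent `term`, refusing moves that would create a cycle."""
    if term in lineage(parent, new_parent):
        raise ValueError("cannot move a term beneath itself")
    parent[term] = new_parent


before = lineage(parent, "Neotoma devia monstrabilis")
move(parent, "Neotoma", "Muridae")  # the equivalent of one drag
after = lineage(parent, "Neotoma devia monstrabilis")
```

Only one value changes, yet every descendant of “Neotoma” now resolves to the new family, because lineages are derived from the parent links rather than stored per record.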

A user attempting to find specimens amongst inconsistent data will almost certainly fail without knowing they’ve failed. For example, a search for “Neotominae” against the classification data above will find Neotoma devia but not Neotoma devia monstrabilis. Most users will not realize that they’re missing a subspecies, but instead will assume that what they’ve found is all that’s available from Arctos and move on.

A user attempting to find specimens amongst consistent data will find all or nothing. A search for “Neotominae” against consistent data will find either all Neotoma or none. A user finding no specimens will generally realize that their result is unlikely and continue looking (perhaps by visiting a taxonomy page before performing another search).
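The difference between the two search outcomes can be illustrated with a small sketch. The records, the family name, and the function names below are assumptions made for the example; they are not Arctos search code.

```python
# Inconsistent data: each name carries its own independently edited classification.
inconsistent = {
    "Neotoma devia": ["Cricetidae", "Neotominae", "Neotoma"],
    "Neotoma devia monstrabilis": ["Cricetidae", "Sigmodontinae", "Neotoma",
                                   "Neotoma devia"],
}


def search_flat(classifications, term):
    """Find names whose own classification mentions `term`."""
    return sorted(n for n, terms in classifications.items() if term in terms)


# Consistent data: one parent per term, so lineages cannot disagree.
parent = {
    "Cricetidae": None,
    "Neotominae": "Cricetidae",
    "Neotoma": "Neotominae",
    "Neotoma devia": "Neotoma",
    "Neotoma devia monstrabilis": "Neotoma devia",
}


def ancestors(parent, name):
    """Collect every term above `name` by walking the parent links."""
    found = []
    current = parent[name]
    while current is not None:
        found.append(current)
        current = parent[current]
    return found


def search_hierarchy(parent, term):
    """Find names that sit anywhere beneath `term`."""
    return sorted(n for n in parent if term in ancestors(parent, n))


partial = search_flat(inconsistent, "Neotominae")      # silently misses the subspecies
everything = search_hierarchy(parent, "Neotominae")    # all Neotoma names
nothing = search_hierarchy(parent, "Sigmodontinae")    # an obviously empty result
```

The flat search returns a plausible but incomplete answer, while the hierarchical search returns either the whole group or an empty result that invites a second look.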

Using the Hierarchical Editor

Overview

Note that the hierarchical tool can only be used to structure classifications in Arctos and Arctos Plants. It cannot be used to clean up classifications in the source WoRMS (via Arctos), which is managed externally by the World Register of Marine Species.

  1. Create a dataset (hierarchy name). This is a group of names which are managed together. Datasets are wholly arbitrary “working groups” of names which may be created for any reason and deleted at any time without affecting any “read” data. Datasets do not necessarily need to correspond with Sources in core Arctos taxonomy. It is possible, for example, to manage dataset “shrews” and dataset “voles” separately, but repatriate them to core Arctos Source “shrews and voles.” (Higher taxonomy should be carefully coordinated in such a system.)

  2. Find data to import. A common method is to use the “download” link from any Arctos taxon page; it can attempt to build hierarchical data from whatever’s found. Alternatively, a CSV template is available, and data from any source may be munged into it.

  3. Deal with errors. Core Arctos data (and taxonomy itself) is not hierarchical, and the download scripts which convert seeded taxa to a hierarchical structure are very likely to encounter errors which should be dealt with early in the process. Fixing errors may be best done by editing single records in Arctos, deleting local data, and re-seeding. Errors may be fixed in the download CSV before loading to the editor, or at any other time.

  4. Manage data hierarchically.

  5. Export your data to the Arctos classification bulkloader.

  6. Review data in the bulkloader (it will be in tabular/spreadsheet format), then repatriate to Arctos using the classification bulkloader.

Tips and Tricks

Some operations are easier done in text files. To make a consistent change across a dataset, the following actions may be performed:

  1. Download CSV
  2. Edit using any CSV editor. For example, nomenclatural_code could be added to all records more efficiently in text form.
  3. Delete the contents of the dataset
  4. Upload the modified CSV, continue editing hierarchically
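As an example of step 2, the following sketch sets nomenclatural_code on every row of a downloaded CSV using Python's standard csv module. Only the nomenclatural_code column name comes from the text above; the other column names, the sample rows, and the value "ICZN" are placeholders, and the real download may be laid out differently.

```python
import csv
import io


def add_constant_column(csv_text, column, value):
    """Return CSV text with `column` set to `value` on every row.

    The column is appended if missing; existing values are overwritten.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or [])
    if column not in fieldnames:
        fieldnames.append(column)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = value
        writer.writerow(row)
    return out.getvalue()


# Placeholder data standing in for a downloaded dataset.
sample = (
    "scientific_name,parent_name\n"
    "Neotoma,Cricetidae\n"
    "Neotoma devia,Neotoma\n"
)
updated = add_constant_column(sample, "nomenclatural_code", "ICZN")
```

For a real file, read the download into a string (or adapt the function to take file handles), write the result to a new file, and keep the original until the re-upload has been checked.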

Example: Import from Arctos, export to new Source

Open the hierarchical editor and create a source if you haven’t already.

Screen Shot 2021-04-05 at 8 23 02 AM

Find some taxon and click download

Screen Shot 2021-04-05 at 8 19 21 AM

Use Option Two with your Source

Screen Shot 2021-04-05 at 11 23 08 AM

Back to the hierarchical editor, click…

Screen Shot 2021-04-05 at 11 27 41 AM

First option, choose file

Screen Shot 2021-04-05 at 11 28 38 AM

continue

Screen Shot 2021-04-05 at 11 29 09 AM

The import runs and shows its results

Screen Shot 2021-04-05 at 11 29 36 AM

Manage

Screen Shot 2021-04-05 at 11 30 15 AM

The data appear. Note the outliers: they should be expected, since they simply reflect the inconsistent data in the Arctos classification. Drag terms around to rearrange them.

Screen Shot 2021-04-05 at 11 24 43 AM

To save/repatriate back to “core Arctos”, first navigate back to hierarchical editor home screen, then select a target source and save.

Screen Shot 2021-04-05 at 11 34 14 AM

Export:

Screen Shot 2021-04-05 at 11 35 08 AM

Status will change

Screen Shot 2021-04-05 at 11 50 14 AM

Wait a while on production; on test, run ScheduledTasks/hier_to_bulk.cfm. When status has fully changed again….

Screen Shot 2021-04-05 at 11 51 41 AM

….check classification bulkloader

Screen Shot 2021-04-05 at 11 35 42 AM

It works like other Component Loaders

Screen Shot 2021-04-05 at 11 52 34 AM

Change to autoload….

Screen Shot 2021-04-05 at 11 53 32 AM

…and hang out a while (or on test, run component_loader). When the classification loader has been cleared, the data will be available in the normal place

Screen Shot 2021-04-05 at 11 55 23 AM