Arctos Data Quality Checks, Reports, and Tools
Arctos includes built-in checks, reports, and tools for creating and maintaining high-quality data. Checks prevent the addition of low-quality data, while reports and tools detect problems with data after it has been entered. This document provides an overview of the data quality checks, reports, and tools available.
Data Quality Checks
When using the single record data entry form or bulkloader, the following checks occur at the point of data entry and must be resolved before a record can be saved.
Dates
Dates must be in ISO 8601 format. Wherever two dates are provided, the end date must be after the begin date. Dates are always entered as a single value; components (year, month, day, time) are extracted at the time of request and never stored. Future dates of collection (dates that fall after the current date) are not allowed.
- iDigBio Data Quality Toolkit: Date hasn’t happened yet
- iDigBio Data Quality Toolkit: Year, Month, and Day values do not match date
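The date rules above can be sketched as a small validator. This is an illustration only, not the Arctos implementation: the function name `check_event_dates` is hypothetical, the sketch handles only full calendar dates such as `2020-06-01`, and it assumes that equal begin and end dates are acceptable.

```python
from datetime import date

def check_event_dates(began, ended, today=None):
    """Return a list of problems found in a pair of ISO 8601 date strings."""
    today = today or date.today()
    problems = []
    parsed = {}
    for label, value in (("began", began), ("ended", ended)):
        try:
            # Raises ValueError for strings that are not ISO 8601 dates
            parsed[label] = date.fromisoformat(value)
        except ValueError:
            problems.append(f"{label} date is not ISO 8601: {value!r}")
    # Order can only be compared when both dates parsed (assumption: equal is fine)
    if len(parsed) == 2 and parsed["ended"] < parsed["began"]:
        problems.append("ended date is before began date")
    for label, parsed_date in parsed.items():
        if parsed_date > today:
            problems.append(f"{label} date is in the future")
    return problems
```

An empty list means the pair passed every rule; passing `today` explicitly keeps the check repeatable in tests.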
Nonprinting Characters
No fields may include a non-printing character, leading spaces, or trailing spaces.
- iDigBio Data Quality Toolkit: Incorrect character encodings
- iDigBio Data Quality Toolkit: Incorrect line endings
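A check of this kind might look like the following sketch. The function name `find_character_problems` is hypothetical, and the definition of "non-printing" used here (any character in a Unicode "Other" category, which covers control and format characters such as tabs, line endings, and zero-width spaces) is an assumption rather than the rule Arctos applies.

```python
import unicodedata

def find_character_problems(value):
    """Return a list of character-level problems in a field value."""
    problems = []
    if value != value.strip(" "):
        problems.append("leading or trailing space")
    for position, char in enumerate(value):
        # Categories starting with "C" are control, format, surrogate,
        # private-use, and unassigned code points (assumed definition)
        if unicodedata.category(char).startswith("C"):
            problems.append(
                f"non-printing character U+{ord(char):04X} at position {position}"
            )
    return problems
```

Reporting the code point and position makes it easier to find characters that are invisible in most editors.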
Catalog Numbers
Catalog numbers must match the expected format for the collection and may not duplicate a catalog number that already exists in Arctos. Any duplicate of an existing number will generate an error and fail to upload.
Pro Tip: Collections using the integer catalog number format can leave catalog number blank, and Arctos will assign the next available integer catalog number to the record as it is loaded.
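The duplicate check and the automatic assignment can be pictured with the sketch below. It is not how Arctos is implemented: `assign_catalog_number` is a hypothetical name, the set of existing numbers stands in for a database lookup, and "next available" is assumed to mean one more than the highest number already in use.

```python
def assign_catalog_number(existing, requested=None):
    """Return the catalog number to use for a new record.

    existing: set of integer catalog numbers already in use.
    requested: number entered by the user, or None if left blank.
    """
    if requested is None:
        # Assumption: "next available" means highest existing number + 1
        return max(existing, default=0) + 1
    if requested in existing:
        raise ValueError(f"catalog number {requested} already exists")
    return requested
```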
Basis of Record
Basis of record is required in Arctos and must match a controlled vocabulary that includes the terms expected in the Darwin Core Archive prepared for GBIF. Collections can select a preferred value, which is used automatically if the field is left blank during data entry.
Accession
Every record must be associated with a pre-existing accession.
Agents
Any field that accepts an agent must include a value that matches exactly one Agent in Arctos. This includes collectors, preparators, creators, identification determiners, attribute determiners, participants in transactions and publications, and creators of media.
Code Tables
Arctos publishes lists of acceptable terms (Code Tables) for many fields. Any field that accepts values from one of these code tables must match a term in the table that is accepted by the collection in which the record is being entered.
Identification (Taxon Names)
Identifications in Arctos can be made in several formats; however, they must all include a reference to at least one term from the Taxon Name Table. This table is maintained by Arctos Operators with manage_taxonomy permissions and is not guaranteed to exclude misspellings or errors. When these are discovered, there are paths for linking poorly formatted names to the correct version and/or quarantining such names from use while still allowing them to be present for the purposes of search and discoverability.
Higher Geography
Higher geography in Arctos is a controlled vocabulary composed of terms from GADM and IHO World Seas supported by shapes. Higher geography must match a term in this vocabulary, so any “misspellings” would be intentionally matching the relevant authority.
Elevation and Depth
Lowest elevation or depth cannot be greater than highest, and values are constrained to exclude elevations or depths not possible on Earth.
- iDigBio Data Quality Toolkit: Minimum and maximum elevation values mismatched
- iDigBio Data Quality Toolkit: Elevation is unlikely
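Both rules reduce to simple comparisons, as in this sketch. The function name `check_vertical_range` is hypothetical, and the bounds are passed in by the caller because the limits Arctos actually enforces are not stated here; the values in the usage note are placeholders.

```python
def check_vertical_range(minimum, maximum, lowest_allowed, highest_allowed):
    """Return problems with a minimum/maximum elevation or depth pair."""
    problems = []
    if minimum > maximum:
        problems.append("minimum is greater than maximum")
    for label, value in (("minimum", minimum), ("maximum", maximum)):
        # Bounds are caller-supplied; the real Arctos limits are not assumed
        if not lowest_allowed <= value <= highest_allowed:
            problems.append(
                f"{label} {value} is outside {lowest_allowed}..{highest_allowed}"
            )
    return problems
```

For example, `check_vertical_range(100, 12000, -500, 9000)` reports only the maximum, since 12000 exceeds the placeholder upper bound of 9000.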
Georeference
Latitude and longitude must either both be NULL or both include a value.
Datum must be supplied with coordinates, but cannot be supplied without them. In addition, georeference protocol and georeference error cannot be supplied without coordinates, although coordinates can be supplied without them. All spatial data are converted to WGS84 and datum is explicitly provided. Input datum is also retained.
- iDigBio Data Quality Toolkit: Missing geodetic datum
- iDigBio Data Quality Toolkit: Georeference metadata with no associated georeference
Coordinate values are datatyped to disallow invalid entries.
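The dependencies between coordinates and their metadata can be summarized in a sketch like the one below. The function name `check_georeference` and its parameter names are hypothetical, `None` stands in for NULL, and the range limits (latitude within -90 to 90, longitude within -180 to 180) are an assumption about what the coordinate datatypes enforce.

```python
def check_georeference(lat=None, lon=None, datum=None, protocol=None, error_m=None):
    """Return problems with a set of georeference fields (None means NULL)."""
    problems = []
    has_lat, has_lon = lat is not None, lon is not None
    if has_lat != has_lon:
        problems.append("latitude and longitude must both be present or both be empty")
    if has_lat and has_lon:
        if datum is None:
            problems.append("datum is required with coordinates")
        # Assumed valid ranges for decimal degrees
        if not -90 <= lat <= 90:
            problems.append("latitude out of range")
        if not -180 <= lon <= 180:
            problems.append("longitude out of range")
    else:
        # Metadata is meaningless without a complete coordinate pair
        metadata = (
            ("datum", datum),
            ("georeference protocol", protocol),
            ("georeference error", error_m),
        )
        for label, value in metadata:
            if value is not None:
                problems.append(f"{label} supplied without coordinates")
    return problems
```

Protocol and error remain optional when coordinates are present, matching the rule that coordinates can be supplied without them.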
Data Quality Reports and Tools
Dates
Many legitimate very old dates exist; however, a date of collection or identification before the birth date of the collector or determiner will trigger a data quality notification in Arctos.
Arctos supports more than collecting, so something may legitimately be identified (as in an observation) prior to being collected; however, a curatorial report flags this situation for review.
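The comparisons behind these notifications can be sketched as follows. The function name `review_flags` and its parameters are hypothetical, and the sketch assumes that an unknown birth date (`None`) is never flagged. Note that these are flags for review rather than errors, since each situation can be legitimate.

```python
from datetime import date

def review_flags(collected, identified, collector_born=None, determiner_born=None):
    """Return review flags for a record's collection and identification dates."""
    flags = []
    # Assumption: unknown birth dates (None) are skipped, not flagged
    if collector_born is not None and collected < collector_born:
        flags.append("collected before collector's birth")
    if determiner_born is not None and identified < determiner_born:
        flags.append("identified before determiner's birth")
    if identified < collected:
        flags.append("identified before collected")
    return flags
```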
Agents
Agent pages include a list of potential duplicates.
Locality
Higher geography in Arctos is a controlled vocabulary of data objects associated with spatial polygons. Components are extracted on demand, never stored. Assigned coordinates plus error that do not fall within the higher geography polygon for any location generate a data quality report for all collections using the locality. This clearly highlights improper negation as well as coordinate/geography mismatches.
- iDigBio Data Quality Toolkit: Lower geography values are provided, but no higher geography
- iDigBio Data Quality Toolkit: Mismatched geographic terms
- iDigBio Data Quality Toolkit: Coordinates do not fall within the named geographic unit
- iDigBio Data Quality Toolkit: Improperly negated latitudes/longitudes
- iDigBio Data Quality Toolkit: Coordinates are zero
  - Such a place exists and these coordinates are acceptable; however, if they do not fall inside the associated higher geography polygon, a data quality report will be generated.
- [iDigBio Data Quality Toolkit: Mismatched Country and Country Code](https://www.idigbio.org/wiki/index.php/Arctos_Data_Quality_Toolkit#Mismatched_Country_and_CountryCode_Values)
  - countrycode isn’t part of Arctos (because adding it would in many cases introduce unnecessary ambiguity)
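To see why a polygon comparison catches improperly negated coordinates, consider this simplified sketch. It uses a standard ray-casting point-in-polygon test on planar coordinates, which is not necessarily the spatial method Arctos uses. The function name `point_in_polygon` is hypothetical, the sketch ignores coordinate error and polygons that cross the antimeridian, and the rectangle in the usage note is a made-up shape, not a real higher geography polygon.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    vertex_count = len(polygon)
    for i in range(vertex_count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % vertex_count]
        # Only edges that straddle the point's latitude can be crossed
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside
```

With the made-up rectangle `[(-150, 60), (-140, 60), (-140, 70), (-150, 70)]`, the point `(-147.7, 64.8)` falls inside, while the same point with its longitude sign flipped, `(147.7, 64.8)`, falls outside, as does `(0, 0)`.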
Taxonomy
Taxon pages in Arctos include external validation through comparisons with select taxonomic authorities, including the World Register of Marine Species (WoRMS), Encyclopedia of Life (EOL), the Global Biodiversity Information Facility (GBIF), and Wikipedia, among others. This tool is also engaged whenever a new name is added to the taxonomic name table to help avoid the addition of misspellings and misused names. The Taxonomy Gap tool in Arctos allows for review of taxonomic classifications with missing terms (Family, Order, etc.) or with no associated local classification. Arctos also pulls data from GlobalNames, so records are generally still discoverable even when local taxonomic sources are missing terms or entire classifications.
Curatorial
Individual Count
Individual count is a curatorial assertion; there are no constraints.