I think the main issue at hand is that the unique identifiers we attach to the actual specimen need to be the unique value we want to report in publications and be able to find in some other system again.
The Darwin Core Triplet of institutionCode + collectionCode + catalogNumber was thought to be that kind of identifier a while back and I think it needs to stay that way. It is important that we as curators/collection
managers/data managers/museum technicians/researchers actually physically assign such a number to the specimen (for us at the Smithsonian it’s USNMENTXXXXXXXX - no hyphens/dashes, no spaces, including all leading zeros) and use it to refer to the specimen
in our data platforms, publications (material examined lists), and other outlets as the primary unique identifier. These numbers are/need to be unique within institutions (and not just collections) and be used in full (with institutionCode USNM + collectionCode
ENT + all leading zeros from the catalogNumber component) as THE catalogNumber in
all systems. While these identifiers are technically not universally or globally unique (as they don’t follow those guidelines of 32 hexadecimal characters and 4 hyphens - https://en.wikipedia.org/wiki/Universally_unique_identifier), they are globally unique
because there should not be another museum using the acronym USNM. Institutions need to be unambiguous and GRSciColl from GBIF (https://www.gbif.org/grscicoll)
is now such a reference that we all need to adhere to.
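As a rough sketch of the rule described above (institutionCode + collectionCode + catalogNumber, no spaces, no hyphens, all leading zeros kept), the following Python shows one way to build and normalize such a number. The function names and the fixed width of 8 digits are assumptions for illustration, based only on the USNMENT00870165 pattern; this is not an official USNM routine.

```python
import re


def build_catalog_number(institution_code: str, collection_code: str,
                         number: int, width: int = 8) -> str:
    """Join the triplet without separators and keep the leading zeros.

    The width of 8 digits is an assumption taken from USNMENT00870165.
    """
    return f"{institution_code}{collection_code}{number:0{width}d}"


def normalize(raw: str, prefix: str = "USNMENT", width: int = 8) -> str:
    """Remove spaces and hyphens, then restore any missing leading zeros."""
    compact = re.sub(r"[\s\-]", "", raw).upper()
    if not compact.startswith(prefix):
        raise ValueError(f"{raw!r} does not start with {prefix}")
    digits = compact[len(prefix):]
    if not (digits.isascii() and digits.isdigit()) or len(digits) > width:
        raise ValueError(f"{raw!r} has no valid numeric part")
    return prefix + digits.zfill(width)


print(build_catalog_number("USNM", "ENT", 870165))
print(normalize("USNM ENT 870165"))
```

Both calls print the same full identifier, which is the point: however the number was typed or exported, it can be brought back to the one form that is on the specimen label.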
Not being on the technical side of databases, I would argue that a USNMENT01234567 number is globally unique
for our purposes of uniquely identifying a specimen and tracing it back to an institution/collection, collector, identifier, database record, images, and other resources outside of the owning institution that provide more data.
At the USNM, the informatics group initiated an EZ-ID ARK system (a true GUID/UUID) for all of our EMu records several years ago and now there are three unique ways to find the record: the USNMENT00870165 number,
the EZ-ID (e.g. 65665/3c4450a49-cf70-40da-bfd1-e2a1ad322163, http://n2t.net/ark:/65665/3c4450a49-cf70-40da-bfd1-e2a1ad322163), and an internal IRN (internal record number, not visible to the public). The USNM uses the EZ-ID as the occurrenceID for the upload to
GBIF and because it is an ARK system (similar to a DOI) it will actually point back to the record on our platform. GBIF assigns its own number (1321640540, https://www.gbif.org/occurrence/1321640540) to this record/specimen on their portal. However, only the
USNMENT00870165 is on the physical specimen and is available to anyone studying the specimen in the collection, on loan, or through images. It’s the only number that should be published in a taxonomic revision or any other outlet because it is physically there
and will never change/be changed.
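To make the relationship between these identifiers concrete, here is a small Python sketch that holds the two public identifiers next to the catalog number for this one fly and builds the two URLs quoted above. The class and field names are hypothetical; only the identifier values and URL patterns come from the example in this message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecimenIdentifiers:
    catalog_number: str  # the number physically attached to the specimen
    ark: str             # EZ-ID ARK, used as occurrenceID for GBIF
    gbif_key: int        # number assigned by GBIF on its portal

    def ark_url(self) -> str:
        return f"http://n2t.net/ark:/{self.ark}"

    def gbif_url(self) -> str:
        return f"https://www.gbif.org/occurrence/{self.gbif_key}"


fly = SpecimenIdentifiers(
    catalog_number="USNMENT00870165",
    ark="65665/3c4450a49-cf70-40da-bfd1-e2a1ad322163",
    gbif_key=1321640540,
)

# Hypothetical lookup table: any of the identifiers leads to the same record.
index = {fly.catalog_number: fly, fly.ark: fly, str(fly.gbif_key): fly}

print(index["USNMENT00870165"].ark_url())
print(index["1321640540"].catalog_number)
```

Only the first key in that table can be read off the specimen itself; the other two require access to a system that stores the mapping.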
David Shorthouse through Bionomia has shown several times how data providers (museums, institutions, collections) change the occurrenceID and thereby break Bionomia's ability to confirm that a certain
record has been previously attributed to a collector/identifier and does not need to be looked at again.
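A minimal Python sketch of how such changes could be spotted: compare two exports from the same provider, keyed on the catalogNumber, and report every record whose occurrenceID differs. This is not how Bionomia works internally; the function name and the occurrenceID values are made up for illustration, and the sketch assumes the catalogNumber itself is stable.

```python
def changed_occurrence_ids(old_records, new_records):
    """Return {catalogNumber: (old occurrenceID, new occurrenceID)} for changes."""
    old_by_catalog = {r["catalogNumber"]: r["occurrenceID"] for r in old_records}
    changes = {}
    for record in new_records:
        previous = old_by_catalog.get(record["catalogNumber"])
        if previous is not None and previous != record["occurrenceID"]:
            changes[record["catalogNumber"]] = (previous, record["occurrenceID"])
    return changes


# Hypothetical exports; the occurrenceID values are placeholders.
old_export = [
    {"catalogNumber": "USNMENT00870165", "occurrenceID": "old-id-1"},
    {"catalogNumber": "USNMENT01234567", "occurrenceID": "old-id-2"},
]
new_export = [
    {"catalogNumber": "USNMENT00870165", "occurrenceID": "old-id-1"},
    {"catalogNumber": "USNMENT01234567", "occurrenceID": "new-id-2"},
]

print(changed_occurrence_ids(old_export, new_export))
```

Any attribution that was stored against "old-id-2" is orphaned by the second export unless someone re-links it through the catalog number.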
Personally, I wish the USNM had used the USNMENT00870165 number as part of the globally unique EZ-ID (or whatever it would be called then)
because in that case the number attached to the specimen would be included in a GUID/UUID and occurrenceID for GBIF and any regular Google search with that number would have pointed to the actual specimen record.
Systems need to make their own internal IDs for records to work best with their architecture and that’s just fine, but I think the focus for making data available to users (taxonomists, scientists in other
fields, Bionomia, GBIF, …) needs to be the unique specimen identifier and not some occurrenceID, internal ID, or some other fancy way of portraying universal access when nobody can associate that universally unique number with the specimen without having access
to the system on which the data are stored. The Digital Specimen (DiSSCo, https://www.dissco.eu/what-is-dissco/technical-infrastructure/) or Extended Specimen (https://doi.org/10.1093/biosci/biz140) initiatives need to take that into account and as David highlighted,
there might be existing Darwin Core terms available. On the other hand, a unique catalogNumber as outlined above should be the one and only unique identifier to be used for a Digital Specimen or Extended
USNMENT00870165 is the number assigned to a specimen I collected 22 years ago and it will never change. The occurrenceID in EMu will not change but what happens should the USNM decide to change their database
system? Let’s all use a globally unique specimen identifier such as the one above, following the Darwin Core Triplet, to refer to the specimen and as a way to communicate about this particular fly.
p.s.: to those looking at the examples in detail, the catalogNumber for the fly reported at GBIF is missing the leading zeros - I am aware of that and I have pointed this out to our informatics group a
. . . . .
Torsten Dikow, Ph.D.
Research Entomologist for Diptera
NATIONAL MUSEUM OF NATURAL HISTORY (USNM)
When a book is shipped to a library, the library assigns it a call number, puts a sticker on it and puts it on the shelf. The librarian doesn't stand back and say "the manufacturer didn't ship this book with a call number, therefore
I can't assign it one!" They also don't say, "well some of our books are Dewey, and others are Library of Congress, because that’s how they were shipped to us, nothing we can do about it."
It does seem, however, that SCAN does assign a unique number to each of the entries, “recordId”. BUT apparently there is no information available anywhere (no one has directed me to a webpage, or document) that explains what “recordId”
is, where it comes from, or what it's used for. To me those column headers (concepts) are incredibly important, and the fact that there isn't a basic page or document that explains each one is very troubling.
I've been waiting for someone to point out that adding “recordId” to the search page would solve my problem. Amazingly, no one has mentioned or even imagined it!
We’ve poured millions into these databases and suffered through endless talks at ECN and ESA. How many conferences, talks, webinars, etc. have there been over more than a decade? I appreciate that it’s a complicated issue,
but at its base it’s a question of inventory. Each item that’s inventoried had a manufacturer, came from somewhere at some time, is somewhere now, might be moving to somewhere later. Each item is composed of individual parts, might be broken into smaller parts, and
those parts might not all stay in the same place. Wal-Mart, Boeing, Amazon, the US military from the Civil War to today have all had to deal with that. Ford Motor Company, in the space of less than a decade, had to create all that from the ground up, with
I ask basic questions that can’t be answered and have basic needs that can’t be met, and frankly I’m offended. I expected things to be better than this. It feels like the group is suffering from an enormous amount of groupthink
or confirmation bias. How often does an outsider get invited to evaluate the system?
I thank you for your help on this, but I hope you can see, I’m in a bind here. Someone wants to cite specific entries in SCAN in a publication, and it can’t be done in a reasonable fashion. I’m going to have to tell the authors
to download all the data and report “collection”, “catalogNumber” and “recordId” in their paper. The authors then have to tell readers that anyone who wants to find those entries will have to download all the data for the
taxon from SCAN, then search within it and match all three of those fields, because I still don’t have confirmation that “catalogNumber” is unique and stays with the specimen.
The system isn’t much better than when someone just transcribes an entire label and publishes it as text in a document.
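For what it's worth, the workaround a reader would face looks roughly like the Python sketch below: load the downloaded file and keep only the rows where all three published fields match. The column names “collection”, “catalogNumber” and “recordId” are the ones mentioned above; the sample rows, the extra column, and the function name are invented for illustration and do not come from a real SCAN download.

```python
import csv
import io


def find_entries(csv_text, collection, catalog_number, record_id):
    """Return the rows that match all three published fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if row["collection"] == collection
        and row["catalogNumber"] == catalog_number
        and row["recordId"] == record_id
    ]


# Invented sample standing in for a downloaded file.
SAMPLE = """collection,catalogNumber,recordId,scientificName
COLL-A,12345,1001,Example species one
COLL-B,12345,1002,Example species two
COLL-A,67890,1003,Example species one
"""

matches = find_entries(SAMPLE, "COLL-A", "12345", "1001")
print(matches)
```

Note that the invented catalogNumber 12345 appears in two collections, which is exactly why a reader could not rely on that column alone.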
On Fri, Mar 19, 2021 at 7:40 PM Andrew Johnston wrote:
One point about SCAN is that it is intentionally set up as a hybrid aggregator and primary digitization platform. One of the required pieces of information to create a collection within the SCAN portal (and all other Symbiota portals) is the GUID source. So
all of the collections that "live manage" their data typically have a SCAN-generated GUID in the occurrenceID field. The examples you cite are from collections that are registered in SCAN as "snapshot" collections. Each of these profiles was set up to identify
where their GUIDs/occurrenceIDs come from - and they should be supplied by the collection providers just like they would be on an upload to GBIF.
So what you are observing in SCAN is the same issue that David is talking about within GBIF and other aggregators - the aggregated data is only as complete as the data supplied to it. We definitely need better standards across biodiversity data, and this
needs to be a more standard part of training for collections and data managers; it might make a good symposium at a future ECN meeting. (I am particularly curious now about occurrenceID vs materialSampleID, which is a new term for me.) I'd also agree with David that a truly
globally unique occurrenceID should be created by the collection at the first instance of making an occurrence record and that it should be propagated, unchanged, throughout all other instances of that record online.
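As a minimal sketch of "create once, propagate unchanged", the Python below mints an occurrenceID only when a record does not yet have one and otherwise leaves it alone. Using a random (version 4) UUID is an assumption for illustration; neither SCAN nor GBIF is claimed to work this way, and the function name is hypothetical.

```python
import uuid


def ensure_occurrence_id(record):
    """Mint an occurrenceID on first use; never overwrite an existing one."""
    if not record.get("occurrenceID"):
        record["occurrenceID"] = str(uuid.uuid4())
    return record


record = {"catalogNumber": "USNMENT01234567"}
first = ensure_occurrence_id(record)["occurrenceID"]
second = ensure_occurrence_id(record)["occurrenceID"]

# Later exports reuse the stored value instead of generating a new one.
print(first == second)
```

The important design choice is that the value is stored with the record at the source, so every snapshot sent to an aggregator carries the same occurrenceID.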