It's been some time since I subscribed to ENTOMO-L and I just renewed it. Hello again!
I've been reading your thread about databases and collector names with interest. There's a spectacular diversity of solutions in the wild, each playing against a tension between ease of data entry and maddening comprehensiveness.
Here are some of my thoughts on this & I hope some of it proves useful. Christy prefaced with, "Quick question". My answers are anything but. First, some direct comments.
Rachel - You wrote, "The wide date range on that MCZ Lep is an artifact of our data entry. The specimen had no verbatim date, so it got the our default date range in data entry. Unfortunately it's too time-costly to ask data enterers to do this sort of research and date restriction upon initial data entry -- much more efficient later on when you have aggregated events and localities to clean in bulk."
Any chance you can chat with Paul Morris and request that these default event dates not flow out into the wild? That's exactly how your eventDate values appear at GBIF and it causes all kinds of grief for folk trying to compute on them or to use them to help disambiguate agents. Better to share these as empty fields, no?
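For what it's worth, here is a minimal sketch of the kind of export-time filter I have in mind. It is not how MCZ or any real export pipeline works; the function name and the DEFAULT_EVENT_DATE value are hypothetical placeholders, and it assumes the default range is a single known string.

```python
from typing import Optional

# Hypothetical placeholder: substitute whatever range your data entry
# tool actually assigns when a label has no date.
DEFAULT_EVENT_DATE = "1700-01-01/2000-12-31"


def event_date_for_export(event_date: Optional[str],
                          verbatim_date: Optional[str]) -> str:
    """Return the eventDate to publish, or "" if it is only the entry default."""
    if not event_date:
        return ""
    cleaned = event_date.strip()
    has_verbatim = bool(verbatim_date and verbatim_date.strip())
    # Blank the value only when it matches the default AND the label gave no date.
    if cleaned == DEFAULT_EVENT_DATE and not has_verbatim:
        return ""
    return cleaned
```

A record with the default range and no verbatim date would go out with an empty eventDate; anything else passes through untouched.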
Sounds like you'd like a people names / identity gazetteer. Couldn't agree more. Let's make one! We have 1.8M-ish unresolved agent strings in GBIF if we take these at face value (https://bloodhound-tracker.net/agents). This, after much repeat gymnastics in parsing both recordedBy and identifiedBy in the specimen data they aggregate. These can be reconciled to perhaps three quarters of that number if we had a way to easily and universally say this "M. Smith" is not the same as that "M. Smith". The only way to do that is with associated information: taxa of specialty, publications, demographics, and the like. A single collection in isolation from others cannot do this. And yet, it's the regional collection where a good portion of the knowledge about their collectors resides. This is a nice argument for why we have aggregators and networks of collaboration. Nicky Nicolson at Kew has been doing some great work using machine learning to cluster agent strings. I'm not sure how well it translates to Entomology. I tend to take a semi-automated approach.
If we look at raw numbers of agent strings and specimens at scale, we have a typical J-shaped collector curve with specimens collected by folk like Dan Janzen, Bob Anderson, Stuart Peck, Steve Ashe, Sam Droege, Stuart Fullerton, Zach Falin, and a handful of others dominating the far-left, and countless folk in a very long tail of singleton specimens. Resolving people names to an authority (or authorities) in that long tail requires local knowledge. This also gets us very quickly into the world of genealogy. Here is where people like Norm Johnson, Deb Paul, and Neal Evenhuis in our midst are good friends to have around.
Any among us doing resolution of agent names as a form of outreach? We should!
Christy et al.
re: Modelling verbatim agent strings
The first question I ask is, "Why do you want to do this?" Besides a noble pursuit of data integrity – everyone wants clean(er) data – what are the reasons for trying to get a handle on people names (+ collecting parties, organizations, other "agents")? What's to be gained locally, regionally, nationally, internationally when strings of aliases are strung to preferred representations of people names, whatever that means? It must be because we expect or are promised other kinds of metadata and connections, right? I use "connections" loosely to mean both technical connections (i.e. cross-references with dates of birth, publication, death for reasons of data quality), and far more interesting social connections with folk peripheral to our community. The promise of telling stories is a significant part of why we have collections. Knowing "who" is just as necessary to the story as "what" or "where".
Is there any point in modelling agents and their aliases if these are not functionally connected to anything else outside the confines of the cms? And, as a corollary question, is there any point in linking our agents to an external authority if that authority says nothing useful, is static, or worse, cannot be fixed? [More on that re: ORCID below]
Wikidata is such an obvious answer to all this & a lot of us are making use of it to varying degrees. I think of it as a free broker of all possible linkages. Link to it and the world opens up to a big place. Make a stub item and others can (and will!) flesh out the gaps.
Then there's the thought process that maybe what we're doing by modelling people names and their aliases is to properly & thoroughly credit folk for their specimens as products of research. We should be credited for our specimens! If that's part of the motivation in modelling agents in our databases, maybe we do need a way to resolve identities to something other than wikidata. Something already (somewhat) embedded in the academic landscape & that at least has the potential to stitch specimens to downstream publications (taxonomic or otherwise). This is where ORCID may be useful. But as I wrote earlier, ONLY if these ORCID accounts have SOMETHING of use in them to help our future selves disambiguate the people we've forgotten.
If you just made an ORCID, please log in anew and link a pub, add your affiliation, education, ANYTHING that says who you are. Eventually, our ORCID selves will be linked on our wikidata selves.
Tommy's depiction in TaxonWorks illustrates an excellent way to model this. Have verbatim fields and link to something outside the cms because someone other than you will enter similar data. I'd also expect that TaxonWorks would record for example that it was Tommy who made the assertion that an "M. Thayer" on a verbatim det. field is "Margaret Thayer". That builds trust and transparent accountability. A 1:many agent:alias arrangement is fine, but you'll perhaps need a way to store metadata on the edges (eg X is a maiden name of the canonical Y). And then there are those aliases to reconcile! This is a slippery slope. We can probably accomplish most of what we need by flattening this out as a search problem + using wikidata to resolve against an authority when we can, then drawing-in those aliases to enrich search once more.
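To make the 1:many arrangement with metadata on the edges a little more concrete, here is a rough sketch. This is not TaxonWorks' actual schema; the class and field names (Agent, Alias, relation, asserted_by, search_terms) are my own assumptions for illustration, and the external identifier is left empty rather than guessed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Alias:
    verbatim: str                      # string exactly as it appears on a label
    relation: str = "variant"          # edge metadata, e.g. "initials", "maiden name"
    asserted_by: Optional[str] = None  # who decided this alias belongs to the agent


@dataclass
class Agent:
    preferred_name: str
    external_id: Optional[str] = None  # e.g. a Wikidata or ORCID URI, when known
    aliases: List[Alias] = field(default_factory=list)

    def search_terms(self) -> List[str]:
        """Flatten the preferred name and all aliases into case-folded search terms."""
        terms = [self.preferred_name] + [a.verbatim for a in self.aliases]
        return sorted({t.casefold() for t in terms})


thayer = Agent(
    preferred_name="Margaret Thayer",
    aliases=[Alias("M. Thayer", relation="initials", asserted_by="Tommy")],
)
```

Keeping the asserter on each alias is what gives you the accountability, and search_terms() is the "flatten it into a search problem" step.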
As for sharing data, the Darwin Core terms recordedBy and identifiedBy ARE verbatim. As Torsten points out, GBIF has these very new identifiedByID and recordedByID terms in which we are meant to put URIs to externally recognized identifiers (in the informatics sense) but confusingly, they are NOT Darwin Core terms. The standard has not been formally altered to accommodate them. These are reasonably good short-term, band-aid terms that GBIF created and made available in their Integrated Publishing Toolkit, primarily as a means to gauge interest. They are a little wanting when it comes to handling multiple collector & determiner names (we're meant to separate these with vertical bars), collecting parties or expeditions, or more abstract representations & groupings of people, e.g. "Mrs. Smith's Grade 4 class". I'd like to see an alternate, more accurate way to represent "agents" as well as a cleaner separation of agents from the actions they executed.
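Here is a small sketch of what consuming those vertical-bar-separated values might look like. Pairing names with identifiers by position is my assumption, not something the terms guarantee, and the function names and example.org URIs are placeholders.

```python
from typing import List, Optional, Tuple


def split_pipe(value: str) -> List[str]:
    """Split a vertical-bar-delimited field into trimmed parts."""
    if not value.strip():
        return []
    return [part.strip() for part in value.split("|")]


def pair_agents(recorded_by: str,
                recorded_by_id: str) -> List[Tuple[str, Optional[str]]]:
    """Pair each name with an identifier by position, when the counts agree."""
    names = split_pipe(recorded_by)
    ids = split_pipe(recorded_by_id)
    if len(ids) != len(names):
        # Counts disagree, so we cannot tell which ID belongs to whom.
        return [(name, None) for name in names]
    return [(name, ident or None) for name, ident in zip(names, ids)]


pairs = pair_agents(
    "M. Smith | J. Doe",
    "https://example.org/agent/1 | https://example.org/agent/2",
)
```

Notice how fragile this is: one missing identifier and the positions no longer line up, which is part of why these terms feel like a band-aid.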
A group of us in the TDWG community are working on an extension to Darwin Core called "Agent Actions" in which a controlled vocabulary + definitions of action verbs will be part of its remit: collected, identified, georeferenced, measured, etc. There will also be accommodation for ordering of people names per specimen, each with external identifiers. Some of us might make quick use of this extension to Darwin Core, most of us will not in the short term. At the very least, I'm hopeful it will play a part in allowing our stories about the characters in our collections to escape, be computationally useful, & become enriched with stories shared by others well outside our narrow collections community.
I warned you that this wouldn't be quick,
When you are databasing specimens, if the label has a collector's initials, or some other shortened version of their name, but you know who it is, do you transcribe into the database who you are 99% sure the collector is or do you faithfully record the label data exactly as is?
We are using Symbiota/SCAN and I don't see a field where it would be appropriate to make a notation that we made that assumption (like "verbatim collector data").
Is this just me? tbh, I'm finding it difficult to concentrate since this has all gone down!
Happy to get replies just to me so we don't clutter this list.
Invertebrate Collections Manager
Natural History Museum of Utah
301 Wakara Way
Salt Lake City, Utah 84108