It's been some time since I subscribed to ENTOMO-L and I just renewed it.
Hello again!

I've been reading your thread about databases and collector names with
interest. There's a spectacular diversity of solutions in the wild, each
playing against a tension between ease of data entry and maddening

Here are some of my thoughts on this & I hope some of it proves useful.
Christy prefaced with, "Quick question". My answers are anything but.
First, some direct comments.

Rachel - You wrote, "The wide date range on that MCZ Lep is an artifact of
our data entry. The specimen had no verbatim date, so it got our
default date range in data entry. Unfortunately it's too time-costly to ask
data enterers to do this sort of research and date restriction upon initial
data entry -- much more efficient later on when you have aggregated events
and localities to clean in bulk."

Any chance you can chat with Paul Morris and request that these default
event dates not flow out into the wild? That's exactly how your eventDate
values appear at GBIF and it causes all kinds of grief for folk trying to
compute on them or to use them to help disambiguate agents. Better to share
these as empty fields, no?
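
To make that concrete, here is a minimal sketch in Python of the kind of filter I have in mind at export time. The default range below and the function name are placeholders I made up; I don't know what default the MCZ data entry actually uses.

```python
# Hypothetical placeholder: the real data-entry default range is an assumption here.
DEFAULT_EVENT_DATE = "1700-01-01/2100-12-31"


def export_event_date(event_date, verbatim_event_date):
    """Return the eventDate to publish, or "" when it is only the data-entry default."""
    has_verbatim = bool(verbatim_event_date and verbatim_event_date.strip())
    if event_date == DEFAULT_EVENT_DATE and not has_verbatim:
        # Nothing on the label supports this range, so share an empty field instead.
        return ""
    return event_date or ""
```

A record with a real date passes through untouched; only the placeholder range with no verbatim date behind it is blanked.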

Neil -
Sounds like you'd like a people names / identity gazetteer. Couldn't agree
more. Let's make one! We have 1.8M-ish unresolved agent strings in GBIF if
we take these at face value. This,
after much repeat gymnastics in parsing both recordedBy and identifiedBy in
the specimen data they aggregate. These can be reconciled to perhaps three
quarters of that number if we had a way to easily and universally say this
"M. Smith" is not the same as that "M. Smith". The only way to do that is
with associated information: taxa of specialty, publications, demographics,
and the like. A single collection in isolation from others cannot do this.
And yet, it's the regional collection where a good portion of the knowledge
about their collectors resides. This is a nice argument for why we have
aggregators and networks of collaboration. Nicky Nicolson at Kew has been
doing some great work using Machine Learning to cluster agent strings. I'm
not sure how well it translates to Entomology. I tend to take a
semi-automated approach.

If we look at raw numbers of agent strings and specimens at scale, we have
a typical J-shaped collector curve with specimens collected by folk like
Dan Janzen, Bob Anderson, Stuart Peck, Steve Ashe, Sam Droege, Stuart
Fullerton, Zach Falin, and a handful of others dominating the far-left, and
countless folk in a very long tail of singleton specimens. Resolving people
names to an authority (or authorities) in that long tail requires local
knowledge. This also gets us very quickly into the world of genealogy. Here
is where people like Norm Johnson, Deb Paul, and Neal Evenhuis in our midst
are good friends to have around.

Any among us doing resolution of agent names as a form of outreach? We

Christy et al.
re: Modelling verbatim agent strings

The first question I ask is, "Why do you want to do this?" Besides a noble
pursuit of data integrity – everyone wants clean(er) data – what are the
reasons for trying to get a handle on people names (+ collecting parties,
organizations, other "agents")? What's to be gained locally, regionally,
nationally, internationally when strings of aliases are strung to preferred
representations of people names, whatever that means? It must be because
we expect or are promised other kinds of metadata and connections, right? I
use "connections" loosely to mean both technical connections (i.e.
cross-references with dates of birth, publication, death for reasons of
data quality), and far more interesting social connections with folk
peripheral to our community. The promise of telling stories is a
significant part of why we have collections. Knowing "who" is just as
necessary to the story as "what" or "where".

And so...

Is there any point in modelling agents and their aliases if these are not
functionally connected to anything else outside the confines of the cms?
And, as a corollary question, is there any point in linking our agents to
an external authority if that authority says nothing useful, is static, or
worse, cannot be fixed? [More on that re: ORCID below]

Wikidata is such an obvious answer to all this & a lot of us are making use
of it to varying degrees. I think of it as a free broker of all possible
linkages. Link to it and the world opens up to a big place. Make a stub
item and others can (and will!) flesh out the gaps.

Then there's the thought process that maybe what we're doing by modelling
people names and their aliases is to properly & thoroughly credit folk for
their specimens as products of research. We should be credited for our
specimens! If that's part of the motivation in modelling agents in our
databases, maybe we do need a way to resolve identities to something other
than wikidata. Something already (somewhat) embedded in the academic
landscape & that at least has the potential to stitch specimens to
downstream publications (taxonomic or otherwise). This is where ORCID may
be useful. But as I wrote earlier, ONLY if these ORCID accounts have
SOMETHING of use in them to help our future selves disambiguate the
people we've forgotten.

If you just made an ORCID, please log in anew and link a pub, add your
affiliation, education, ANYTHING that says who you are. Eventually, our
ORCID selves will be linked on our wikidata selves.

Tommy's depiction in TaxonWorks illustrates an excellent way to model this.
Have verbatim fields and link to something outside the cms because someone
other than you will enter similar data. I'd also expect that TaxonWorks
would record for example that it was Tommy who made the assertion that an
"M. Thayer" on a verbatim det. field is "Margaret Thayer". That builds
trust and transparent accountability. A 1:many agent:alias arrangement is
fine, but you'll perhaps need a way to store metadata on the edges (e.g. X is
a maiden name of the canonical Y). And then there are those aliases to
reconcile! This is a slippery slope. We can probably accomplish most of
what we need by flattening this out as a search problem + using wikidata to
resolve against an authority when we can, then drawing-in those aliases to
enrich search once more.
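
Here is a rough sketch in Python of that 1:many arrangement with metadata on the edges, then flattened into a search lookup. This is not how TaxonWorks does it; the class and field names are my own invention for illustration, and the Wikidata identifier is left empty rather than guessed.

```python
from dataclasses import dataclass, field


@dataclass
class Alias:
    text: str              # verbatim string as it appears on a label
    relation: str = ""     # metadata on the edge, e.g. "maiden name"
    asserted_by: str = ""  # who claimed this alias belongs to the agent


@dataclass
class Agent:
    canonical: str
    wikidata_id: str = ""  # external identifier; empty until resolved
    aliases: list = field(default_factory=list)


def build_search_index(agents):
    """Flatten canonical names and aliases into a lowercase string -> agents lookup."""
    index = {}
    for agent in agents:
        for text in [agent.canonical] + [alias.text for alias in agent.aliases]:
            # One string can point at several agents: this "M. Smith" vs that one.
            index.setdefault(text.strip().lower(), []).append(agent)
    return index


thayer = Agent("Margaret Thayer", aliases=[Alias("M. Thayer", asserted_by="Tommy")])
index = build_search_index([thayer])
matches = index["m. thayer"]
```

The lookup returns a list on purpose. An ambiguous string stays ambiguous until someone with local knowledge picks among the candidates.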

As for sharing data, the Darwin Core terms recordedBy and identifiedBy ARE
verbatim. As Torsten points out, GBIF has these very new identifiedByID and
recordedByID terms in which we are meant to put URIs to externally
recognized identifiers (in the informatics sense) but confusingly, they are
NOT Darwin Core terms. The standard has not been formally altered to
accommodate them. These are reasonably good short-term, band-aid terms that
GBIF created and made available in their Integrated Publishing Toolkit,
primarily as a means to gauge interest. They are a little wanting when it
comes to handling multiple collector & determiner names - we're meant to
separate these with vertical bars -, collecting parties or expeditions, or
more abstract representations & groupings of people, e.g. "Mrs. Smith's
Grade 4 class". I'd like to see an alternate, more accurate way to
represent "agents" as well as a cleaner separation of agents from the
actions they executed.
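
To show where the vertical bars get awkward, here is a small Python sketch that splits recordedBy and recordedByID and pairs them up by position. Pairing by position is my assumption, not something the terms guarantee, and the function names and example.org URIs are made up for illustration.

```python
def split_pipe(value):
    """Split a vertical-bar separated field into trimmed, non-empty parts."""
    return [part.strip() for part in (value or "").split("|") if part.strip()]


def pair_names_with_ids(recorded_by, recorded_by_id):
    """Pair each name with an identifier by position, or with None when counts differ."""
    names = split_pipe(recorded_by)
    ids = split_pipe(recorded_by_id)
    if len(names) != len(ids):
        # We cannot tell which identifier belongs to which name, so pair none of them.
        return [(name, None) for name in names]
    return list(zip(names, ids))


pairs = pair_names_with_ids(
    "A. Smith | B. Jones",
    "https://example.org/agent/1|https://example.org/agent/2",
)
```

As soon as one collector in a party lacks an identifier, the counts no longer match and the pairing falls apart, which is part of why I'd like a cleaner representation.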

A group of us in the TDWG community are working on an extension to Darwin
Core called "Agent Actions" in which a controlled vocabulary + definitions
of action verbs will be part of its remit: collected, identified,
georeferenced, measured, etc. There will also be accommodation for ordering
of people names per specimen, each with external identifiers. Some of us
might make quick use of this extension to Darwin Core, most of us will not
in the short term. At the very least, I'm hopeful it will play a part in
allowing our stories about the characters in our collections to escape, be
computationally useful, & become enriched with stories shared by others
well outside our narrow collections community.

I warned you that this wouldn't be quick,

David Shorthouse

On Wed, May 6, 2020 at 6:06 PM Christy Bills wrote:
> Quick question:
> When you are databasing specimens, if the label has a collector's
> initials, or some other shortened version of their name, but you know who
> it is, do you transcribe into the database who you are 99% sure the
> collector is or do you faithfully record the label data exactly as is?
> We are using Symbiota/SCAN and I don't see a field where it would be
> appropriate to make a notation that we made that assumption (like "verbatim
> collector data").
> Is this just me?  tbh, I'm finding it difficult to concentrate since this
> has all gone down!
> Happy to get replies just to me so we don't clutter this list.
> Thank you!
> Christy Bills
> Invertebrate Collections Manager
> Natural History Museum of Utah
> 301 Wakara Way
> Salt Lake City, Utah 84108
> pronouns: she/her