Here is the gist of a talk that I gave, with kind assistance and input from Wendy Robertson, at the recent DC-GLUG 2017 (hosted by the Michael Schwartz Library at Cleveland State University). The talk covered a project that four bepress IR coordinators, all of whom were also part of the first cohort of SHARE digital curation associates, worked on together.
What is SHARE?
SHARE stands for SHared Access Research Ecosystem.
“SHARE is a higher education initiative whose mission is to maximize research impact by making research widely accessible, discoverable, and reusable. To fulfill this mission SHARE is developing services to gather and freely share information about research and scholarly activities across their life cycle. Making research and scholarship freely and openly available encourages innovation and increases the diversity of innovators.
Where open metadata about research already exists, its usefulness is limited by poor or inconsistent quality or by difficulty of access. For most individuals or groups to use this data, the cost of accessing, collecting, and improving the data is too great.”
SHARE was initiated and founded by a partnership of the Association of Research Libraries (ARL), the Association of American Universities (AAU), and the Association of Public and Land-grant Universities (APLU). In collaboration with the Center for Open Science, SHARE is building critical infrastructure so that research outputs are discoverable and reusable, and so that these digital assets of research are traceable throughout their life cycle.
The SHARE 2.0 release has “enhanced search capabilities, including filtering by preprint or publication, by subject, by funder, and by institution. SHARE 2.0 gives you the information you need to find new, relevant research and to find potential collaborators.”
As of July 2017, there are 156 providers to the SHARE database. In addition to a large number of institutional repositories, SHARE harvests from other open data sources, including Crossref, BioMed Central, PLOS, and the Dryad data repository.
Having your metadata harvested by SHARE is easy!
It is easy because the SHARE pipeline is format agnostic and normalizes the data for you. For an early example and explanation, see Rick Johnstons’ blog post SHARE Metadata Is Stitching Together the Research Life Cycle. All you need to do is register via an online form and provide the SHARE folks with the base URL of your repository, followed by do/oai. If you prefer, you can also push records directly into the SHARE database by using the API.
SHARE digital curation group
As part of the 2016-2017 SHARE Curation Associates program, we were asked to review our own repository metadata, verify which fields are pushed to the OAI-PMH endpoint, and see how they were captured by the SHARE harvester. Several of us were using the bepress platform and thought it worthwhile to collaborate on our research and findings.
Lisa Palmer, Emily Stenberg, Wendy Robertson and I talked and emailed over several months and prepared a gap analysis of the metadata provided by our institutions and harvested by SHARE. We had three goals in mind: to improve institutional metadata curation processes; to provide good and consistent metadata to SHARE; and to develop recommendations that other Digital Commons institutions could apply, whether they were SHARE contributors or not. The intent was to help create a shared understanding of how our data was harvested. We began by looking at the Digital Commons default Dublin Core mapping for various kinds of bepress collection structures and mapping it to the then-current SHARE schema (the schema is still in beta and continues to evolve).
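To give a flavor of what one small part of a gap analysis can look like in practice, here is a minimal sketch in Python. It is not the method our group used. It simply counts, for one OAI-PMH response in oai_dc format, how many records lack each Dublin Core element you expect to see. The function name field_coverage and the list of expected fields are hypothetical and chosen only for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH envelopes and simple Dublin Core elements.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def field_coverage(response_xml, expected_fields):
    """Return {field: number of records missing that field}.

    response_xml is the text of a ListRecords response in oai_dc format.
    expected_fields is a list of Dublin Core element names (hypothetical
    example: ["title", "creator", "date"]).
    """
    root = ET.fromstring(response_xml)
    records = root.findall(f".//{OAI_NS}record")
    present_counts = Counter()
    for record in records:
        # Collect the Dublin Core element names that carry non-empty text.
        present = {
            el.tag[len(DC_NS):]
            for el in record.iter()
            if el.tag.startswith(DC_NS) and (el.text or "").strip()
        }
        present_counts.update(present & set(expected_fields))
    return {f: len(records) - present_counts[f] for f in expected_fields}
```

Running this against a saved response for each collection type gives a quick view of which fields are consistently absent, which is a reasonable starting point for a closer manual review.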
Our initial findings were presented as a poster by Lisa Palmer at ACRL 2017, Mind the gap: Curating Digital Commons Metadata for SHARE.
Harvesting bepress metadata
Digital Commons exposes metadata fields for harvesting through four different metadata formats, or prefixes, as shown below:
To view your own data, and return a set of associated metadata records exposed through OAI, you can adapt the following sample which will display your IR’s first 100 records in a repository-level request:
http://[Site URL]/do/oai/?verb=ListRecords&metadataPrefix=[Enter oai_dc]
For more instruction on bepress harvesting, see the Digital Commons and OAI-PMH guide at https://www.bepress.com/reference_guide_dc/digital-commons-oai-harvesting/
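The repository-level request above can also be built in a few lines of Python. This is a minimal sketch, not bepress tooling: the helper names list_records_url and next_token and the site URL in the usage note are hypothetical. Because a request returns only the first batch of records, the sketch also reads the resumptionToken that the OAI-PMH protocol uses to fetch the next batch.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(site_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords URL against a site's do/oai endpoint."""
    base = site_url.rstrip("/") + "/do/oai/"
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it is sent instead of metadataPrefix, not alongside it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumptionToken in a ListRecords response, or None."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is not None and token.text and token.text.strip():
        return token.text.strip()
    return None
```

For a hypothetical site at http://ir.example.edu, list_records_url("http://ir.example.edu") produces the same kind of URL as the sample above. Passing the value returned by next_token back in as resumption_token requests the following batch, until next_token returns None.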
Highlights of our Project
As we delved into the data, we noticed some issues and sometimes struggled to understand clearly what various pieces of data meant when taken out of the context of the IR. The issues ranged from a default that simply didn’t fit all collections to more complex questions, such as identifying which of the possible dates an item carried, or how to retain institutional affiliation and disambiguate names. Since preprint, postprint, and version of record may matter to our users, how might we best indicate that in our metadata? Distinguishing between different sorts of dates can be very challenging to apply consistently. And while the OpenURL link takes us to the final published product, how can we better indicate what dc:format and dc:source refer to? Examples such as <dc:format.extent>297</dc:format.extent> could be better described.
The following slides are some examples of issues that we discovered in our first dive into the data:
Some things we discovered en route
Since the SHARE aggregator is still undergoing development, our metadata target has not necessarily been a stable one, and over the past year the schema continued to be reformed and improved. As we have made suggestions and as SHARE has evolved, changes have been made, and they will need to continue to be monitored; for now, these are some of our general recommendations.
- Consult bepress documentation on metadata options and OAI-PMH
- Review how records for different collections are exposed in the various bepress OAI-PMH formats. Are custom fields mapping as expected/desired?
- Create standard metadata using consistent internal field names. Develop an ideal format for each collection type on your demo site and use it as a reference going forward.
- Add a “data dictionary” to your repository at the collection level. Work with your bepress consultant to modify and migrate existing collections using this documentation.
- Share your practices publicly
- Link from your repository to your data dictionary on an external site such as Google Sheets or GitHub, and share it with the Digital Commons user group or Resource Library
- Every OAI source supports oai_dc, but they usually also support at least one other format that has richer, more structured data, like oai_datacite or mods.
- Choose the format that seems to have the most useful data for SHARE, especially if a transformer for that format already exists.
- Choose oai_dc only as a last resort.
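The last three recommendations can be sketched as a small Python helper. It reads a response to the OAI-PMH ListMetadataFormats verb and picks the first prefix it finds from a preference list, leaving oai_dc for last. This is an illustration only: the function name choose_prefix is hypothetical, and the preference order simply uses the example prefixes named above (putting oai_datacite ahead of mods is an assumption, and a given repository may offer different formats altogether).

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Assumed preference order: richer formats first, oai_dc as the last resort.
PREFERENCE = ["oai_datacite", "mods", "oai_dc"]


def choose_prefix(list_formats_xml, preference=PREFERENCE):
    """Pick the most preferred metadataPrefix a repository offers.

    list_formats_xml is the text of a ListMetadataFormats response.
    Returns None if none of the preferred prefixes is offered.
    """
    root = ET.fromstring(list_formats_xml)
    offered = {
        el.text.strip()
        for el in root.iter(f"{OAI_NS}metadataPrefix")
        if el.text and el.text.strip()
    }
    for prefix in preference:
        if prefix in offered:
            return prefix
    return None
```

In practice you would adjust the preference list to the formats your repository exposes and to the formats for which a SHARE transformer already exists.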
Your feedback is welcome!
As our final piece, we have also created best practices for a number of specific metadata fields, and we look to bepress and our IR colleagues to cast a critical eye over them.
We have looked carefully at the Digital Commons standard mapping to Dublin Core, broader Dublin Core practices, and DataCite guidelines (which are preferred by SHARE). We hope the broader Digital Commons community can give feedback on our recommendations: if many of us can agree on the same practices, it will be easier for bepress and SHARE to implement them.
Our best practices document is here and we would welcome comments on the document and discussion on the list, or you can contact any of us directly with comments: Best Practices for Mapping Digital Commons Metadata for Harvesting by SHARE