Love is all around us this week it seems. Coinciding with Valentine’s Day, by chance or otherwise, this is also Love Data Week. So, we thought we’d share how we’ve been loving our data by making it more visible, shareable and re-usable!
This is an area of growing interest across the RDM community. If you, like us, are kept awake at night by questions such as how to identify your institution’s datasets in external repositories, or what the most efficient way is to populate your CRIS with metadata for those datasets, then read on to learn how we’ve been meeting these sorts of challenges.
At the University of Manchester (UoM), the Library’s Research Data Management team has been using Scholix to find UoM researcher data records and make them available in the University’s data catalogue and Researcher Profiles, which are publicly available and serve as a showcase for the University’s research.
We saw here an opportunity not only to increase further the visibility of the University’s research outputs but also to encourage researchers to regard data more seriously as a research output. We also had in mind the FAIR Principles and were keen to support best practice by researchers in making their data more findable.
The headline result is the addition of more than 4,500 data records to the UoM CRIS (Pure), with reciprocal links between associated data and publication records also being created to enrich the University’s scholarly record.
So how did we go about this…
Following the launch in 2017 of the University’s Pure Datasets module, which underpins our institutional data catalogue (Research Explorer) and automatically populates Researcher Profiles, we created services to help researchers record their data in Pure with as little manual effort as possible. (To illustrate, see my companion blog post: Finding Data, Made Simple: Building a Research Data Gateway.) We’re delighted to see these services being well-received and used by our research community!
But what about historical data, we wondered?
We knew most researchers wouldn’t have the time or inclination to record details of all their previous data without a strong incentive and, in any case, we wanted to spare them this effort if at all possible. We decided to investigate just how daunting or not this task might be and made the happy discovery that the Scholix initiative had done lots of the work for us by creating a huge database linking scholarly literature with its associated datasets.
Working with a number of key internal and external partners, we used open APIs to fully or partly automate the process of getting from article metadata to tailored data records (see Figure 1).
Figure 1. Process summary: making research data visible
To generate and process the article metadata from Scopus, we partnered with the Library’s Research Metrics, and Digital Technologies and Services teams. We submitted the article DOIs to Scholix via its open API, which returned metadata (including DOIs) of the associated research data. Then, using the DataCite open API, we part-automated the creation of tailored data records that mirrored the Pure submission template (i.e. the records contained the relevant metadata in the same order). This saved our Content, Collections and Discovery team lots of time when manually inputting the details to Pure, before validating the records to make them visible in Research Explorer and Researcher Profiles.
Partnering with the University’s Directorate of Research and Business Engagement and Elsevier, we followed the same steps to process the records sourced from Pure. Elsevier was also able to prepare tailored data records for bulk upload directly into Pure, which further streamlined the process.
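For readers who would like to try something similar, the Scholix and DataCite steps described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code we ran: the endpoint URLs, the `sourcePid` parameter, the response field names and the `TEMPLATE_FIELDS` ordering are all assumptions that should be checked against each service's current documentation and your own Pure template. The functions work on already-parsed JSON so they can be tried without any network access.

```python
from urllib.parse import urlencode

# Assumed endpoints; check the current API documentation before relying on them.
SCHOLIX_LINKS_URL = "https://api.scholexplorer.openaire.eu/v2/Links"
DATACITE_DOIS_URL = "https://api.datacite.org/dois"

# Hypothetical field order standing in for the Pure submission template.
TEMPLATE_FIELDS = ["title", "creators", "publisher", "year", "doi"]


def scholix_query_url(article_doi):
    """Build a link-lookup URL for one article DOI (parameter name assumed)."""
    return f"{SCHOLIX_LINKS_URL}?{urlencode({'sourcePid': article_doi})}"


def dataset_dois(scholix_response):
    """Collect the DOIs of dataset targets from a Scholix-style response.

    Assumes each link has a 'target' with a 'Type' and a list of
    'Identifier' entries carrying 'ID' and 'IDScheme'.
    """
    dois = set()
    for link in scholix_response.get("result", []):
        target = link.get("target", {})
        if target.get("Type") != "dataset":
            continue  # skip links that point to other publications
        for ident in target.get("Identifier", []):
            # Accession numbers and other schemes are ignored here.
            if ident.get("IDScheme", "").lower() == "doi":
                dois.add(ident["ID"].lower())
    return sorted(dois)


def tailored_record(datacite_response):
    """Arrange DataCite-style attributes in the template's field order."""
    attrs = datacite_response["data"]["attributes"]
    values = {
        "title": attrs["titles"][0]["title"],
        "creators": "; ".join(c["name"] for c in attrs["creators"]),
        "publisher": attrs["publisher"],
        "year": attrs["publicationYear"],
        "doi": attrs["doi"],
    }
    return [(field, values[field]) for field in TEMPLATE_FIELDS]
```

In practice, you would fetch `scholix_query_url(doi)` for each article DOI, pass the parsed JSON to `dataset_dois`, retrieve each dataset's metadata from the DataCite endpoint, and hand the output of `tailored_record` to whoever enters or uploads the records.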
Some challenges and lessons learned…
Manchester researchers like to share, especially if we can make it easy for them! Seeing the amount of data being shared across the institution is bringing us a lot of joy and a real sense of return on investment. The investment in staff time amounts to approximately 16 FTE weeks to upload, validate and link data in Pure, plus some additional time to plan and implement workflows. Cross-team working has been critical in bringing this project towards successful completion, with progress relying on the combined expertise of seven teams. In our view, the results more than justify this investment.
Of course, there are limitations to be addressed and technical challenges to navigate.
Initiatives such as the COPDESS Enabling FAIR Data Project, which are bringing together relevant stakeholders (data communities, publishers, repositories and data ecosystem infrastructure), will help ensure that community-agreed metadata is properly recorded by publishers and repositories, so that it can feed into initiatives like Scholix and make our ‘downstream’ work ever more seamless. Wider adoption of open identifiers would also make our work much easier and faster, in particular identifiers for researchers (ORCID) and research organisations (ROR). As ever, increased interoperability and automation of systems would be a significant step forward.
There are practical considerations as well. For instance, how do we treat data records with many researchers, which are more time-consuming to handle? How do we prepare researchers with lots of datasets for the addition of many records to their Researcher Profiles when there is such variation in norms and preferences across disciplines and individuals? How should we handle data collections? What do we do about repositories such as ArrayExpress that use accession numbers rather than DOIs, given that Scholix can’t identify data from such sources? And since Scholix only finds data which are linked to a research article, how do we find data which are independent assets? If we are really serious about data being an output in their own right, then we need to develop a way of doing this.
So, there’s lots more work to be done and plenty of challenges to keep us busy and interested.
In terms of the current phase of the project, processing is complete for data records associated with UoM papers from Scopus, with Pure records well underway. Researcher engagement is growing, with plenty of best practice in evidence. With REF 2021 in our sights, we’re also delighted to be making a clear contribution towards the research environment indicators for Open Data.
Update: We are openly sharing the code that was created for this project via GitHub so that others can also benefit from our approach.