All posts by Chris Gibson

About Chris Gibson

Research Services Librarian (Data) The University of Manchester Library

Finding Data, Made Simple: Building a Research Data Gateway

Making data more findable is the bedrock of much of research data management, and we aim to make this easy and simple for researchers to do in practice. Ever on the lookout for ways to do just this, we were delighted to spot an opportunity to take our University's data catalogue to the next level.

The data catalogue comprises our CRIS (Pure) Datasets module, which allows researchers to capture details of their datasets, and the public-facing portal (Research Explorer), which allows these datasets to be searched. When the data catalogue was originally set up it could be populated either by automated metadata feeds for datasets deposited in the data repository recommended by The University of Manchester, or by manually inputting metadata for datasets deposited in external data repositories. However, recognising that this manual input duplicates effort, is time-consuming and requires some familiarity with Pure, we began to think about how we could make this process faster and easier for researchers.

Our solution? A Research Data Gateway.

Gateway to data heaven

The Research Data Gateway service allows researchers to input a dataset DOI to an online form, view the associated metadata to confirm its veracity, and then submit the dataset record to the Library, which populates Pure on the researcher's behalf. Wherever possible our Content, Collections and Discovery (CCD) team enriches the record by linking it with related research outputs, such as articles or conference proceedings, and the record displays in both Research Explorer and all relevant Researcher Profiles.
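Behind that form, the metadata-preview step amounts to resolving the DOI against a registry API. The post doesn't show the Gateway's actual implementation, so the sketch below is purely illustrative: it assumes the dataset DOI is registered with DataCite and uses DataCite's public REST API, reducing the response to the handful of fields a researcher would want to confirm.

```python
import json
from urllib.request import urlopen

DATACITE_API = "https://api.datacite.org/dois/"

def build_preview(record: dict) -> dict:
    """Reduce a DataCite JSON:API record to the fields a researcher
    would want to confirm before submitting the dataset to Pure."""
    attrs = record["data"]["attributes"]
    return {
        "doi": attrs["doi"],
        "title": attrs["titles"][0]["title"] if attrs.get("titles") else "",
        "publisher": attrs.get("publisher", ""),
        "year": attrs.get("publicationYear"),
        "creators": [c.get("name", "") for c in attrs.get("creators", [])],
    }

def fetch_preview(doi: str) -> dict:
    """Live lookup (network required): resolve a dataset DOI via DataCite."""
    with urlopen(DATACITE_API + doi) as resp:
        return build_preview(json.load(resp))
```

A form handler would call something like `fetch_preview`, display the result for the researcher to confirm, and only then queue the record for the Library to process.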

The screen capture below illustrates how the Research Data Gateway works in practice from the researcher’s perspective up to the point of submission, a process that usually takes about 15 seconds!

Figure 1: Animated screen capture of Research Data Gateway


In addition to delivering a service that reduces researchers’ workload, the Research Data Gateway increases the discoverability and visibility of externally deposited datasets together with their associated publications. In turn, this increases the likelihood that these research outputs will be found, re-used and cited. Moreover, since most funders and an increasing number of journals require the data that underlies papers to be shared, the Gateway helps researchers reap the maximum reward from this requirement.

The nuts and bolts

As you can see from the above, this is a very straightforward process from the researcher's perspective but, of course, behind the scenes there's a little more going on.

The basic workflow looks like this:

Figure 2: Research Data Gateway workflow

Once validated, the new dataset record automatically displays in both Research Explorer and the relevant Researcher Profile(s).

As with most successful initiatives, making the Research Data Gateway happen was a truly collaborative effort involving a partnership across the Library’s Research Services (RS), Digital Technologies and Services (DTS) and Content, Collections and Discovery (CCD) teams, and the University’s Pure Support team. And this collaboration continues now in the ongoing management of the service. All Gateway-related processes have been documented and we’ve used a RACI matrix to agree which teams would be Responsible, Accountable, Consulted and Informed for any issues or enquiries that might arise.

Some technical challenges and work-arounds

As might be expected, we encountered a number of small but challenging issues along the way:

  • Datasets may be associated with tens or even hundreds of contributors, which can make these records time-consuming to validate; high-energy physics datasets, for instance, were a particular problem. For efficiency, our solution is to record the individual contributors from this University and then add the name of the collaboration group.
  • Datasets may include mathematical symbols that don't display correctly when records are created in Pure. Our solution has been to use MathJax, an open-source JavaScript display engine that renders mathematics correctly in modern browsers.
  • Multiple requests for a single dataset record are sometimes submitted to Pure, especially if a record has multiple contributors. To resolve this, approvals by the CCD team include a check for duplicates, and the service informs the relevant researchers before rationalising any duplicates to a single record.
  • A limitation of the Gateway is that it doesn’t accommodate datasets without a DOI. So further work is needed to accommodate repositories, such as GenBank, that assign other types of unique and persistent identifiers.
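On the duplicates point, much of the battle is simply normalising the many ways researchers write the same DOI before comparing submissions against existing records. A minimal sketch of that idea (our own illustration, not the Gateway's actual code):

```python
def normalize_doi(raw: str) -> str:
    """Reduce the common ways a DOI is written (resolver URLs, 'doi:'
    prefix, mixed case) to one canonical form, so repeat submissions
    of the same dataset are easy to spot. DOIs are case-insensitive,
    so lowercasing is safe."""
    doi = raw.strip().lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi
```

A duplicate check then becomes a straightforward lookup of the normalised DOI against the DOIs already recorded in Pure.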

Some reflections

Feedback on the Gateway has been consistently positive from researchers and research support staff; its purpose and simple effectiveness have been well-received and warmly appreciated.

However, getting researchers engaged takes time, persistence and the right angle from a communications perspective. It's clear that researchers may not perceive a strong incentive to record datasets they've already shared elsewhere. Many are time-poor and might reasonably question the benefit of also generating an institutional record. Effective promotion therefore continues to be key to generating interest and engagement with the new Gateway service.

We're framing our promotional message around how researchers can efficiently raise the profile of their research outputs using a suite of services, including our Research Data Gateway, our Open Access Gateway, the Pure/ORCID integration and automated reporting to Researchfish on their behalf. This is a joined-up message explaining how the Library will help them raise their profile in return for, quite literally, a few seconds of their time.

We’re also tracking and targeting researchers who manually create dataset records in Pure to flag how the Research Data Gateway can save them significant time and effort.

In addition, to further reinforce the benefits of creating an institutional record, we ran a complementary, follow-up project using Scholix to find and record externally deposited datasets without the need for any researcher input. Seeing these dataset records surface in their Researcher Profiles, together with links to related research outputs is a useful means of generating interest and incentivising engagement.

To learn how we did this, see my companion blog post: From Couch to Almost 5K: Raising Research Data Visibility at The University of Manchester.

These two approaches have now combined to deliver more than 5,000 data catalogue records and growing, with significant interlinking with the wider scholarly record. As noted, both routes have their limitations and so we remain on the lookout for creative ways to progress this work further, fill any gaps and make data ever more findable.

From Couch to Almost 5K: Raising Research Data Visibility at The University of Manchester

Love is all around us this week it seems. Coinciding with Valentine’s Day, by chance or otherwise, this is also Love Data Week. So, we thought we’d share how we’ve been loving our data by making it more visible, shareable and re-usable!

This is an area of growing interest across the RDM community. If you, like us, are kept awake at night by questions such as "How do you identify your institution's datasets in external repositories?" or "What's the most efficient way to populate your CRIS with metadata for those datasets?", then read on to learn how we've been meeting these sorts of challenges.

At the University of Manchester (UoM), the Library’s Research Data Management team has been using Scholix to find UoM researcher data records and make them available in the University’s data catalogue and Researcher Profiles, which are publicly available and serve as a showcase for the University’s research.

We saw here an opportunity not only to further increase the visibility of the University's research outputs but also to encourage researchers to regard data more seriously as a research output. We also had in mind the FAIR Principles and were keen to support best practice by researchers in making their data more findable.

The headline result is the addition of more than 4,500 data records to the UoM CRIS (Pure), with reciprocal links between associated data and publication records also being created to enrich the University’s scholarly record.

So how did we go about this…

Following the launch in 2017 of the University’s Pure Datasets module, which underpins our institutional data catalogue (Research Explorer) and automatically populates Researcher Profiles, we created services to help researchers record their data in Pure with as little manual effort as possible. We’re delighted to see these services being well-received and used by our research community!

But what about historical data, we wondered?

We knew most researchers wouldn't have the time or inclination to record details of all their previous data without a strong incentive and, in any case, we wanted to spare them this effort if at all possible. We decided to investigate just how daunting (or not) this task might be, and made the happy discovery that the Scholix initiative had done much of the work for us by creating a huge database linking scholarly literature with associated datasets.

Working with a number of key internal and external partners, we used open APIs to automate or part-automate the process of getting from article metadata to tailored data records (see Figure 1).

Figure 1. Process summary: making research data visible


To generate and process the article metadata from Scopus, we partnered with the Library's Research Metrics, and Digital Technologies and Services teams. We submitted the article DOIs to Scholix via its open API, which returned metadata (including DOIs) of the associated research data. Then, using the DataCite open API, we part-automated the creation of tailored data records that mirrored the Pure submission template (i.e. the records contained the relevant metadata in the same order). This saved our Content, Collections and Discovery team lots of time when manually inputting the details to Pure, before validating the records to make them visible in Research Explorer and Researcher Profiles.
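The article-to-dataset step of this pipeline can be sketched as below. This is our own hypothetical illustration, not the team's actual scripts: the endpoint URL and the capitalised key names (`target`, `Type`, `Identifier`, `IDScheme`) are assumptions based on the public Scholix v2 link schema served by OpenAIRE's ScholeXplorer, and should be verified against the live API.

```python
# Assumed ScholeXplorer endpoint; article DOIs are passed as the
# 'sourcePid' query parameter to retrieve their Scholix links.
SCHOLIX_API = "https://api.scholexplorer.openaire.eu/v2/Links"

def dataset_dois(scholix_page: dict) -> set:
    """Pull the DOIs of linked datasets out of one page of Scholix links.

    Keeps only link targets typed as 'dataset' whose identifier
    scheme is 'doi'; other identifier schemes (e.g. accession
    numbers) are skipped.
    """
    dois = set()
    for link in scholix_page.get("result", []):
        target = link.get("target", {})
        if str(target.get("Type", "")).lower() != "dataset":
            continue
        for ident in target.get("Identifier", []):
            if str(ident.get("IDScheme", "")).lower() == "doi":
                dois.add(ident["ID"])
    return dois
```

The dataset DOIs collected this way would then feed the DataCite lookup that builds the tailored records for Pure.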

Partnering with the University’s Directorate of Research and Business Engagement and Elsevier, we followed the same steps to process the records sourced from Pure. Elsevier was also able to prepare tailored data records for bulk upload directly into Pure which further streamlined the process.

Some challenges and lessons learned…

Manchester researchers like to share, especially if we can make it easy for them! Seeing the amount of data being shared across the institution is bringing us a lot of joy and a real sense of return on investment. The staff time invested amounts to approximately 16 FTE weeks to upload, validate and link data in Pure, plus some additional time to plan and implement workflows. Cross-team working has been critical in bringing this project towards successful completion, with progress relying on the combined expertise of seven teams. In our view, the results more than justify this investment.

Of course, there are limitations to be addressed and technical challenges to navigate.

Initiatives such as the COPDESS Enabling FAIR Data Project, which are bringing together relevant stakeholders (data communities, publishers, repositories and data ecosystem infrastructure), will help ensure that community-agreed metadata is properly recorded by publishers and repositories, so that it can feed into initiatives like Scholix and make our 'downstream' work ever more seamless. Widespread adoption of open identifiers would also make our work much easier and faster, in particular identifiers for researchers (ORCID) and research organisations (ROR). As ever, increased interoperability and automation of systems would be a significant step forward.

There are practical considerations as well. For instance, how do we treat data records with many researchers, which are more time-consuming to handle? How do we prepare researchers with lots of datasets for the addition of many records to their Researcher Profiles, when there is such variation in norms and preferences across disciplines and individuals? How should we handle data collections? What do we do about repositories, such as ArrayExpress, that use accession numbers rather than DOIs, given that Scholix can't identify data from such sources? And since Scholix only finds data linked to a research article, how do we find data which are independent assets? If we are really serious about data being an output in its own right, then we need to develop a way of doing this.

So, there’s lots more work to be done and plenty of challenges to keep us busy and interested.

In terms of the current phase of the project, processing is complete for data records associated with UoM papers from Scopus, with Pure records well underway. Researcher engagement is growing, with plenty of best practice in evidence. With REF 2021 in our sights, we’re also delighted to be making a clear contribution towards the research environment indicators for Open Data.


Are you on track with the EPSRC policy framework on research data?

If you're not already aware, the EPSRC requirements around the management of and access to EPSRC-funded research data are mandatory from 1 May 2015.

If your research is funded by the EPSRC, we’ve summarised the key points to help you comply with the EPSRC policy framework on research data. Read our guidance to find out what you need to do.

If you want to know more about managing your research data, please contact our Research Data Management team.