All posts by Chris Gibson

About Chris Gibson

Research Services Librarian (Data) The University of Manchester Library

How data services can support a FAIR data culture: insights from IDCC 2020

This year I was delighted to attend and present a poster at IDCC 2020, which put together a truly thought-provoking line-up of speakers and topics, as well as a number of wonderful opportunities to sample some of Dublin’s cultural attractions. Even more than the delights of the “fair city”, I was especially interested in one important theme of the conference which explored supporting a FAIR data culture. Inspired by the many valuable contributions, this post outlines some of the key insights presented on this topic.

An excellent hook around which to frame this review is, I think, offered by the figure below capturing results from the FAIRsFAIR open consultation on Policy and Practice, which was one focus of Joy Davidson’s illuminating overview of developments in this area. The top factor influencing researchers to make data FAIR, when we take both positive points on the scale together, is the level of support provided.

[Figure: results from the FAIRsFAIR open consultation on Policy and Practice]
Source: FAIRsFAIR report

So, let’s take a closer look at some of the key developments and opportunities for data services to enhance support for FAIR culture, bearing in mind of course that, when it comes to shaping service developments, local solutions must be informed by local contexts taking into account factors such as research strategy, available resources and service demand.

Enhancing the FAIR Support Infrastructure

That making data FAIR is an endeavour shared by researchers and data services was neatly illustrated by Sarah Jones. Her conclusion that equal, if not more, responsibility lies with data services gives cause to reflect on where and how we may need to raise our capabilities.

Let’s look here at three areas of opportunity for developing our support mechanisms around data stewardship, institutional repositories, and training.

Professionalising Data Stewardship

In 2016, Barend Mons predicted that 500,000 data stewards would need to be trained in Europe over the following decade to ensure effective research data management. Given this sort of estimate, it’s clear that our ability to build and scale data stewardship capability will be critical if we agree that data stewardship and data management skills are key enablers for research. Two particularly interesting developments in this area were presented.

Mijke Jetten outlined one project that examined the data steward function in terms of tasks and responsibilities, and the competencies required to deliver on these. The objective is a common job description, which then offers a foundation from which to develop a customised training and development pathway – informed of course by FAIR data principles, since alignment with FAIR is seen as a central tenet of good data stewardship. Although the project focused on life sciences in the Netherlands, its insights are highly transferable to other research domains.

Equally transferable is the pilot project highlighted by the Conference’s “best paper” from Virginia Tech, which described an innovative approach to addressing the challenge of effectively resourcing support across the data lifecycle in the context of ever-growing demand for services. Driven by the University Libraries, the DataBridge programme trains and mentors students in key data science skills to work across disciplines on real-world research data challenges. This approach not only provides valuable and scalable support for the research process, but serves also to develop data champions of the future, skilled and invested in FAIR data principles.

Leveraging Institutional Data Repositories

As a key part of the research data infrastructure, it’s clear that institutional data repositories (IRs) have an important role to play in promoting FAIR. Of course, researcher engagement and expertise are crucial to this end – as we rely on them to create good metadata and documentation that will facilitate discovery and re-use of their data.

In terms of fostering engagement, inspiring trust in an IR would seem to be an important fundamental, and formal certification is one way to build researchers’ confidence that their data will be well cared for in the longer term by their repository. Ilona von Stein outlined one such certification framework, the CoreTrustSeal, which seems particularly useful since there’s a strong overlap between its requirements and FAIR principles. In terms of enhancing a repository’s reputation, one important post-Conference development worth noting is the recent publication of the TRUST Principles for digital repositories, which offer a framework for guiding best practice and demonstrating IR trustworthiness.

Ilona also pointed to ongoing developments in terms of tools to support pre- and post-deposit assessment of data FAIRness. SATIFYD, for example, is an online questionnaire that helps researchers evaluate, at pre-deposit stage, how FAIR their dataset is and offers tips to make it more so. Developed by DANS, a prototype of this manual self-assessment tool is currently available with plans in the offing to enable customisation for local contexts and training. One to watch out for too is the development of a post-publication, automated evaluation tool to assess datasets for their level of FAIRness over time and create a scoring system to indicate how a given dataset performs against a set of endorsed FAIR metrics.

Another fundamental to think about is how skilled our researchers may or may not be when it comes to metadata creation as well as their level of tolerance for this task. Joao Castro made the point that researchers typically regard spending more than 20 minutes on this activity as time-consuming.

This observation came out of a project at the University of Porto to engage researchers in the RDM process and underlines the need to think creatively about how we, as data professionals, can enhance the support we offer. Joao described how the provision of a consultancy-type service had been explored to support researchers in using domain-specific metadata to describe their data. Underpinned by DENDRO, an open-source collaborative RDM platform, this service was well received by researchers across a range of disciplines and served to develop their knowledge / skills in metadata production, as well as raising FAIR awareness more broadly.

Maximising Training Impact

Of course, beyond raising awareness it’s clear that the upskilling of researchers through curriculum development and training is an essential step on the road to FAIR – a key question, however, is how do we make the most of our training efforts?

Daniel Bangert helpfully summarised findings from a landscape analysis of FAIR in higher education institutions and recommended focusing FAIR training initiatives on early career researchers (ECRs). This would seem to be a particularly powerful approach for effecting ‘ground up’ culture change, since ECRs are typically involved in operational aspects of research and will become the influential researchers of tomorrow.

This same report suggests that training and communication regarding FAIR should be couched within the wider framework of research integrity and open research. Framing data management training initiatives in this way provides important context and pre-empts the risk that it will be seen purely as a compliance issue.

As an interesting aside, an extensive research integrity landscape study, commissioned by UK Research and Innovation and published post-Conference, identified ‘open data management’ as the overall most popular topic for future training – a useful channel perhaps then through which to deliver and extend reach in the UK context at least.

Both Daniel and Elizabeth Newbold highlighted the need to draw on and share best practices and existing materials, where available. Subsequent workshop discussions strongly agreed with this sentiment but noted the challenges in finding and/or repurposing existing FAIR training, guidance and resources e.g. for a specific audience or level of knowledge. Indeed, it would seem sensible that FAIR principles should be applied to FAIR training materials!

In this regard, a helpful starting point might perhaps be this recent PLOS article – Ten simple rules for making training materials FAIR. Going forward, the development of a FAIR Competence Centre, with a key focus on supporting training delivery, will be one to look out for.

Poster presentation at IDCC 2020. (Photo: Rosie Higman)

In Conclusion

So, hopefully plenty of food for thought and ideas for practical next steps here to adapt for your local context, wherever you are on the road to FAIR. While the challenges to creating a FAIR data culture are many, broad and complex, we can take heart not only from the many examples of sterling work underway, but also from the highly collaborative spirit across the data services community. In the context of increasing demands on tight resources, this will serve us well as we drive the FAIR agenda.

Finding Data, Made Simple: Building a Research Data Gateway

Making data more findable is the bedrock of much of research data management and we aim to make this easy and simple for researchers to do in practice. Ever on the lookout to do just this, we were delighted to spot an opportunity to take our University’s data catalogue to the next level.

The data catalogue comprises our CRIS (Pure) Datasets module, which allows researchers to capture details of their datasets, and the public facing portal (Research Explorer), which allows these datasets to be searched. When the data catalogue was originally set up it could be populated either by automated metadata feeds for datasets deposited in the data repository recommended by The University of Manchester, or by manually inputting metadata for datasets deposited in external data repositories. However, recognising that this manual input duplicates effort, is time consuming and requires some familiarity with Pure, we began to think about how we could make this process faster and easier for researchers.

Our solution? A Research Data Gateway.

Gateway to data heaven

The Research Data Gateway service allows researchers to enter a dataset DOI into an online form, view the associated metadata to confirm that it is correct, and then submit the dataset record to the Library, which populates Pure on the researcher’s behalf. Wherever possible our Content, Collections and Discovery (CCD) team enriches the record by linking with related research outputs, such as articles or conference proceedings, and this record displays in both Research Explorer and all relevant Researcher Profiles.
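As a rough illustration of the DOI-entry step, here is a minimal Python sketch of how a form might tidy and sanity-check a DOI before looking up its metadata. The function names are hypothetical, the pattern is only a loose shape check, and the use of the DataCite REST API for the lookup is an assumption; this is not a description of how the Gateway itself is implemented.

```python
import re
from urllib.parse import quote

# Loose shape check only: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that researchers commonly paste along with the DOI itself.
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(raw: str) -> str:
    """Strip whitespace and a leading resolver prefix, then lowercase."""
    doi = raw.strip()
    for prefix in PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip().lower()


def is_plausible_doi(doi: str) -> bool:
    """Return True if the string has the general shape of a DOI."""
    return bool(DOI_PATTERN.match(doi))


def metadata_lookup_url(doi: str) -> str:
    """Build a lookup URL (assumption: DataCite's public REST API)."""
    return "https://api.datacite.org/dois/" + quote(doi, safe="/")
```

A form handler could call `normalise_doi` on the submitted value, reject anything that fails `is_plausible_doi`, and only then request the metadata to show back to the researcher for confirmation.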

The screen capture below illustrates how the Research Data Gateway works in practice from the researcher’s perspective up to the point of submission, a process that usually takes about 15 seconds!

Figure 1: Animated screen capture of Research Data Gateway

In addition to delivering a service that reduces researchers’ workload, the Research Data Gateway increases the discoverability and visibility of externally deposited datasets together with their associated publications. In turn, this increases the likelihood that these research outputs will be found, re-used and cited. Moreover, since most funders and an increasing number of journals require the data that underlies papers to be shared, the Gateway helps researchers reap the maximum reward from this requirement.

The nuts and bolts

As you can see above, this is a very straightforward process from the researcher’s perspective but, of course, behind the scenes there’s a little more going on.

The basic workflow looks like this:

[Figure: Research Data Gateway workflow diagram]

Once validated, the new dataset record automatically displays in both Research Explorer and the relevant Researcher Profile(s).

As with most successful initiatives, making the Research Data Gateway happen was a truly collaborative effort involving a partnership across the Library’s Research Services (RS), Digital Technologies and Services (DTS) and Content, Collections and Discovery (CCD) teams, and the University’s Pure Support team. And this collaboration continues now in the ongoing management of the service. All Gateway-related processes have been documented and we’ve used a RACI matrix to agree which teams would be Responsible, Accountable, Consulted and Informed for any issues or enquiries that might arise.

Some technical challenges and work-arounds

As might be expected, we encountered a number of small but challenging issues along the way:

  • Datasets may be associated with tens or even hundreds of contributors which can make these records time-consuming to validate. This was a particular problem for high energy physics datasets for instance. For efficiency, our solution is to record individual contributors from this University, and then add the name of the collaboration group.
  • Datasets may include mathematical symbols that don’t display correctly when records are created in Pure. Our solution has been to use MathJax, an open-source JavaScript display engine that renders well in many modern browsers.
  • Multiple requests for a single dataset record are sometimes submitted to Pure especially if a record has multiple contributors. To resolve this, approvals by the CCD team include a check for duplicates, and the service informs relevant researchers before rationalising any duplicates to a single record.
  • A limitation of the Gateway is that it doesn’t accommodate datasets without a DOI. So further work is needed to accommodate repositories, such as GenBank, that assign other types of unique and persistent identifiers.
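Two of the work-arounds above, the duplicate check and the shortened contributor list, can be sketched in a few lines of Python. This is an illustrative sketch only: the record shapes (`doi`, `submitted_by`, `name`, `affiliation`) and function names are hypothetical and are not taken from Pure or from the Gateway’s own code.

```python
from collections import defaultdict


def group_duplicate_requests(requests):
    """Group requests by DOI (case-insensitive); keep DOIs requested more than once."""
    by_doi = defaultdict(list)
    for req in requests:
        by_doi[req["doi"].strip().lower()].append(req["submitted_by"])
    return {doi: people for doi, people in by_doi.items() if len(people) > 1}


def summarise_contributors(contributors, home_affiliation, collaboration=None):
    """List contributors from the home institution, then the collaboration name."""
    home = home_affiliation.lower()
    names = [
        c["name"]
        for c in contributors
        if home in c.get("affiliation", "").lower()
    ]
    if collaboration:
        names.append(collaboration)
    return names
```

The first function returns, for each duplicated DOI, the people who would need to be informed before the records are rationalised; the second mirrors the approach of recording local contributors individually and everyone else under the collaboration group.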

Some reflections

Feedback on the Gateway has been consistently positive from researchers and research support staff; its purpose and simple effectiveness have been well-received and warmly appreciated.

However, getting researchers engaged takes time, persistence and the right angle from a communications perspective. It’s clear that researchers may not perceive a strong incentive to record datasets they’ve already shared elsewhere. Many are time-poor and might reasonably question the benefit of also generating an institutional record. Therefore, effective promotion continues to be key in terms of generating interest and engagement with the new Gateway service.

We’re framing our promotional message around how researchers can efficiently raise the profile of their research outputs using a suite of services, including our Research Data Gateway, our Open Access Gateway and the Pure/ORCID integration, and how they can benefit from automated reporting on their behalf to Researchfish. This promotes a joined-up message explaining how the Library will help them raise their profile in return for – literally – a few seconds of their time.

We’re also tracking and targeting researchers who manually create dataset records in Pure to flag how the Research Data Gateway can save them significant time and effort.

In addition, to further reinforce the benefits of creating an institutional record, we ran a complementary, follow-up project using Scholix to find and record externally deposited datasets without the need for any researcher input. Seeing these dataset records surface in their Researcher Profiles, together with links to related research outputs is a useful means of generating interest and incentivising engagement.

To learn how we did this see my companion blog post: From Couch to Almost 5K: Raising Research Data Visibility at The University of Manchester.

These two approaches have now combined to deliver more than 5,000 data catalogue records and growing, with significant interlinking with the wider scholarly record. As noted, both routes have their limitations and so we remain on the lookout for creative ways to progress this work further, fill any gaps and make data ever more findable.

From Couch to Almost 5K: Raising Research Data Visibility at The University of Manchester

Love is all around us this week it seems. Coinciding with Valentine’s Day, by chance or otherwise, this is also Love Data Week. So, we thought we’d share how we’ve been loving our data by making it more visible, shareable and re-usable!

This is an area of growing interest across the RDM community and if you, like us, are kept awake at night by questions such as how do you identify your institution’s datasets in external repositories or what’s the most efficient way to populate your CRIS with metadata for those datasets, then read on to learn how we’ve been meeting these sorts of challenges.

At the University of Manchester (UoM), the Library’s Research Data Management team has been using Scholix to find UoM researcher data records and make them available in the University’s data catalogue and Researcher Profiles, which are publicly available and serve as a showcase for the University’s research.

We saw here an opportunity not only to increase further the visibility of the University’s research outputs but also to encourage researchers to regard data more seriously as a research output. We also had in mind the FAIR Principles and were keen to support best practice by researchers in making their data more findable.

The headline result is the addition of more than 4,500 data records to the UoM CRIS (Pure), with reciprocal links between associated data and publication records also being created to enrich the University’s scholarly record.

So how did we go about this…

Following the launch in 2017 of the University’s Pure Datasets module, which underpins our institutional data catalogue (Research Explorer) and automatically populates Researcher Profiles, we created services to help researchers record their data in Pure with as little manual effort as possible. (To illustrate, see my companion blog post: Finding Data, Made Simple: Building a Research Data Gateway.) We’re delighted to see these services being well-received and used by our research community!

But what about historical data, we wondered?

We knew most researchers wouldn’t have the time or inclination to record details of all their previous data without a strong incentive and, in any case, we wanted to spare them this effort if at all possible. We decided to investigate just how daunting or not this task might be and made the happy discovery that the Scholix initiative had done lots of the work for us by creating a huge database linking scholarly literature with its associated datasets.

Working with a number of key internal and external partners, we used open APIs to automate / part-automate the process of getting from article metadata to tailored data records (see Figure 1).

Figure 1. Process summary: making research data visible

To generate and process the article metadata from Scopus we partnered with the Library’s Research Metrics, and Digital Technologies and Services teams. We submitted the article DOIs to Scholix via its open API which returned metadata (including DOIs) of the associated research data. Then using the DataCite open API we part-automated the creation of tailored data records that mirrored the Pure submission template (i.e. the records contained the relevant metadata in the same order). This saved our Content, Collections and Discovery team lots of time when manually inputting the details to Pure, before validating the records to make them visible in Research Explorer and Researcher Profiles.
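To give a flavour of the “tailored data record” step, the Python sketch below reorders dataset metadata into a fixed field sequence. It assumes the metadata arrives as a dictionary shaped like the `attributes` object of a DataCite REST API response (`titles`, `creators`, `publisher`, `publicationYear`, `doi`); the field order in `TEMPLATE_FIELDS` is a hypothetical stand-in, not the actual Pure submission template, and the function name is illustrative.

```python
# Hypothetical field order standing in for the Pure submission template.
TEMPLATE_FIELDS = [
    "title",
    "creators",
    "publisher",
    "publication_year",
    "doi",
    "related_article_doi",
]


def to_tailored_record(datacite_attributes, article_doi):
    """Return (field, value) pairs in template order for manual input."""
    titles = datacite_attributes.get("titles") or []
    creators = datacite_attributes.get("creators") or []
    publisher = datacite_attributes.get("publisher") or ""
    # Assumption: publisher may be a plain string or an object with a "name" key.
    if isinstance(publisher, dict):
        publisher = publisher.get("name", "")
    values = {
        "title": titles[0].get("title", "") if titles else "",
        "creators": "; ".join(c.get("name", "") for c in creators),
        "publisher": publisher,
        "publication_year": datacite_attributes.get("publicationYear", ""),
        "doi": datacite_attributes.get("doi", ""),
        "related_article_doi": article_doi,
    }
    return [(field, values[field]) for field in TEMPLATE_FIELDS]
```

Keeping the output in the same order as the input form is the point of the exercise: whoever enters the record can work down the list without hunting for each value.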

Partnering with the University’s Directorate of Research and Business Engagement and Elsevier, we followed the same steps to process the records sourced from Pure. Elsevier was also able to prepare tailored data records for bulk upload directly into Pure which further streamlined the process.

Some challenges and lessons learned…

Manchester researchers like to share, especially if we can make it easy for them! Seeing the amount of data being shared across the institution is bringing us a lot of joy and a real sense of return on investment. In terms of staff time, the investment amounts to approximately 16 FTE weeks to upload, validate and link data in Pure, plus some additional time to plan and implement workflows. Cross-team working has been critical in bringing this project towards successful completion, with progress relying on the combined expertise of seven teams. In our view, the results more than justify this investment.

Of course, there are limitations to be addressed and technical challenges to navigate.

Initiatives such as the COPDESS Enabling FAIR Data Project, which are bringing together relevant stakeholders (data communities, publishers, repositories and data ecosystem infrastructure), will help ensure that community-agreed metadata is properly recorded by publishers and repositories, so that it can feed into initiatives like Scholix and make our ‘downstream’ work ever more seamless. Widespread engagement for use of open identifiers would also make our work much easier and faster, in particular identifiers for researchers (ORCID) and research organisations (ROR). As ever, increased interoperability and automation of systems would be a significant step forward.

There are practical considerations as well. For instance, how do we treat data records with many researchers, which are more time-consuming to handle? How do we prepare researchers with lots of datasets for the addition of many records to their Researcher Profiles when there is such variation in norms and preferences across disciplines and individuals? How should we handle data collections? What do we do about repositories such as ArrayExpress that use accession numbers rather than DOIs, given that Scholix can’t identify data from such sources? And since Scholix only finds data which are linked to a research article, how do we find data which are independent assets? If we are really serious about data being an output in their own right then we need to develop a way of doing this.

So, there’s lots more work to be done and plenty of challenges to keep us busy and interested.

In terms of the current phase of the project, processing is complete for data records associated with UoM papers from Scopus, with Pure records well underway. Researcher engagement is growing, with plenty of best practice in evidence. With REF 2021 in our sights, we’re also delighted to be making a clear contribution towards the research environment indicators for Open Data.

Update: We are openly sharing code that was created for this project via GitHub so that others can also benefit from our approach.

Are you on track with the EPSRC policy framework on research data?

If you’re not already aware, the EPSRC requirements around the management of and access to EPSRC-funded research data are mandatory from 1 May 2015.

If your research is funded by the EPSRC, we’ve summarised the key points to help you comply with the EPSRC policy framework on research data. Read our guidance to find out what you need to do.

If you want to know more about managing your research data, please contact our Research Data Management team.