Report published 12 December 2022.
Future Research Assessment Programme launch event, "Machine learning, metrics & merit: the future of research assessment", 12 December 2022. Watch a recording of the launch event.
The Responsible Use of Technology-Assisted Research Assessment project is part of the Future Research Assessment Programme, which "aims to explore possible approaches to the assessment of UK higher education research performance" and is led by all four UK higher education funding bodies. This page will link to all relevant documents when published.
Main report on the potential uses of AI in future Research Excellence Frameworks (REFs) in the UK
This investigates whether there is a role for AI to support future REFs. **Download the main report here: Can REF output quality scores be assigned by AI? Experimental evidence.**
Talk summarising the study.
List of recommendations from the main report
- Peer review is at the heart of the REF, and AI systems cannot yet replace human judgements. They can currently only exploit shallow attributes of articles to guess their quality and are not capable of assessing any meaningful aspects of originality, robustness or significance. They are not accurate enough to replace expert scores; they would encourage conservative behaviour, such as targeting high-impact journals; and they would encourage gaming, such as gift authorship or citation cartels. AI predictions should therefore not replace peer review scores or reduce the number of peer reviewers within a sub-panel.
- Strategy 4 only: The AI predictions are not ready to replace the current use of bibliometrics (partly because they are not available at the start). Instead, sub-panels should be given the option to consult the AI predictions once at least half of the outputs have been reliably scored. The predictions and probabilities can be used as the UoA decides, such as for mopping up final disagreements that the bibliometrics do not help with, providing a second opinion on difficult interdisciplinary outputs that have not been successfully cross-referred, and serving as a sanity check for anomalies. Alternatively, sub-panel members may choose to examine the AI predictions to evaluate their potential for their UoA rather than to support decisions about outputs.
- Strategy 4 only: The AI system should predict scores for REF journal articles and make the predictions and the prediction probabilities available to the sub-panels that opt to receive them near the end of the assessment period. They should complement but not replace the bibliometrics, which should continue to be made available throughout the assessment period. The overall prediction accuracy for the UoA should also be presented for context.
- AI models should be built separately for each UoA for all except the two most recent years as a combined set. The models should make predictions on all articles from all UoAs (except the two most recent years) in case other UoAs want to use the predictions on interdisciplinary research submitted to them.
- Strategy 4 only: AI predictions should be hidden from evaluators until (i) at least 50% of the articles have been evaluated, (ii) enough norm-referencing of scores within a sub-panel has occurred that the sub-panel scores are close to their eventual level of accuracy, and (iii) the sub-panel is towards the end of the assessment period and dealing with remaining difficult cases. For Strategy 5, the outputs should only be used for pilot testing.
- The inputs to the system are as specified in this document (the maximum set). Strategy 4 only: Sub-panels should be given the option of considering the corresponding author to be the most important position rather than the first author (e.g., in Chemistry the corresponding author probably designed the proposal and funded the work).
- New AI models should be trained from these inputs with provisional scores from the next REF, rather than using AI models built from provisional REF2021 scores, because a 60% increase in 4* journal articles between REF2014 and REF2021, combined with changes in the journal publishing system, means that the REF2021 AI systems may not predict reliably for the next REF.
- Strategy 4 only: In the longer term, Strategy 4 may replace the bibliometrics and perform a similar role, helping to resolve difficult cases. Panel members should be asked for their attitudes towards this after seeing the AI predictions for their UoA and completing the next REF.
- Although the main results in this report are based on evaluating 50% of the outputs, the system should use all available scores because this will improve accuracy.
- Strategy 4 only: The importance of ignoring JIFs and journals during evaluations should be continued and emphasised, with the journal component of the AI system inputs being allowed as the sole exception, explicitly explained as making a very minor contribution to REF decision making (secondary even to bibliometrics). This would ideally encourage assessors who favour JIFs to ignore them completely, leaving them to the AI. Similarly, as for bibliometrics currently, the importance of directly evaluating the quality of articles by reading them with disciplinary expertise should continue to be emphasised.
- The AI system is complex and should be seamlessly built into the REF computer system at least a year in advance so that it can be tested and does not cause delays. The Python code for building the AI models is available to help.
- The tender process for the bibliometric information supplier should include a requirement to calculate and display the AI predictions from the bibliometric and other information, as specified above. Consider also adding the requirement for an effective article level classification scheme to the bibliometric tender to help panel chairs allocate outputs to reviewers. [This recommendation is also in the literature review.]
- During the next REF, UKRI should make plans to save sub-panel assigned disciplinary classifications and fine-grained scores on the extended scale because these will be valuable additional inputs for future AI experiments.
- During the next REF, institutions should be encouraged to self-archive versions of their articles that are suitable for text mining to support future, more powerful AI. Ideally, they would be in standard XML format (e.g., https://jats.nlm.nih.gov/publishing/) but in practical terms a plain text version or a watermark-free PDF would be helpful. [This recommendation is also in the literature review.]
- A deeper understanding of how experts make peer review judgements in different fields is needed together with innovative ideas for developing AI to exploit this understanding. Future research is needed to address these challenges. Possible sources of this include pilot studies with volunteers on REF-like tasks, where participants explain their decisions, and existing feedback from institutional mock REF exercises. The first would need careful framing if run by UKRI to avoid pressurising participants and the second would need GDPR clearance, such as through agreement by output authors and reviewers individually. This exercise could also be watchful for decision-making criteria that may be biased in terms of equality, diversity and inclusion (EDI).
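The self-archiving recommendation above favours standard JATS XML over PDF because structured markup makes the full text straightforward to mine. As a minimal sketch of why, the following Python standard-library code pulls the title, abstract and body text out of a JATS-style document; the tag names (`article-title`, `abstract`, `body`) follow the JATS structure, but the sample article itself is invented for illustration.

```python
# Minimal sketch: extracting mineable plain text from a JATS-style XML
# article using only the Python standard library. The sample document
# is invented; real JATS files carry much richer metadata.
import xml.etree.ElementTree as ET

SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <title-group><article-title>A Study of X</article-title></title-group>
      <abstract><p>We examine X and find Y.</p></abstract>
    </article-meta>
  </front>
  <body>
    <sec><title>Introduction</title><p>X matters because Z.</p></sec>
  </body>
</article>"""

def jats_to_text(xml_string: str) -> str:
    """Concatenate the title, abstract and body text of a JATS article."""
    root = ET.fromstring(xml_string)
    parts = []
    for xpath in (".//article-title", ".//abstract", ".//body"):
        node = root.find(xpath)
        if node is not None:
            # itertext() walks nested elements, so inline markup
            # (italics, links) inside paragraphs flattens to plain text.
            parts.append(" ".join("".join(node.itertext()).split()))
    return "\n".join(parts)

print(jats_to_text(SAMPLE_JATS))
```

No equivalent one-step extraction exists for a typical PDF, which is why plain text or watermark-free PDF is only the fallback in the recommendation.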
Literature review on AI in research assessment
Literature review: Reviews research related to the possible AI automation of various REF tasks and makes a separate list of recommendations for the future REF in terms of tasks that could be partly automated.
List of recommendations from the literature review.
- Implement a system to recommend sub-panel members to review outputs. This would likely be based on the ORCIDs of sub-panel members matching their Scopus/Web of Science/Dimensions/etc. profiles, then using text mining to assess the similarity of their outputs with each sub-panel output to be assessed. The text mining might use article titles, abstracts, field classifications and references.
- Build for the long-term implementation of quality control systems for academic articles by recommending that preprints of outputs for the next REF are saved in a format suitable for text mining. Ideally, this would be a markup format, such as XML, rather than PDF. This will also help longer-term AI systems for predicting REF journal article scores with article full-text processing. At the end of the next REF, a future technology programme could then investigate the potential of full-text mining for quality control purposes (e.g., checking statistics, plagiarism checks).
- Build for the long-term exploitation of open peer review by, at the end of the next REF, calling for a review of current progress in the use of AI to exploit open peer review to assess article quality. Whilst open peer review should not be used as an input because it can be too easily exploited, investigations into its properties might shed useful light on aspects of quality identified by reviewers. Research into this is likely to occur over the next few years, and a review of it near the next REF might provide useful insights for both future AI and future human peer review guidelines for sub-panel members.
- In the next REF, collate information on inter-reviewer agreement rates within sub-panels for outputs scored before cross-checking between reviewers. Use this to assess the human level agreement rates (for all output types) to use as a benchmark for score prediction AI systems.
- In the tender for bibliometrics and AI for the next REF (if used), mention the importance of accurate classification for bibliometric indicators, including for the percentile system currently used.
- Warn sub-panel members of the potential for small amounts of bias in the bibliometric data and AI (if used) and continue with the anti-bias warnings/training employed in REF2021.
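The reviewer-recommendation idea in the first point above can be illustrated with a toy version of its core step: scoring the textual similarity between a sub-panel member's publication record and a submitted output. The sketch below uses bag-of-words cosine similarity from the Python standard library; the reviewer names and texts are invented, and a real system would draw reviewer profiles from Scopus/Web of Science/Dimensions via ORCID and use richer features such as references and field classifications.

```python
# Illustrative sketch of reviewer-output matching by text similarity.
# Profiles and texts are invented; this is not the system specified
# in the recommendation, only the similarity step at its heart.
import math
import re
from collections import Counter

def tokens(text: str) -> Counter:
    """Lowercased bag-of-words representation of a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) \
        * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_reviewers(output_text, reviewer_profiles):
    """Return (reviewer, score) pairs sorted by similarity to the output."""
    target = tokens(output_text)
    scores = {name: cosine(target, tokens(profile))
              for name, profile in reviewer_profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

profiles = {
    "Reviewer A": "citation analysis bibliometrics research evaluation metrics",
    "Reviewer B": "protein folding molecular dynamics simulation",
}
print(rank_reviewers("bibliometric indicators for research evaluation", profiles))
```

Here "Reviewer A" ranks first because their profile shares vocabulary with the output; in practice the same ranking would be computed over titles, abstracts and references for every output assigned to the sub-panel.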
A series of additional reports give findings related to research quality of journal articles, covering related aspects such as bibliometrics, journal impact factors, collaboration, funding, altmetrics.
- Statistical analysis: factors able to predict REF scores: Compares the relative strengths of the initially proposed inputs for machine learning. This was used to help select the inputs for the machine learning experiments in the main report.
- Predicting article quality scores with machine learning: This summarises the machine learning findings of the main report above but sets the results in a wider research context.
- Do bibliometrics introduce gender, institutional or interdisciplinary biases into research evaluations? Shows that bibliometrics may introduce biases against high quality departments when used as indicators of research quality. Implications: REF sub-panels using bibliometrics should be warned about the slight bibliometric bias against high scoring departments.
- Is big team research fair in national research assessments? The case of the UK Research Excellence Framework 2021. Shows that highly collaborative articles probably do not skew REF results, except possibly in a few UoAs. Implications: Supports maintaining the status quo in terms of allowing collaborative articles to count at full value for each author. Published as: Thelwall, M., Kousha, K., Makita, M., Abdoli, M., Stuart, E., Wilson, P. & Levitt, J. (2023). Is big team research fair in national research assessments? The case of the UK Research Excellence Framework 2021. Journal of Data and Information Science, 8(1), 9-20. https://doi.org/10.2478/jdis-2023-0004
- In which fields are citations indicators of research quality? Identifies the UoAs in which citations can reasonably be used as quality indicators and shows that there is no UoA with a citation threshold for 4* research. Implications: Supports continued use of bibliometrics and gives evidence to UoAs about the extent to which citations agree with quality scores in their area.
- In which fields is journal impact an indicator of article quality? Identifies the UoAs in which journal citation rates can reasonably be used as quality indicators and shows that the journal alone never determines the quality of an article. Implications: Gives evidence that can be used to support the UKRI DORA commitment and to convince sub-panel members that JIFs are never substitutes for reading articles.
- Does the perceived quality of interdisciplinary research vary between fields? Identifies a partial hierarchy of quality judgement strictness for interdisciplinary research submitted to multiple UoAs. Implications: Raises an issue for discussion about the extent to which quality standards are equivalent between UoAs.
- Can qualitative research be world-leading? Terms in article titles, abstracts, and keywords associating with high or low quality. Identifies writing styles, methods and topics within UoAs that associate with higher and lower quality research as well as likely reasons for some of them. Implications: Raises an issue for discussion about whether some types of research are appropriately valued in REF scoring. Published as: Thelwall, M., Kousha, K., Abdoli, M., Stuart, E., Makita, M., Wilson, P. & Levitt, J. (in press). Terms in journal articles associating with high quality: Can qualitative research be world-leading? Journal of Documentation. https://doi.org/10.1108/JD-12-2022-0261 A previous article mentioned that qualitative research might have become more conservative to fit with journal requirements, which would align with some of our findings.
- Are co-authored articles higher quality in all fields? A science-wide analysis. Identifies when collaboration gives added value to REF articles. Implications: Information to UKRI research funders about the fields in which collaboration might need to be supported or encouraged more than others. Published as: Thelwall, M., Kousha, K., Abdoli, M., Stuart, E., Makita, M., Wilson, P. & Levitt, J. (in press). Why are co-authored academic articles more cited: Higher quality or larger audience? Journal of the Association for Information Science and Technology.
- Are internationally co-authored journal articles higher quality? The UK case 2014-2020. Identifies the UoAs and countries for which international collaboration associates with a quality advantage. Implications: Information to UKRI research funders about the fields in which international collaboration might need to be supported or encouraged more than others.
- Do altmetric scores reflect article quality? Assesses the extent to which altmetrics from Altmetric.com can reflect research quality, as scored by REF panel members, showing that Tweet counts are stronger research quality indicators than previously thought. Mendeley reader counts are the strongest altmetric indicator of research quality, although slightly weaker than citation counts. Also shows that field normalised citation metrics can be worse than raw citation counts as research quality indicators for individual research fields. Implications: Altmetrics can be used with more confidence than before as early indicators of research quality, although their power varies substantially between fields. Published as: Thelwall, M., Kousha, K., Abdoli, M., Stuart, E., Makita, M., Wilson, P. & Levitt, J. (in press). Do altmetric scores reflect article quality? Evidence from the UK Research Excellence Framework 2021. Journal of the Association for Information Science and Technology.
- Is research funding always beneficial? A cross-disciplinary analysis of UK research 2014-20. Gives evidence that research declaring funding sources tends to attract higher REF scores in all UoAs. Implications: Current sector-wide incentives to seek funding for research do not appear to be detrimental to research overall in any discipline.
In the press
Nature: AI system not yet ready to help peer reviewers assess research quality
UKRI: Evaluation reports steer away from ‘automated’ UK research assessment
THE: REF robot reviewers ‘not yet ready’ to replace human judgement
ResearchProfessional News: AI not so smart at speeding up REF, experts conclude
THE: Funders mull robot reviewers for Research Excellence Framework
Nature: Should AI have a role in assessing research quality?
LSE Impact Blog (self-written) blog post: Can artificial intelligence assess the quality of academic journal articles in the next REF?