Phrase indexing and the identification of related academic research content

24 June 2020, Version 2
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

Work to automate the identification of related articles in corpora of academic research content is described. Pairs of related articles are recognised on the basis of the phrases they contain, using a similarity measure that emphasises the importance of phrase overlap. Phrases are weighted according to their significance, evaluated in terms of statistical under- or over-representation relative to corpus-level frequency, and the significance scores of n-grams with higher n values are boosted. The measure proves broadly effective at identifying meaningfully related pairs of content items and may provide a useful basis for the development of ‘see also’-type functionality.
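The general idea of a phrase-overlap measure with weighted, length-boosted n-grams can be illustrated with a minimal sketch. This is not the TSW-OM measure itself, whose definition is not given in this abstract. It is a plain weighted Jaccard coefficient, and everything in it is an assumption made for illustration: the function names, the `boost` parameter, the cap of n = 3, and the use of an inverse-document-frequency term as a stand-in for the paper's statistical significance score.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=3):
    """Return all contiguous n-grams (as tuples) for n = 1..max_n."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def phrase_weights(docs, max_n=3, boost=2.0):
    """Index the phrases in each document and assign each phrase a weight.

    ASSUMPTION: the weight is an IDF-style placeholder, not the paper's
    significance score. Longer n-grams are boosted by boost ** (n - 1).
    """
    doc_phrases = [set(ngrams(tokens, max_n)) for tokens in docs]
    # Number of documents ("host files") in which each phrase occurs.
    host_counts = Counter(p for phrases in doc_phrases for p in phrases)
    n_docs = len(docs)
    weights = {
        phrase: math.log(1 + n_docs / count) * boost ** (len(phrase) - 1)
        for phrase, count in host_counts.items()
    }
    return doc_phrases, weights


def weighted_jaccard(a, b, weights):
    """Weighted Jaccard: weight of shared phrases over weight of all phrases."""
    union_weight = sum(weights[p] for p in a | b)
    if union_weight == 0:
        return 0.0
    return sum(weights[p] for p in a & b) / union_weight


# Hypothetical three-document corpus, already tokenised.
docs = [
    "phrase indexing of research articles".split(),
    "phrase indexing of academic content".split(),
    "protein folding in yeast cells".split(),
]
doc_phrases, weights = phrase_weights(docs)
print(weighted_jaccard(doc_phrases[0], doc_phrases[1], weights))
print(weighted_jaccard(doc_phrases[0], doc_phrases[2], weights))
```

In this toy corpus the first two documents share the trigram "phrase indexing of" and its sub-phrases, so they score higher than the unrelated pair, which shares nothing. Boosting longer n-grams means a shared multi-word phrase counts for more than the same words shared individually.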

Keywords

phrase indexing
document similarity
Jaccard coefficient
document relatedness
TSW-OM similarity measure

Supplementary materials

Title: Appendices
Description: Appendix A: Phrase intersections for selected document pairs; Appendix B: Stop words and stop phrases

Title: File relationships data for test corpus
Description: Excel spreadsheet of file relationships

Title: Phrases indexed
Description: Excel spreadsheet of phrases indexed, with corpus frequencies and host file counts
