The study is presented in two companion papers, each providing a different perspective on the analysis.
The first paper describes the corpus and presents an overall analysis of the number of papers, authors, gender distributions, co-authorship, collaboration patterns and citation patterns.
The Research Topic on “NLP-enhanced Bibliometrics” aims to promote interdisciplinary research in bibliometrics, Natural Language Processing (NLP) and computational linguistics in order to enhance the ways bibliometrics can benefit from large-scale text analytics and sense mining of papers.
The objectives of such research are to provide insights into scientific writing and bring new perspectives to the understanding of both the nature of citations and the nature of scientific papers and their internal structures.
More than 36,000 papers in environmental sciences, retrieved from the ISTEX database, were processed to observe trends in the GEM score over an 80-year period.
The results show that abstracts tend to be more generous in recent publications and that there appears to be no correlation between the GEM score and the citation rate of the papers.
The second paper investigates the research topics and their evolution over time, the key innovative topics and the authors who introduced them, and the reuse of papers and plagiarism.
Together, the two papers provide a survey of the literature in NLP and SLP, along with the data needed to understand the trends and evolution of research in this community.
Open-source text-processing tools that have recently been applied to such tasks include NLTK, MALLET, OpenNLP, CoreNLP, GATE, CiteSpace, and AllenNLP, among others.
Many datasets are now freely available to the community, e.g., PubMed OA, CiteSeerX, JSTOR, ISTEX, Microsoft Academic Graph, and the ACL Anthology.