Accelerating scientific discovery using domain adaptive language modelling (PhD Thesis)

Thesis on QUB Pure Portal
Thesis in PDF Format

Author: Dimitrios Christofidellis

Research has been conducted for numerous centuries but recent advances in technology have facilitated and accelerated the process keeping the research budget and the required effort at manageable levels. Scientific and technical corpora, such as papers and patents, are great written sources of already existing research knowledge and information. The abundance of such documents, in addition to their exponential growth, set them as a unique source of knowledge offering a great opportunity to push the research boundaries even further. Yet, this information’s volume and growth rate are so large that it is unmanageable for researchers to study all of it. Realizing the potential of incorporating this knowledge efficiently into the discovery process and that the recent advances in the NLP domain provide us with a powerful methodological base, our work aims to establish methods that can speed up parts of the discovery process relying on scientific and technical corpora. We focus on but do not limit our work on patent corpora as methods to leverage such documents in discovery pipelines are limited so far. Our contributions focus on three specific cases: the domain definition of a given corpus in the form of a metagraph; the domain definition of a given corpus in the form of keywords, focusing on the patent classification case; and the semi-automated reporting of a discovery artifact in the form of a patent. In all three cases, we rely on transformer-based Language Models and adhere to domain adaptive techniques to achieve our goals by providing efficient methods in terms of both performance and needed training/inference requirements. Concluding our work, we discuss the importance of our contributions. We demonstrate how our proposed methods can be incorporated into discovery pipelines, combined, and complement existing methods. We conclude with a discussion of promising future directions derived from our work.