A Supervised Learning Approach for Imbalanced Text Classification of Biomedical Literature Triage


Almeida, Hayda (2015) A Supervised Learning Approach for Imbalanced Text Classification of Biomedical Literature Triage. Masters thesis, Concordia University.

Text (application/pdf)
ALMEIDA_MCompSc_S2015.pdf - Accepted Version
Available under License Spectrum Terms of Access.


This thesis presents the development of mycoSORT, a machine learning system that supports triage, the first step of the manual curation process for biological literature. Manual triage is very demanding: researchers face the time-consuming and error-prone task of screening a large amount of data to identify relevant information. After querying scientific databases for keywords related to a specific subject, researchers generally obtain a long list of results that must be carefully analysed to identify the few documents with the potential to be relevant to the topic. This analysis is a severe bottleneck in the knowledge discovery and decision-making processes of scientific research, so biocurators could benefit greatly from automatic support when performing triage.

To support the triage of scientific documents, we used a corpus of document instances manually labeled by biocurators as “selected” or “rejected” with regard to their potential to indicate relevant information about fungal enzymes. This document collection is large, since many results are retrieved and analysed to identify potential candidate documents, and highly imbalanced in the distribution of instances per relevance class: the great majority of documents are labeled as rejected, while only a very small portion are labeled as selected. Using this dataset, we studied the design of a classification model that identifies the most discriminative features for automating the triage of scientific literature and tackles the imbalance between the two classes of documents.

To identify the most suitable model, we performed a study of 324 classification models, covering 9 data undersampling factors, 4 feature sets, 2 feature selection methods, and 3 machine learning algorithms. Our results demonstrated that undersampling is effective for handling imbalanced datasets and also helps manage large document collections. We also found that combining undersampling with feature selection using Odds Ratio can improve the performance of our classification model. Finally, our results demonstrated that the best-fitting model for supporting the triage of scientific documents combines domain-relevant features filtered by Odds Ratio scores, dataset undersampling, and the Logistic Model Trees algorithm.
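
The two preprocessing ideas named in the abstract, undersampling of the majority class and Odds Ratio feature scoring, can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names, the `keep_fraction` parameter (one possible reading of an "undersampling factor"), the 0.5 smoothing constant, and the toy documents are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def undersample(docs, labels, majority_label, keep_fraction, seed=0):
    """Randomly keep only `keep_fraction` of the majority-class documents.

    Assumption: `keep_fraction` is a hypothetical stand-in for an
    undersampling factor; the thesis may define its factors differently.
    """
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    n_keep = round(len(majority_idx) * keep_fraction)
    kept = set(rng.sample(majority_idx, n_keep))
    keep = [i for i, y in enumerate(labels) if y != majority_label or i in kept]
    return [docs[i] for i in keep], [labels[i] for i in keep]


def odds_ratio_scores(docs, labels, positive_label):
    """Score each term by the log odds ratio of its document frequency.

    Assumption: probabilities are smoothed with a constant of 0.5 so that
    terms seen in only one class do not cause a division by zero.
    """
    n_pos = sum(1 for y in labels if y == positive_label)
    n_neg = len(labels) - n_pos
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count each term once per document (document frequency).
        (df_pos if y == positive_label else df_neg).update(set(tokens))
    scores = {}
    for term in set(df_pos) | set(df_neg):
        p_pos = (df_pos[term] + 0.5) / (n_pos + 1.0)
        p_neg = (df_neg[term] + 0.5) / (n_neg + 1.0)
        scores[term] = math.log((p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg))
    return scores


# Toy corpus (invented): 2 "selected" and 8 "rejected" tokenized documents.
docs = [["fungal", "enzyme"], ["enzyme", "assay"]] + [["clinical", "trial"]] * 8
labels = ["selected"] * 2 + ["rejected"] * 8

sampled_docs, sampled_labels = undersample(docs, labels, "rejected", 0.5)
scores = odds_ratio_scores(sampled_docs, sampled_labels, "selected")
top_terms = sorted(scores, key=scores.get, reverse=True)
```

Terms with a high positive score occur mostly in "selected" documents, so keeping only the top-ranked terms is one way to filter a feature set before training a classifier.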

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Almeida, Hayda
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science
Date: April 2015
Thesis Supervisor(s): Kosseim, Leila and Meurs, Marie-Jean
Keywords: text classification, imbalance learning, machine learning, data sampling, undersampling, triage, literature curation, biomedical literature, bioinformatics, fungal genomics
ID Code: 979834
Deposited On: 13 Jul 2015 15:44
Last Modified: 18 Jan 2018 17:50
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
