Unsupervised Taxonomy Learning /

Mahmoud Mostafa Hosny

Unsupervised Taxonomy Learning / Mahmoud Mostafa Hosny - 2014 - 57 p. ill. 21 cm.

Supervisor: Mahmoud Allam

Thesis (M.A.)--Nile University, Egypt, 2014.

"Includes bibliographical references"

Contents:
Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Introduction
    Motivation
    Contribution
    The proposed unsupervised taxonomy learning approach
    Structure of the thesis
Background and Related work
    Text Mining
        Data collection
            Language Identification
        Data preprocessing
            Tokenization
            Normalization of data
            Stop words removal
        Data representation
        Text Mining
            Text classification
    Unsupervised taxonomy learning
    Wikipedia
    Wikipedia knowledge inclusion in text mining
Unsupervised Taxonomy Learning
    Document preparation
    Keyphrase extraction
    Category extraction
        Querying Wikipedia
        Category parsing
        Category refining
    Taxonomy building
Case Study
    Unsupervised taxonomy scheme generation from Arabic dataset
    Wikipedia Dataset
        Local Wikipedia article dataset
        Building local Wikipedia category tree
    Implementation
        The Document preparation module
        The Keyphrase extraction module
        The Category extraction module
        The Taxonomy building module
    Experimental results
Conclusion and future work
    Future work
        Results improvements
        Conformation improvements
References
Appendix: Sample category list and associated keyphrases

Abstract:
Organizing textual information effectively is one of the great challenges in intelligent text processing, and it is becoming more essential as the amount of continuously generated data grows. A key technique for organizing information is automated text classification. Automated methods have achieved classification accuracy comparable to that of human classifiers, which makes text classification an attractive technique for information organization. However, automated text classification techniques depend on predefined classification schemes and training datasets to accomplish their goals. Building classification schemes and taxonomies manually is prone to bias and error, and it is extremely costly and time consuming, especially with large amounts of data. This justifies the need for methodologies and approaches that automate the process. In this thesis we present an unsupervised computer-aided tool that automatically builds classification schemes and taxonomies to enhance automated text classification. The tool utilizes the Wikipedia knowledge base and its categorization system. The tool was validated on a subset of a large language dataset obtained from the Google Moderator series (Egypt 2.0) idea bank. Its output was evaluated by comparing the results obtained automatically from the tool with those set manually by three different human evaluators. The comparison verified the tool's effectiveness, with a precision of 88.6% and a recall of 81.2%.


Text in English, abstracts in English.


Software Engineering


Dissertation, Academic

627