000 07861nam a22002537a 4500
008 210111b2014 a|||f mb|| 00| 0 eng d
040 _aEG-CaNU
_cEG-CaNU
041 0 _aeng
_beng
082 _a627
100 0 _aMahmoud Mostafa Hosny
245 1 _aUnsupervised Taxonomy Learning /
_cMahmoud Mostafa Hosny
260 _c2014
300 _a57 p.
_bill.
_c21 cm.
500 _3Supervisor: Mahmoud Allam
502 _aThesis (M.A.)--Nile University, Egypt, 2014.
504 _aIncludes bibliographical references.
505 0 _aContents: Dedication -- Acknowledgments -- List of Tables -- List of Figures -- Abstract -- Introduction -- Motivation -- Contribution -- The proposed unsupervised taxonomy learning approach -- Structure of the thesis -- Background and Related work -- Text Mining -- Data collection -- Language Identification -- Data preprocessing -- Tokenization -- Normalization of data -- Stop words removal -- Data representation -- Text Mining -- Text classification -- Unsupervised taxonomy learning -- Wikipedia -- Wikipedia knowledge inclusion in text mining -- Unsupervised Taxonomy Learning -- Document preparation -- Keyphrase extraction -- Category extraction -- Querying Wikipedia -- Category parsing -- Category refining -- Taxonomy building -- Case Study -- Unsupervised taxonomy scheme generation from Arabic dataset -- Wikipedia Dataset -- Local Wikipedia article dataset -- Building local Wikipedia category tree -- Implementation -- The Document preparation module -- The Keyphrase extraction module -- The Category extraction module -- The Taxonomy building module -- Experimental results -- Conclusion and future work -- Future work -- Results improvements -- Conformation improvements -- References -- Appendix: Sample category list and associated keyphrases.
520 3 _aAbstract: Organizing textual information effectively is one of the great challenges in intelligent text processing, and it is becoming more essential as the amount of continuously generated data increases. A key technique in the organization of information is automated text classification. Automated methods have achieved classification accuracy comparable to that of human classifiers, making text classification an attractive technique for information organization. However, automated text classification techniques depend on predefined classification schemes and training datasets in order to accomplish their goals correctly. Manually building classification schemes and taxonomies is prone to bias and error, and is extremely costly and time-consuming, especially with large amounts of data. This justifies the need for methodologies and approaches that automate the process. In this thesis we present an unsupervised computer-aided tool that automatically builds classification schemes and taxonomies to enhance automated text classification. The tool utilizes the Wikipedia knowledge base and its categorization system. The tool was validated on a subset of a large language dataset obtained from the Google Moderator series (Egypt 2.0) idea bank. Its output was evaluated by comparing the results obtained automatically from the tool with those set manually by three different human evaluators. The tool achieved a precision of 88.6% and a recall of 81.2%.
546 _aText in English, abstracts in English.
650 4 _aSoftware Engineering
_9211
655 7 _2NULIB
_aDissertation, Academic
_9187
690 _aSoftware Engineering
_9211
942 _2ddc
_cTH
999 _c8778
_d8778