MARC View

000			07861nam a22002537a 4500
008			210111b2014 a|||f mb|| 00| 0 eng d
040			_aEG-CaNU _cEG-CaNU
041	0		_aeng _beng
082			_a627
100	0		_aMahmoud Mostafa Hosny
245	1		_aUnsupervised Taxonomy Learning / _cMahmoud Mostafa Hosny
260			_c2014
300			_a57 p. _bill. _c21 cm.
500			_3Supervisor: Mahmoud Allam
502			_aThesis (M.A.)--Nile University, Egypt, 2014.
504			_a"Includes bibliographical references"
505	0		_aContents: Dedication -- Acknowledgments -- List of Tables -- List of Figures -- Abstract -- Introduction -- Motivation -- Contribution -- The proposed unsupervised taxonomy learning approach -- Structure of the thesis -- Background and Related work -- Text Mining -- Data collection -- Language Identification -- Data preprocessing -- Tokenization -- Normalization of data -- Stop words removal -- Data representation -- Text Mining -- Text classification -- Unsupervised taxonomy learning -- Wikipedia -- Wikipedia knowledge inclusion in text mining -- Unsupervised Taxonomy Learning -- Document preparation -- Keyphrase extraction -- Category extraction -- Querying Wikipedia -- Category parsing -- Category refining -- Taxonomy building -- Case Study -- Unsupervised taxonomy scheme generation from Arabic dataset -- Wikipedia Dataset -- Local Wikipedia article dataset -- Building local Wikipedia category tree -- Implementation -- The Document preparation module -- The Keyphrase extraction module -- The Category extraction module -- The Taxonomy building module -- Experimental results -- Conclusion and future work -- Future work -- Results improvements -- Conformation improvements -- References -- Appendix: Sample category list and associated keyphrases.
520	3		_aAbstract: Organizing textual information effectively is one of the great challenges in intelligent text processing, and it is becoming more important as the amount of continuously generated data grows. A key technique for organizing information is automated text classification. Automated methods have achieved classification accuracy comparable to that of human classifiers, which makes text classification an attractive technique for information organization. However, automated text classification techniques depend on predefined classification schemes and training datasets to accomplish their goals. Building classification schemes and taxonomies manually is prone to bias and error, and is extremely costly and time-consuming, especially with large amounts of data. This justifies the need for methodologies and approaches that automate the process. In this thesis we present an unsupervised computer-aided tool that automatically builds classification schemes and taxonomies to enhance automated text classification. The tool uses the Wikipedia knowledge base and its categorization system. The tool was validated on a subset of a large language dataset obtained from the Google Moderator series (Egypt 2.0) idea bank. Its output was evaluated by comparing the results obtained automatically from the tool with those set manually by three different human evaluators. The tool achieved a precision of 88.6% and a recall of 81.2%.
546			_aText in English, abstracts in English.
650		4	_aSoftware Engineering _9211
655		7	_2NULIB _aDissertation, Academic _9187
690			_aSoftware Engineering _9211
942			_2ddc _cTH
999			_c8778 _d8778