University of Khartoum

Arabic Multilevel Part of Speech Tagging Using Lexicon Driven Morphotactics and Viterbi

Title: Arabic Multilevel Part of Speech Tagging Using Lexicon Driven Morphotactics and Viterbi
Author: Eltahir, Majdy Mohamed Eltayeb
Abstract: The main purpose of this study is to provide a reliable morphological analysis and tagging framework that improves the Arabic language tokenization process and resolves the ambiguity inherent in it. The motivation is to facilitate building higher-level Arabic text mining and other NLP applications such as parsers, spelling correctors, topic mining, and text summarization.

The work started with an in-depth literature survey of current approaches and techniques used for Arabic and other languages. Based on the survey, a statistical computational linguistic approach with an emphasis on corpus linguistics was adopted.

The first step was to build a small corpus consisting of two components: one for training and the other for testing. The main advantage of corpus linguistics is that ambiguity can be dealt with and resolved using proven algorithms and n-gram statistics. A simple morphological analyzer was then used to train the system with the intervention of human linguistic experts. The general idea is that the algorithm performs the basic analysis, providing all major alternatives for a given word; the human expert then selects the proper analysis and dismisses all other alternatives. As expected, this phase took some time to complete.

In the second phase, the trained corpus component was used to bootstrap the upper-level Viterbi algorithm to yield a more detailed, ranked analysis. Viterbi is a decoding algorithm for Hidden Markov Models (HMMs) that uses a scoring process to rank all possible solutions for a given word and select a winner. The testing component was used to evaluate the precision and accuracy of the results. As will be seen from the text, the accuracy was very high, in many cases higher than 90%.

The achievements of the thesis may be summarized as follows:
• A small corpus has been built using Arabic text from the web and other resources. The size of this corpus is 490161 words. It covers a wide range of topics.
• The SWAM morphological analyzer and tagger has been implemented and used to perform the tagging process.
• Based on a survey of current tag sets, a pattern-oriented approach was adopted to tag the corpus. A comprehensive hierarchical tag set based on morphological patterns has been built and verified.
• Based on the training data, the algorithm has been used to compute unigram and bigram probabilities at the morpheme and word levels to capture the context and resolve all types of ambiguities.
• To utilize the above statistics in resolving ambiguities and improving accuracy, the algorithm has been equipped with two new components: a Viterbi (HMM) component and a Bayes component.
• Because neither the input text nor the morphological patterns used in the training set are vowelized, the system provides a non-vowelized solution (tag) at the lower level. To resolve this problem, Viterbi has been used to produce the correct vowelized pattern (tag) for a given word. Inner-word vowelization is therefore a side effect of the tagging process.
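
The abstract describes computing bigram statistics from a hand-tagged corpus and using Viterbi to pick the best-scoring tag sequence. The following is a minimal sketch of that general technique, not the thesis's SWAM implementation. The toy English sentences, the three-tag set, the add-one smoothing, and the function names `train` and `viterbi` are all assumptions made for illustration; the thesis works on Arabic morphemes and words with its own hierarchical tag set.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical sentence-start marker

# Toy hand-tagged data (illustrative only; not taken from the thesis corpus).
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ends", "VERB")],
    [("dogs", "NOUN"), ("run", "VERB")],
]

def train(sentences):
    """Collect tag-bigram and tag-word counts from tagged sentences."""
    trans, emit = Counter(), Counter()      # (prev_tag, tag), (tag, word)
    ctx, tag_count = Counter(), Counter()   # prev_tag as context, tag totals
    vocab = set()
    for sent in sentences:
        prev = START
        for word, tag in sent:
            trans[(prev, tag)] += 1
            ctx[prev] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, ctx, tag_count, vocab

def viterbi(words, model):
    """Return the highest-scoring tag sequence under a bigram HMM."""
    if not words:
        return []
    trans, emit, ctx, tag_count, vocab = model
    tags = sorted(tag_count)

    def log_t(prev, tag):  # add-one smoothed transition log-probability
        return math.log((trans[(prev, tag)] + 1) / (ctx[prev] + len(tags)))

    def log_e(tag, word):  # add-one smoothed emission, one slot for unknowns
        return math.log((emit[(tag, word)] + 1) / (tag_count[tag] + len(vocab) + 1))

    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{t: (log_t(START, t) + log_e(t, words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + log_t(p, t), p) for p in tags)
            column[t] = (score + log_e(t, word), prev)
        best.append(column)

    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

model = train(TRAIN)
print(viterbi(["the", "run", "ends"], model))
print(viterbi(["dogs", "run"], model))
```

In this toy data the word "run" is ambiguous between NOUN and VERB, and the bigram context is what decides between the two readings, which is the role the abstract assigns to the n-gram statistics.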
Description: 261 Pages
URI: http://khartoumspace.uofk.edu/123456789/22592

