Supervised learning of an automatic part-of-speech tagger for the Amazigh language
Computational linguistics, machine learning, POS tagging, Amazigh language, SVMs, CRFs, NLP
Synopsis
Like most languages that have only recently been studied from a Natural Language Processing (NLP) perspective, Amazigh suffers from a scarcity of resources and NLP tools. Against this background, the main aim of this thesis is to provide the language with its first full-fledged part-of-speech (POS) tagger.
POS tagging can be viewed as the first annotation layer above the lexical level and the lowest level of syntactic analysis, on which NLP tasks dealing with higher linguistic levels build. It enriches input texts with additional information that downstream NLP tasks can exploit.
In order to develop a POS tagger for the Amazigh language, we trained two sequence labeling models, namely Support Vector Machines (SVMs) and Conditional Random Fields (CRFs), after a tokenization preprocessing step. In our experiments, we used 10-fold cross-validation to evaluate our approach. The results are very promising, even with a small labeled data set of about 20k words. While creating labeled data for under-resourced languages is a hard task, obtaining raw data is less costly, notwithstanding the time required to preprocess it. We have therefore explored the use of external resources to improve the performance of the tagger. We have also built a corpus of about a quarter of a million words; the informativeness of non-vocabulary words, together with a confidence measure, has been used to reduce the error rate of the tagger. To further improve the accuracy of the tagger, we have used a lexical resource that includes grammatical labels.
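As an illustration of the evaluation protocol only, the following minimal sketch shows how 10-fold cross-validation of a tagger's token-level accuracy can be organized. It is not the implementation used in the thesis: the most-frequent-tag baseline is a hypothetical stand-in for the SVM and CRF models, and the function names (k_fold_indices, train_baseline, tag, cross_validate) and the contiguous, unshuffled fold layout are assumptions made for the example.

```python
from collections import Counter, defaultdict


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size


def train_baseline(sentences):
    """Most-frequent-tag baseline (hypothetical stand-in for SVM/CRF training)."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sentence in sentences:
        for word, tag_label in sentence:
            per_word[word][tag_label] += 1
            all_tags[tag_label] += 1
    default_tag = all_tags.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lexicon, default_tag


def tag(model, words):
    """Tag each word with its most frequent training tag, else the default tag."""
    lexicon, default_tag = model
    return [lexicon.get(w, default_tag) for w in words]


def cross_validate(sentences, k=10):
    """Return the mean token-level accuracy over k folds."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(sentences), k):
        model = train_baseline([sentences[i] for i in train_idx])
        correct = total = 0
        for i in test_idx:
            words = [w for w, _ in sentences[i]]
            gold = [t for _, t in sentences[i]]
            predicted = tag(model, words)
            correct += sum(p == g for p, g in zip(predicted, gold))
            total += len(gold)
        accuracies.append(correct / total)
    return sum(accuracies) / len(accuracies)
```

Here each sentence is a list of (word, tag) pairs. Replacing train_baseline and tag with a real sequence labeler leaves the fold logic unchanged; in practice the sentences would usually be shuffled before splitting.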
