Supervised Learning of an Automatic Part-of-Speech Tagger for the Amazigh Language
Keywords:
Computational linguistics, machine learning, POS tagging, Amazigh language, SVMs, CRFs, NLP
Synopsis
Like most languages that have only recently been studied from a Natural Language Processing (NLP) perspective, Amazigh suffers from a scarcity of resources and NLP tools. Against this background, the main aim of this thesis is to provide the language with its first full-fledged part-of-speech (POS) tagger.
POS tagging may be viewed as the first layer above the lexical level and the lowest level of syntactic analysis, on which NLP tasks dealing with higher linguistic levels build. It enriches input texts with additional information that other NLP tasks can then exploit.
To develop a POS tagger for the Amazigh language, we trained two sequence labeling models, namely Support Vector Machines (SVMs) and Conditional Random Fields (CRFs), after a tokenization preprocessing step. In our experiments, we used 10-fold cross-validation to evaluate our approach. The results are promising, even with a small labeled data set of about 20k words. While creating labeled data for under-resourced languages is a hard task, obtaining raw data is less costly, notwithstanding the time required to preprocess it. We therefore explored the use of external resources to improve the performance of the tagger. We built a corpus of about a quarter of a million words; the informativeness of non-vocabulary words, together with a confidence measure, was used to reduce the error rate of the tagger. To further improve the accuracy of the tagger, we used a lexical resource that includes grammatical labels.
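The 10-fold cross-validation protocol mentioned above can be illustrated with a minimal sketch. It rests on assumptions that do not come from the thesis. The most-frequent-tag baseline is only a stand-in for the SVM and CRF models. The toy corpus uses placeholder tokens and tags rather than Amazigh data. All function names (`k_fold_splits`, `train_baseline`, `token_accuracy`, `cross_validate`) are hypothetical.

```python
import random
from collections import Counter, defaultdict

# Placeholder corpus: each sentence is a list of (token, tag) pairs.
# Tokens and tags are illustrative only, not taken from the Amazigh corpus.
TOY_CORPUS = (
    [[("a", "DET"), ("b", "NOUN"), ("c", "VERB")] for _ in range(10)]
    + [[("d", "NOUN"), ("c", "VERB")] for _ in range(10)]
)

def k_fold_splits(sentences, k=10, seed=0):
    """Yield (train, test) pairs; each sentence is held out exactly once."""
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [sentences[i] for i in indices if i not in held]
        test = [sentences[i] for i in held_out]
        yield train, test

def train_baseline(train):
    """Stand-in model (assumption): most frequent tag per word,
    with the globally most frequent tag as fallback for unseen words."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in train:
        for word, tag in sentence:
            per_word[word][tag] += 1
            overall[tag] += 1
    fallback = overall.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lexicon, fallback

def token_accuracy(model, test):
    """Fraction of test tokens whose predicted tag matches the gold tag."""
    lexicon, fallback = model
    correct = total = 0
    for sentence in test:
        for word, gold in sentence:
            correct += lexicon.get(word, fallback) == gold
            total += 1
    return correct / total if total else 0.0

def cross_validate(sentences, k=10, seed=0):
    """Return the mean token accuracy over k folds and the per-fold scores."""
    scores = [
        token_accuracy(train_baseline(train), test)
        for train, test in k_fold_splits(sentences, k, seed)
    ]
    return sum(scores) / len(scores), scores

if __name__ == "__main__":
    mean_acc, scores = cross_validate(TOY_CORPUS, k=10)
    print(f"mean token accuracy over {len(scores)} folds: {mean_acc:.3f}")
```

Splitting at the sentence level rather than the token level keeps each sentence's context intact within one fold, which matters once the baseline is replaced by a sequence model such as a CRF.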