Apprentissage supervisé d'un étiqueteur morphosyntaxique automatique de la langue amazighe

Authors

Mohamed OUTAHAJALA
UNIVERSITE MOHAMED V - RABAT

Keywords:

Computational linguistics, machine learning, POS tagging, Amazigh language, SVMs, CRFs, NLP

Synopsis

Like most languages that have only recently been studied from a Natural Language Processing (NLP) perspective, Amazigh suffers from a scarcity of resources and NLP tools. Against this background, the main aim of this thesis is to provide the language with its first full-fledged part-of-speech (POS) tagger.
POS tagging can be viewed as the first annotation layer above the lexical level and the lowest level of syntactic analysis, on which NLP tasks dealing with higher linguistic levels build. It enriches input texts with additional information that downstream NLP tasks can exploit.
To develop a POS tagger for the Amazigh language, we trained two sequence labeling models, namely Support Vector Machines (SVMs) and Conditional Random Fields (CRFs), after a tokenization preprocessing step. In our experiments, we used 10-fold cross-validation to evaluate our approach. The results obtained are very promising, even with a small labeled data set of about 20k words. While creating labeled data for under-resourced languages is a hard task, obtaining raw data, notwithstanding the time required to preprocess it, is less costly. We therefore explored the use of external resources to improve the performance of the tagger. We also built a corpus of about a quarter of a million words; the informativeness of non-vocabulary words, as well as a confidence measure, was used to reduce the error rate of the tagger. To further improve the accuracy of our tagger, we used a lexical resource that includes grammatical labels.
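The 10-fold cross-validation protocol mentioned in the synopsis can be illustrated with a minimal sketch. It is not the thesis's implementation: a most-frequent-tag baseline stands in for the SVM and CRF models, and the function names (`k_fold_splits`, `train_baseline`, `cross_validate`), the tag labels, and the toy corpus of placeholder tokens are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def k_fold_splits(sentences, k=10):
    """Yield (train, test) pairs; fold i holds out every k-th sentence starting at i."""
    for i in range(k):
        test = sentences[i::k]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train, test

def train_baseline(train):
    """Most-frequent-tag baseline (an assumed stand-in for an SVM or CRF tagger)."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sentence in train:
        for word, tag in sentence:
            per_word[word][tag] += 1
            all_tags[tag] += 1
    default_tag = all_tags.most_common(1)[0][0]  # fallback for unseen words
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lexicon, default_tag

def accuracy(model, test):
    """Token-level accuracy of the baseline on held-out sentences."""
    lexicon, default_tag = model
    correct = total = 0
    for sentence in test:
        for word, tag in sentence:
            correct += lexicon.get(word, default_tag) == tag
            total += 1
    return correct / total if total else 0.0

def cross_validate(sentences, k=10):
    """Average token accuracy over the k held-out folds."""
    scores = [accuracy(train_baseline(tr), te) for tr, te in k_fold_splits(sentences, k)]
    return sum(scores) / len(scores)

# Hypothetical toy corpus: placeholder tokens and tags, not real Amazigh data.
toy_corpus = [
    [("w1", "N"), ("w2", "V")],
    [("w3", "N"), ("w1", "N")],
] * 10

mean_accuracy = cross_validate(toy_corpus, k=10)
```

Splitting at the sentence level rather than the token level keeps each sentence's context intact within a single fold, which matters once the baseline is swapped for a genuine sequence model.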

Published

June 29, 2023
