Publication: Lemmatization for Azerbaijani Language Using Attention Mechanism
Abstract
Lemmatization is often handled with rule-based methods, which work well for many languages. However, these methods struggle with agglutinative languages such as Azerbaijani, where writing all the necessary rules is challenging and sometimes impossible. While preprocessing steps like lemmatization and stemming are becoming less crucial for well-studied languages such as English, they remain critical for agglutinative languages with complex structures such as Azerbaijani. Recently, the current state-of-the-art hybrid lemmatizer for Azerbaijani improved sentiment analysis results by 8.9%, which underscores the continued importance of these preprocessing steps in agglutinative languages. This thesis introduces a novel approach to lemmatization for the Azerbaijani language using an attention-based sequence-to-sequence model. The model achieves 98% character-level accuracy across a 91,941-word sample and 96% word-level accuracy, surpassing the 95% word-level accuracy of the current state-of-the-art hybrid lemmatizer. This approach demonstrates the potential to replace rule-based or hybrid lemmatization methods with a fully machine-learning-based model, eliminating the need to manually write complex rules for agglutinative languages.
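
The abstract reports both character-level and word-level accuracy but does not say how either is computed. As a rough illustration only, the sketch below shows one plausible way to define the two metrics. The function names, the position-wise character comparison, and the toy predicted/gold lemma pairs are all assumptions made for this example and are not taken from the thesis.

```python
def word_accuracy(predicted, gold):
    """Fraction of predicted lemmas that match the gold lemma exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def char_accuracy(predicted, gold):
    """Position-wise character agreement (an assumed definition).

    Each pair is compared character by character from the left; the
    denominator is the longer of the two strings, so extra or missing
    characters count as errors.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = 0
    total = 0
    for p, g in zip(predicted, gold):
        total += max(len(p), len(g))
        correct += sum(1 for a, b in zip(p, g) if a == b)
    return correct / total if total else 0.0


# Hypothetical model outputs and reference lemmas, for illustration only.
predicted = ["kitab", "ev", "gəli"]
gold = ["kitab", "ev", "gəl"]

print(f"word-level accuracy: {word_accuracy(predicted, gold):.3f}")
print(f"character-level accuracy: {char_accuracy(predicted, gold):.3f}")
```

Under these definitions, a single extra character makes the whole word count as wrong at the word level but costs only one position at the character level, which is consistent with the character-level figure in the abstract being higher than the word-level one.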
