Publication:
Lemmatization for Azerbaijani Language Using Attention Mechanism

Abstract

Lemmatization is often handled with rule-based methods, which work well for many languages. However, these methods struggle with agglutinative languages such as Azerbaijani, where writing all the necessary rules is challenging and sometimes impossible. While preprocessing steps like lemmatization and stemming are becoming less crucial for well-studied languages such as English, they remain critical for agglutinative languages with complex morphology such as Azerbaijani. The current state-of-the-art hybrid lemmatizer for Azerbaijani recently improved sentiment analysis results by 8.9%, which underscores the continued importance of these preprocessing steps for agglutinative languages. This thesis introduces a novel approach to lemmatization for Azerbaijani based on an attention-based sequence-to-sequence model, achieving 98% character-level accuracy across a 91,941-word sample and 96% word-level accuracy, surpassing the 95% word-level accuracy of the current state-of-the-art hybrid lemmatizer. The approach demonstrates the potential to replace rule-based or hybrid lemmatization methods with a fully machine-learning-based model, eliminating the need to write complex rules by hand for agglutinative languages.
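
The abstract reports both a character-level and a word-level accuracy. It does not say how either is computed, so the following is only a minimal sketch under stated assumptions: word-level accuracy is taken to be the share of lemmas predicted exactly, and character-level accuracy the share of character positions that match when each predicted lemma is compared with its gold lemma position by position (positions missing from the shorter string count as errors). The function names and the three toy word pairs are hypothetical illustrations and are not taken from the thesis or its dataset.

```python
def word_accuracy(predicted, gold):
    """Share of lemmas predicted exactly (assumed definition)."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    exact = sum(p == g for p, g in zip(predicted, gold))
    return exact / len(gold)


def char_accuracy(predicted, gold):
    """Share of matching character positions (assumed definition).

    Each pair is compared position by position; the denominator uses the
    longer of the two strings, so extra or missing characters are errors.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = 0
    total = 0
    for p, g in zip(predicted, gold):
        total += max(len(p), len(g))
        correct += sum(a == b for a, b in zip(p, g))
    return correct / total if total else 0.0


# Toy data (illustrative only): gold lemmas for "kitablar", "evlərdə", "gəlirəm".
gold = ["kitab", "ev", "gəl"]
# Hypothetical model output with one over-long lemma.
pred = ["kitab", "ev", "gəli"]

print(f"word-level accuracy: {word_accuracy(pred, gold):.3f}")
print(f"char-level accuracy: {char_accuracy(pred, gold):.3f}")
```

Under these assumed definitions a single wrong character costs a whole word at the word level but only one position at the character level, which is one plausible reason a character-level figure would sit above the word-level one.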

Keywords

Natural Language Processing (NLP), Machine Learning, Deep Learning, Sequence-to-Sequence Model, Attention Mechanism, Morphological Analysis, Agglutinative Languages, Computational Linguistics, Text Preprocessing
