Azerbaijan Text Clustering using Machine Learning Methods
Abstract
In the digital era, the rapid growth of textual data calls for sophisticated text mining and clustering methods. Although the state of the art has advanced for most well-resourced languages, relatively little research has been carried out on lower-resource languages such as Azerbaijani. This thesis investigates clustering algorithms as a means of improving access to information and communication in the Azerbaijani-speaking community. The dataset consists of 15,500 news articles collected from oxu.az. K-means, Fuzzy K-means, Agglomerative Hierarchical Clustering, Spectral Clustering, Gaussian Mixture Models (GMM), and Latent Dirichlet Allocation (LDA) were applied and evaluated using the Silhouette Score (SS) and the Davies-Bouldin Index. Word2Vec embeddings yield a higher Adjusted Rand Index (ARI) than TF-IDF, while Spectral Clustering and LDA achieve the best scores, owing to their ability to capture complex structure in the data. Future work will focus on improved preprocessing, hybrid clustering, and deep learning embeddings. Applications to real-world problems, such as recommendation systems and content categorization, will provide further practical experience with the models.
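The abstract does not describe the implementation, so the following is only a minimal sketch of one slice of such a pipeline: TF-IDF vectors, K-means, and the Silhouette Score. It uses the Python standard library only. The toy corpus, the function names (`tfidf`, `kmeans`, `silhouette`), and the fixed initial centroids are all assumptions made for illustration; they are not taken from the thesis, and a real experiment on thousands of articles would normally rely on a library such as scikit-learn.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return L2-normalised TF-IDF vectors (dense lists) for tokenised docs."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}  # smoothed IDF
    vectors = []
    for d in docs:
        tf = Counter(d)
        v = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(X, init_idx, iters=20):
    """Plain Lloyd's algorithm; centroids start at the rows listed in init_idx."""
    centroids = [X[i][:] for i in init_idx]
    labels = [0] * len(X)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(len(centroids)), key=lambda c: dist(x, centroids[c]))
                  for x in X]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [x for x, l in zip(X, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(X, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    scores = []
    for i, x in enumerate(X):
        same = [dist(x, y) for j, y in enumerate(X)
                if labels[j] == labels[i] and j != i]
        if not same:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance within own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist(x, y) for y, l in zip(X, labels) if l == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy corpus: three sports-like and three finance-like documents.
docs = [
    ["futbol", "komanda", "oyun"],
    ["futbol", "qol", "oyun"],
    ["komanda", "qol", "futbol"],
    ["bank", "manat", "kredit"],
    ["bank", "faiz", "kredit"],
    ["manat", "faiz", "bank"],
]

X = tfidf(docs)
labels = kmeans(X, init_idx=(0, 3))  # fixed initial centroids, for reproducibility
score = silhouette(X, labels)
print(labels, round(score, 3))
```

Because the two toy topics share no vocabulary, the clusters separate cleanly and the Silhouette Score is positive. On real news text the score depends strongly on preprocessing and on the choice of representation, which is the comparison the abstract reports.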

