Petro Pukach

Work place: Lviv Polytechnic National University, Lviv, 79013, Ukraine

E-mail: Petro.Y.Pukach@lpnu.ua

Website: https://orcid.org/0000-0002-0359-5025

Research Interests: Computer systems and computational processes, Systems Architecture, Solid Modeling, Analysis of Algorithms, Mathematics of Computing, Theory of Computation, Models of Computation

Biography

Petro Pukach received his master’s degree in mathematics from Lviv Ivan Franko University, Lviv, Ukraine in 1990. He received his PhD degree in Mathematics (Differential Equations) from the Faculty of Mathematics and Mechanics, Lviv Ivan Franko University, Lviv, Ukraine in 1993. He received his ScD degree in Engineering (Dynamics and Strength of Machines) from the Institute of Mechanics, Lviv Polytechnic National University, Ukraine in 2014. He is currently the Director of the Institute of Applied Mathematics and Fundamental Sciences at Lviv Polytechnic National University, Lviv, Ukraine. His research interests include applications of computational and asymptotic methods in the mathematical modeling of complex systems and applications of statistical methods in IT.

Author Articles
Intelligent Analysis of Ukrainian-language Tweets for Public Opinion Research based on NLP Methods and Machine Learning Technology

By Oleh Prokipchuk, Victoria Vysotska, Petro Pukach, Vasyl Lytvyn, Dmytro Uhryn, Yuriy Ushenko, Zhengbing Hu

DOI: https://doi.org/10.5815/ijmecs.2023.03.06, Pub. Date: 8 Jun. 2023

The article develops a technology for finding tweet trends based on clustering, which forms a data stream of short cluster representations and their popularity for further public opinion research. The accuracy of the result is affected by the natural-language features of the tweet stream. An effective approach to tweet collection, filtering, cleaning and pre-processing is described, based on a comparative analysis of the Bag of Words, TF-IDF and BERT algorithms. The impact of stemming and lemmatization on the quality of the obtained clusters was determined: stemming and lemmatization reduce the input vocabulary of Ukrainian words by 40.21% and 32.52% respectively. Optimal combinations of clustering methods (K-Means, Agglomerative Hierarchical Clustering and HDBSCAN) and tweet vectorization methods were found based on the analysis of 27 clusterings of one data sample. A method of presenting clusters of tweets in a short format was selected. Algorithms using the Levenshtein distance, i.e. fuzz sort, fuzz set and Levenshtein, showed the best results. These algorithms perform checks quickly and show a greater difference in similarities, so the similarity threshold can be determined more accurately. According to the clustering results, the optimal solutions are to use the HDBSCAN clustering algorithm with BERT vectorization to achieve the most accurate results, and to use K-Means together with TF-IDF to achieve the best speed with an acceptable result. Stemming can be used to reduce execution time. In this study, the optimal options for comparing cluster fingerprints were found experimentally among the following similarity search methods: Fuzz Sort, Fuzz Set, Levenshtein, Jaro Winkler, Jaccard, Sorensen, Cosine, Sift4. For some algorithms, the average fingerprint similarity exceeds 70%. Three effective tools for comparing fingerprint similarity were found, as they show a sufficient difference between comparisons of similar and different clusters (> 20%).
The experimental testing was based on the analysis of 90,000 tweets collected over 7 days for 5 different weekly topics: President Volodymyr Zelenskyi, Leopard tanks, Boris Johnson, Europe, and the bright memory of the deceased. The research used the following combinations for clustering and vectorization: K-Means with TF-IDF, Agglomerative Hierarchical Clustering with TF-IDF, and HDBSCAN with BERT. Additionally, fuzz sort was implemented for comparing cluster fingerprints with a similarity threshold of 55%. The best methods for comparing fingerprints were fuzz sort, fuzz set, and Levenshtein. Among these three, the best execution speed was achieved with the Levenshtein method. The other two methods were three times slower, but they are still nearly 13 times faster than Sift4. The fastest method overall is Jaro Winkler, but it shows a difference in similarities of only 19.51%. The method with the best difference in similarities is fuzz set (60.29%). Fuzz sort (32.28%) and Levenshtein (28.43%) took second and third place respectively. These methods rely on the Levenshtein distance, indicating that such an approach works well for comparing sets of keywords. The other algorithms fail to show significant differences between different fingerprints, suggesting that they are not adapted to this type of task.
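
The fingerprint comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the normalisation formula (edit distance divided by the longer string's length) and the sample fingerprints are assumptions made for illustration. Only the Levenshtein distance, the idea of sorting keywords before comparison (as in fuzz sort) and the 55% threshold come from the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in percent; this normalisation is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def token_sort_similarity(fp_a: str, fp_b: str) -> float:
    """Sort the keywords first so that word order does not matter."""
    def normalise(fp: str) -> str:
        return " ".join(sorted(fp.lower().split()))
    return similarity(normalise(fp_a), normalise(fp_b))


SIMILARITY_THRESHOLD = 55.0  # threshold quoted in the abstract


def same_trend(fp_a: str, fp_b: str,
               threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Treat two cluster fingerprints as one trend above the threshold."""
    return token_sort_similarity(fp_a, fp_b) >= threshold


# Hypothetical fingerprints (keyword lists), for illustration only
print(levenshtein("kitten", "sitting"))  # 3
print(same_trend("leopard tanks ukraine", "ukraine tanks leopard"))
print(same_trend("leopard tanks germany", "boris johnson visit"))
```

Sorting the keywords before measuring the distance makes the comparison insensitive to word order, which suits fingerprints that are sets of keywords rather than sentences.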
