Comparison between Logistic Regression and K-Nearest Neighbour Techniques with Application on Thalassemia Patients in Mosul

Mohammed Faris Al jbory; Hutheyfa Hazam Taha

doi:10.33899/iqjoss.2025.187789

Section: Research Paper

Issue

Vol. 22 No. 1 (2025): Volume 22 Issue 1

Published

May 1, 2025

Pages

151-167

Abstract

Thalassemia is a genetic disease that is transmitted from parents to children when both parents carry the genetic mutation. The mutation reduces the number, quality, and condition of red blood cells and increases the rate at which they are destroyed, which leads to iron accumulation in the body and a decrease in hemoglobin in the blood. This study aims to develop a model to predict thalassemia using the K-nearest neighbor technique and the logistic regression model, compared on the following evaluation criteria: accuracy, recall, precision, F1-score, and AUC. The data were obtained from Al-Hadbaa Specialized Hospital in Mosul. The data set included 280 observations, of which 149 (53.21%) were thalassemia intermedia and 131 (46.79%) were thalassemia major. The data were divided into 70% for training and 30% for testing. The experimental results showed that the logistic regression model performed better than the K-nearest neighbor algorithm, with a precision of 96%, recall of 98%, and F1-score of 97% in the thalassemia intermedia category and a precision of 97%, recall of 95%, and F1-score of 96% in the thalassemia major category, indicating that logistic regression distinguished well between the two categories. Logistic regression was more effective than the K-nearest neighbor algorithm in classifying thalassemia patients, especially those with thalassemia major. The study also showed that the distance metric used in the K-nearest neighbor algorithm, whether "Manhattan" or "Chebyshev", has a significant impact on prediction accuracy, with the highest accuracy reaching 95% at K = 4. Both the distance metric and the value of K play a major role in the classification results; the optimal value of K was found to be 4.
The researchers suggest increasing the size of the data set, which may improve the accuracy of the models. They also recommend applying other artificial intelligence techniques, especially neural networks, to check for further improvements.
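
The ingredients named in the abstract (Manhattan and Chebyshev distances, a K = 4 neighbor vote, and per-class precision, recall, and F1-score) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-feature toy points, and the tie-breaking rule are assumptions made purely for illustration, and the numbers are not taken from the hospital data.

```python
from collections import Counter

def manhattan(a, b):
    # Sum of absolute coordinate differences (L1 distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest absolute coordinate difference (L-infinity distance).
    return max(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k=4, distance=manhattan):
    # Rank training points by distance to the query and keep the k nearest.
    ranked = sorted(zip(train_X, train_y), key=lambda p: distance(p[0], query))
    votes = Counter(label for _, label in ranked[:k])
    # Majority vote; with an even k a tie is possible, and here (an assumption)
    # it goes to the label that appears first among the nearest neighbors.
    return votes.most_common(1)[0][0]

def precision_recall_f1(y_true, y_pred, positive):
    # Per-class metrics, treating `positive` as the class of interest.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical toy data: two made-up numeric features per patient.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.3),
           (3.0, 3.2), (3.1, 2.9), (2.8, 3.0), (3.2, 3.1)]
train_y = ["intermedia"] * 4 + ["major"] * 4
test_X = [(1.0, 0.9), (3.0, 3.0), (1.3, 1.2), (2.9, 3.3)]
test_y = ["intermedia", "major", "intermedia", "major"]

for metric in (manhattan, chebyshev):
    preds = [knn_predict(train_X, train_y, q, k=4, distance=metric) for q in test_X]
    print(metric.__name__, precision_recall_f1(test_y, preds, positive="major"))
```

In practice the features would usually be scaled before computing distances, and K and the distance metric would be chosen on held-out data rather than fixed in advance.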

Authors

Mohammed Faris Al jbory

Department of Statistics and Informatics, College of Computer Science and Mathematics, University of Mosul, Mosul, Iraq

Hutheyfa Hazam Taha

Department of Statistics and Informatics, College of Computer Science and Mathematics, University of Mosul, Mosul, Iraq

How to Cite

Faris Al jbory, M., & Hazam Taha, H. (2025). Comparison between Logistic Regression and K-Nearest Neighbour Techniques with Application on Thalassemia Patients in Mosul. IRAQI JOURNAL OF STATISTICAL SCIENCES, 22(1), 151–167. https://doi.org/10.33899/iqjoss.2025.187789