Comparison between Logistic Regression and K-Nearest Neighbour Techniques with Application on Thalassemia Patients in Mosul
Abstract
Thalassemia is a genetic disease that is transmitted from parents to children when both parents are carriers of the genetic mutation. The mutation leads to a decrease in the number, quality, and condition of red blood cells and an increase in the rate at which red blood cells are destroyed, which leads to iron accumulation in the body and a decrease in hemoglobin in the blood. This project aims to develop a model to predict thalassemia using the K-nearest neighbor technique and the logistic regression model, compared on the following evaluation criteria: accuracy, recall, precision, F1-score, and AUC. The data were obtained from Al-Hadbaa Specialized Hospital in Mosul. The data set included 280 observations, of which 149 (53.21%) were thalassemia intermedia and 131 (46.79%) were thalassemia major. The data were divided into 70% for training and 30% for testing. The experimental results showed that the logistic regression model performed better than the K-nearest neighbor algorithm, with a precision of 96%, recall of 98%, and F1-score of 97% in the thalassemia intermedia category, and a precision of 97%, recall of 95%, and F1-score of 96% in the thalassemia major category, indicating that logistic regression distinguished well between the two categories. Logistic regression was therefore more effective than the K-nearest neighbor algorithm in classifying thalassemia patients, especially those with thalassemia major. The study also showed that the type of distance used in the K-nearest neighbor algorithm, whether Manhattan or Chebyshev, has a significant impact on the accuracy of predictions, with the highest accuracy reaching 95% when K = 4. Both the distance calculation method and the value of K play a major role in the classification results, and K = 4 was identified as the optimal value.
The researcher suggests increasing the sample size, since more data may improve the accuracy of the models. In addition, the researcher recommends evaluating other artificial intelligence techniques, especially neural networks, to check whether they yield further improvements.
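To make the K-nearest neighbor side of the comparison concrete, the following is a minimal sketch, not the study's actual implementation. It shows how the Manhattan and Chebyshev distances enter a K = 4 majority vote, and how precision, recall, and F1-score are computed per class. The feature values, the function names, and the tie-breaking rule are assumptions made for illustration; the toy points are invented and are not the Al-Hadbaa hospital data. Logistic regression and AUC are not sketched here.

```python
from collections import Counter


def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k, distance):
    """Majority vote among the k training points closest to `query`.

    Assumption: a tied vote goes to the tied label whose neighbour is
    closest to the query (the source does not say how ties are handled).
    """
    ranked = sorted(zip(train_X, train_y), key=lambda p: distance(p[0], query))
    labels = [label for _, label in ranked[:k]]  # nearest first
    votes = Counter(labels)
    best = max(votes.values())
    for label in labels:
        if votes[label] == best:
            return label


def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented two-feature toy data (hypothetical, NOT the study's data).
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.0, 2.5),
           (6.0, 6.0), (6.5, 7.0), (7.0, 6.0), (7.5, 7.5)]
train_y = ["intermedia"] * 4 + ["major"] * 4
test_X = [(1.2, 1.4), (2.2, 1.8), (6.8, 6.4), (7.2, 7.0)]
test_y = ["intermedia", "intermedia", "major", "major"]

# Classify the held-out points with K = 4 under each distance metric.
results = {}
for name, dist in (("manhattan", manhattan), ("chebyshev", chebyshev)):
    preds = [knn_predict(train_X, train_y, q, 4, dist) for q in test_X]
    results[name] = {
        "predictions": preds,
        "major": precision_recall_f1(test_y, preds, "major"),
        "intermedia": precision_recall_f1(test_y, preds, "intermedia"),
    }

for name, summary in results.items():
    print(name, summary)
```

On real data the features would normally be scaled before computing distances, because both metrics are sensitive to the units of each feature. The value of K and the distance metric would be chosen on the training portion only.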