p-adic distance and k-Nearest Neighbor classification


Kartal E., Çalışkan F., Eskişehirli B. B., Özen Z.

Neurocomputing, vol. 578, 2024 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 578
  • Publication Date: 2024
  • DOI: 10.1016/j.neucom.2024.127400
  • Journal Name: Neurocomputing
  • Indexed In: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, PASCAL, Applied Science & Technology Source, Biotechnology Research Abstracts, Compendex, Computer & Applied Sciences, INSPEC, zbMATH
  • Keywords: Classification, k-NN, Machine learning, Metric, The p-adic distance
  • İstanbul University Affiliated: Yes

Abstract

The k-Nearest Neighbor (k-NN) algorithm is a well-known supervised learning method, and the distance measure used in the analysis has a strong effect on its performance. By Ostrowski's theorem, every nontrivial absolute value on the field of rational numbers, Q, is equivalent to either the usual absolute value or the p-adic absolute value for some prime p. In view of this theorem, the p-adic absolute value motivates calculating the p-adic distance between two samples in the k-NN algorithm. In this study, the p-adic distance on Q was coupled with the k-NN algorithm and applied to 10 well-known public datasets containing categorical, numerical, and mixed (both categorical and numerical) predictive attributes. The performance of the p-adic distance was compared with the Euclidean, Manhattan, Chebyshev, and Cosine distances. The average accuracy obtained with the p-adic distance ranked first on 5 of the 10 datasets; in particular, on the mixed datasets the p-adic distance gave better results than the other distances. For r = 1, 2, 3, the effect of keeping r decimal digits of the numbers in the p-adic calculation was examined on the numerical and mixed datasets. In addition, the p parameter of the p-adic distance was tested with prime numbers less than 29, and the average accuracies obtained for the different values of p were very close to each other, especially on the categorical and mixed datasets. It can also be concluded that k-NN with the p-adic distance may be more suitable for binary classification than for multi-class classification.
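
The abstract leaves the construction at a high level, so the following is a minimal, hedged sketch of how a p-adic distance could be paired with k-NN. It is not the authors' implementation: the truncation of numeric values to r decimal digits before taking the p-adic valuation, the coordinate-wise summation into a vector-level distance, the choices p = 2 and r = 2, and the stand-in binary dataset are all illustrative assumptions. Recall that |x|_p = p^(-v_p(x)), where v_p(x) is the exponent of p in x; for example, |12 - 4|_2 = |8|_2 = 2^(-3) = 1/8, so two samples whose difference is divisible by a high power of p are p-adically close.

    # Sketch only: an assumed p-adic distance plugged into scikit-learn's k-NN.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    def p_adic_abs(n: int, p: int) -> float:
        """p-adic absolute value of an integer: |n|_p = p**(-v_p(n)), with |0|_p = 0."""
        if n == 0:
            return 0.0
        n, v = abs(n), 0
        while n % p == 0:          # v_p(n): how many times p divides n
            n //= p
            v += 1
        return float(p) ** (-v)

    def p_adic_distance(x, y, p=2, r=2):
        """Sum of coordinate-wise p-adic distances |x_i - y_i|_p after keeping
        r decimal digits (scaling by 10**r to integers). Both the truncation and
        the summation are assumptions for this sketch, not the paper's definition."""
        diff = np.rint((np.asarray(x) - np.asarray(y)) * 10**r).astype(np.int64)
        return sum(p_adic_abs(int(d), p) for d in diff)

    if __name__ == "__main__":
        # Stand-in numeric binary-classification dataset (not necessarily one of
        # the ten datasets used in the paper).
        X, y = load_breast_cancer(return_X_y=True)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        knn = KNeighborsClassifier(
            n_neighbors=5,
            algorithm="brute",     # brute-force search works with a callable metric
            metric=lambda a, b: p_adic_distance(a, b, p=2, r=2),
        )
        knn.fit(X_tr, y_tr)
        print("test accuracy:", knn.score(X_te, y_te))

With a callable metric, scikit-learn's brute-force neighbor search evaluates the function for every training/query pair, so this runs much slower than the built-in metrics; the sketch is only meant to show where a p-adic distance plugs into the k-NN pipeline.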