Studying the impact of anomalies and duplications in aerial surveying datasets on the quality of deep learning

DOI: 10.31673/2412-9070.2026.022704

Authors

  • П. О. Приставка (Prystavka P.), Taras Shevchenko National University of Kyiv, State University «Kyiv Aviation Institute»
  • О. Г. Чолишкіна (Cholyshkina O.), Taras Shevchenko National University of Kyiv
  • О. С. Подскребко (Podskrebko O.), Taras Shevchenko National University of Kyiv
  • М. І. Боришкевич (Boryshkevich M.), State University «Kyiv Aviation Institute»

Abstract

The quality of training datasets is a critical factor in the performance of deep learning models for automated aerial image analysis. Anomalous and duplicate samples in aerial imagery datasets distort the statistical properties of the data, reduce distribution entropy, and degrade model generalization. This paper investigates the impact of such data imperfections on multi-class image classification and analyzes the effectiveness of statistical anomaly detection methods applied in a compact latent space.
A convolutional autoencoder is employed to generate low-dimensional latent representations of aerial images, providing an informative and noise-resistant space for subsequent statistical analysis. Anomalous samples are identified using the three-sigma rule, skewness- and kurtosis-based analysis, and a multidimensional variation series approach. The effect of removing duplicate images is examined separately. Dataset quality is evaluated through entropy-based characteristics of the latent space and the classification accuracy of a ResNet50 convolutional neural network.
Experimental results demonstrate that removing anomalous samples improves classification accuracy on the test dataset. Among the considered approaches, the three-sigma method proved the most effective, improving accuracy by up to 2.1% by eliminating samples that lie far from the distribution center and do not represent typical class characteristics. Data cleaning is also shown to increase latent space entropy, indicating higher information richness and a more uniform data distribution. This increase in entropy correlates with improved generalization performance of the classifier. The results confirm that statistical analysis of latent representations is an effective stage in preparing aerial imagery datasets for deep learning applications.
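
To make the three-sigma filtering and the entropy measure more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the latent vectors have already been produced by an encoder, that the three-sigma rule is applied to each latent coordinate independently, and that entropy is estimated from an equal-width histogram of a single coordinate. The abstract does not state these details, so the function names `three_sigma_mask` and `histogram_entropy`, the `bins` parameter, and the toy data are all hypothetical.

```python
import math
import statistics


def three_sigma_mask(latents, k=3.0):
    """Return one bool per sample: True if the sample is kept.

    Assumption: a sample is anomalous if ANY latent coordinate lies
    more than k standard deviations from that coordinate's mean.
    """
    columns = list(zip(*latents))  # transpose: one tuple per latent dimension
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        all(abs(x - m) <= k * s for x, m, s in zip(vec, means, stds))
        for vec in latents
    ]


def histogram_entropy(values, bins=10):
    """Shannon entropy (in bits) of an equal-width histogram of values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # all mass in one bin
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum value
        counts[idx] += 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


# Toy 2-D "latent space": 20 typical samples plus one distant sample.
latents = [[0.1 * i, 0.0] for i in range(20)] + [[100.0, 0.0]]
keep = three_sigma_mask(latents)
cleaned = [vec for vec, ok in zip(latents, keep) if ok]

entropy_before = histogram_entropy([vec[0] for vec in latents])
entropy_after = histogram_entropy([vec[0] for vec in cleaned])
print(len(latents), len(cleaned), entropy_before < entropy_after)
```

In this toy case the single distant sample stretches the histogram range so that the typical samples collapse into one bin. Removing it spreads them across the bins, which is one way the reported rise in latent space entropy after cleaning could arise. In practice the latent vectors would come from the trained autoencoder, and a multivariate entropy estimate may be more appropriate than a per-coordinate histogram.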

Keywords: aerial imagery, anomalous data, image duplicates, deep learning, convolutional neural networks, autoencoder, latent space, data entropy, image classification, generalization performance.

Published

2026-04-26

Issue

Section

Articles