Cross-lingual homonym detection with adversarial embedding alignment and rank-difference-weighted CSLS

DOI: 10.31673/2412-9070.2026.017410

Authors

  • А. І. Пилипенко, (Pylypenko A.) Taras Shevchenko National University of Kyiv
  • Д. Є. Данилко, (Danylko D.) Taras Shevchenko National University of Kyiv

Abstract

The article presents a novel method for detecting cross-lingual homonyms – pairs of words in different languages that are similar in form but differ in meaning. Such pairs are a significant source of errors in machine translation and other multilingual Natural Language Processing (NLP) tasks. The research focuses on this problem for typologically close yet low-resource language pairs, using Polish and Ukrainian as a case study. The proposed approach is implemented as a two-stage pipeline that improves both the alignment of cross-lingual word vector representations and the subsequent computation of semantic similarity. In the first stage, monolingual FastText vectors for Polish and Ukrainian are aligned using unsupervised adversarial training, which projects both languages into a shared vector space without using parallel data or English as an intermediary. This alignment is further refined using the Procrustes algorithm, which leverages a synthetic dictionary built from mutual nearest neighbors identified by the Cross-domain Similarity Local Scaling (CSLS) method. The second stage, which is the main innovation of the work, refines the standard CSLS metric to compensate for the "hubness" problem in high-dimensional spaces. The authors introduce a rank-difference weighting mechanism that penalizes or rewards word pairs depending on the mutual consistency of their nearest-neighbor ranks in both translation directions (from Polish to Ukrainian and vice versa). This correction yields a more sensitive similarity metric that distinguishes three semantic groups: true translations; cross-lingual homonyms, which a human reader could misinterpret; and formally similar but semantically unrelated words.
Experimental results on a specially compiled dataset of 150 Polish-Ukrainian word pairs show that CSLS with rank-difference weighting provides significantly better separation between these groups than standard cosine similarity or standard CSLS. In conclusion, the research contributes to the field of cross-lingual natural language processing by demonstrating that a combination of robust unsupervised alignment and a semantically grounded, rank-weighted similarity metric makes it possible to effectively solve the complex task of homonym detection.
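
The second-stage idea can be illustrated with a minimal sketch. The standard CSLS score below follows its usual definition, 2·cos(x, y) minus the mean cosine similarity of each word to its k nearest neighbors in the other language. The abstract does not give the authors' exact weighting formula, so the subtractive penalty `alpha * |rank difference|`, the parameters `k` and `alpha`, and all function names are assumptions made purely for illustration. The sketch operates on vectors assumed to be already aligned in a shared space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def csls_matrix(src, tgt, k=2):
    """Standard CSLS: 2*cos(x, y) - r_T(x) - r_S(y), where r is the mean
    cosine similarity of a word to its k nearest cross-lingual neighbors."""
    cos = [[cosine(x, y) for y in tgt] for x in src]
    k_src = min(k, len(tgt))
    k_tgt = min(k, len(src))
    r_src = [sum(sorted(row, reverse=True)[:k_src]) / k_src for row in cos]
    r_tgt = [sum(sorted(col, reverse=True)[:k_tgt]) / k_tgt for col in zip(*cos)]
    return [[2 * cos[i][j] - r_src[i] - r_tgt[j] for j in range(len(tgt))]
            for i in range(len(src))]

def rank_of(scores, idx):
    """1-based rank of scores[idx] within scores (1 = highest score)."""
    return 1 + sum(1 for s in scores if s > scores[idx])

def rank_weighted_csls(src, tgt, k=2, alpha=0.1):
    """Hypothetical rank-difference weighting (not the authors' formula):
    subtract alpha * |forward rank - backward rank| from each CSLS score,
    so pairs ranked inconsistently in the two directions are penalized."""
    csls = csls_matrix(src, tgt, k)
    cols = list(zip(*csls))
    adjusted = []
    for i, row in enumerate(csls):
        adjusted_row = []
        for j, score in enumerate(row):
            rank_fwd = rank_of(row, j)      # rank of target j for source i
            rank_bwd = rank_of(cols[j], i)  # rank of source i for target j
            adjusted_row.append(score - alpha * abs(rank_fwd - rank_bwd))
        adjusted.append(adjusted_row)
    return adjusted

# Toy aligned vectors standing in for Polish (src) and Ukrainian (tgt) words.
src_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt_vectors = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.9]]
scores = rank_weighted_csls(src_vectors, tgt_vectors, k=2, alpha=0.1)
```

In this sketch a pair keeps its CSLS score only when both directions rank it identically; any thresholds used to separate translations, homonyms, and unrelated words would have to be tuned on labeled data.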

Keywords: cross-lingual homonyms; word embeddings; CSLS; adversarial alignment; unsupervised learning; semantic similarity; rank difference; FastText.

Published

2026-03-25

Issue

Section

Articles