Book sections

Handling the Deviation from Isometry Between Domains and Languages in Word Embeddings: Applications to Biomedical Text Translation

Félix Gaschi 1, 2, Parisa Rastin 3, Yannick Toussaint 2
2 ORPAILLEUR - Knowledge representation, reasoning
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
3 ABC - Machine Learning and Computational Biology
LORIA - ALGO - Department of Algorithms, Computation, Image and Geometry
Abstract: Previous literature has shown that word embeddings from different languages can be aligned with unsupervised methods based on a distance-preserving mapping, under the assumption that the embeddings are isometric. However, these methods seem to work only when both embeddings are trained on the same domain. We hypothesize that the deviation from isometry might nonetheless be reduced between relevant subsets of embeddings from different domains, which would allow them to be partially aligned. To support this hypothesis, we leverage the Bottleneck distance, a topological data analysis tool used to approximate the deviation from isometry. We also propose a cross-domain and cross-lingual unsupervised alignment method based on a proxy embedding, as a first step towards new cross-lingual alignment methods that generalize to different domains. Results of this method on translation tasks show that unsupervised alignment methods are not doomed to fail in a cross-domain setting. We obtain BLEU-1 scores ranging from 0.38 to 0.50 on translation tasks, where previous fully unsupervised alignment methods obtain near-zero scores in cross-domain settings.
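For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of a BLEU-1 computation. It assumes the common sentence-level, single-reference definition (clipped unigram precision multiplied by a brevity penalty); the abstract does not specify the exact variant or tokenization used in the paper. The function name `bleu1` and the example sentences are hypothetical.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """Sentence-level BLEU-1 against a single reference (assumed variant).

    Both arguments are lists of tokens. The score is the clipped unigram
    precision multiplied by the brevity penalty.
    """
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Clip each candidate token count by its count in the reference,
    # so repeating a correct word is not rewarded more than once per match.
    overlap = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    precision = overlap / len(candidate)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - len(reference) / len(candidate))
    return penalty * precision


# Hypothetical example: a word-by-word translation with wrong word order.
reference = "the patient has a fever".split()
candidate = "the patient a fever high".split()
score = bleu1(candidate, reference)
print(round(score, 2))
```

Because BLEU-1 only counts unigram matches, it ignores word order, which makes it a natural fit for evaluating word-level translation induced from aligned embeddings.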

https://hal.archives-ouvertes.fr/hal-03477901
Contributor: Félix Gaschi
Submitted on: Monday, December 13, 2021 - 4:08:20 PM
Last modification on: Wednesday, March 16, 2022 - 3:47:04 AM
Long-term archiving on: Monday, March 14, 2022 - 7:20:57 PM

File

main.pdf
Files produced by the author(s)

Citation

Félix Gaschi, Parisa Rastin, Yannick Toussaint. Handling the Deviation from Isometry Between Domains and Languages in Word Embeddings: Applications to Biomedical Text Translation. Neural Information Processing, 13109, Springer International Publishing, pp.216-227, 2021, Lecture Notes in Computer Science, ⟨10.1007/978-3-030-92270-2_19⟩. ⟨hal-03477901v1⟩
