Equipe Data, Intelligence and Graphs
Conference Papers, Year: 2024

Learning High-Quality and General-Purpose Phrase Representations

Abstract

Phrase representations play an important role in data science and natural language processing, benefiting various tasks like Entity Alignment, Record Linkage, Fuzzy Joins, and Paraphrase Classification. The current state-of-the-art method involves fine-tuning pre-trained language models for phrasal embeddings using contrastive learning. However, we have identified areas for improvement. First, these pre-trained models tend to be unnecessarily complex and need to be pre-trained on a corpus with context sentences. Second, leveraging the phrase type and morphology gives phrase representations that are both more precise and more flexible. We propose an improved framework to learn phrase representations in a context-free fashion. The framework employs phrase type classification as an auxiliary task and incorporates character-level information more effectively into the phrase representation. Furthermore, we design three granularities of data augmentation to increase the diversity of training samples. Our experiments across a wide range of tasks show that our approach generates superior phrase embeddings compared to previous methods while requiring a smaller model size. The code is available at https://github.com/tigerchen52/PEARL
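To make the training objective concrete, the sketch below shows a contrastive (InfoNCE) loss over paired phrase embeddings combined with a cross-entropy loss for an auxiliary phrase-type classifier. This is a minimal NumPy illustration of the general recipe the abstract describes, not the PEARL implementation: the temperature, the linear classification head `W`, and the mixing weight `lam` are all hypothetical choices for the example.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.07):
    """Contrastive loss: each anchor's positive is the same-index row;
    all other rows in the batch act as in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature          # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(probs)).mean()

def type_classification_loss(embeddings, W, labels):
    """Auxiliary task: softmax cross-entropy for phrase-type prediction
    through a (hypothetical) linear head W of shape (dim, n_types)."""
    logits = embeddings @ W
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def combined_loss(anchors, positives, W, labels, lam=0.1):
    # Total objective: contrastive term plus weighted auxiliary term.
    return (info_nce_loss(anchors, positives)
            + lam * type_classification_loss(anchors, W, labels))
```

With well-aligned pairs (each anchor closest to its own positive), the contrastive term is near zero; shuffling the positives drives it up, which is the signal that pulls paraphrases together and pushes unrelated phrases apart.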
Main file: eacl-2024.pdf (701.87 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04465022, version 1 (19-02-2024)

Identifiers

  • HAL Id: hal-04465022, version 1

Cite

Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek. Learning High-Quality and General-Purpose Phrase Representations. EACL 2024 - The 18th Conference of the European Chapter of the Association for Computational Linguistics, Mar 2024, Valletta, Malta. ⟨hal-04465022⟩