Automated construction of a software-specific word similarity database

Yuan Tian; David Lo; Julia Lawall

doi:10.1109/CSMR-WCRE.2014.6747213

Conference paper, Year: 2014

Yuan Tian

Role: Author

Singapore Management University

David Lo

Role: Author

Singapore Management University

Julia Lawall

Role: Author
PersonId: 913893

Well Honed Infrastructure Software for Programming Environments and Runtimes

Large-Scale Distributed Systems and Applications

Abstract

Many automated software engineering approaches, including code search, bug report categorization, and duplicate bug report detection, measure the similarity between two documents by analyzing their natural language contents. Different words are often used to express the same meaning, so measuring similarity by exact matching of words is insufficient. To solve this problem, past studies have shown the need to measure the similarity between pairs of words. To meet this need, the natural language processing community has built WordNet, a manually constructed lexical database that records semantic relations among words and can be used to measure how similar two words are. However, WordNet is a general-purpose resource and often does not contain software-specific words. Also, the meanings of words in WordNet often differ from their meanings in a software engineering context. Thus, there is a need for a software-specific WordNet-like resource that can measure the similarity of words.

In this work, we propose an automated approach that builds a software-specific WordNet-like resource, named WordSim-SE-DB, by leveraging the textual contents of posts in StackOverflow. Our approach measures the similarity of words by computing the similarities of the weighted co-occurrences of these words with three types of words in the textual corpus. We have evaluated our approach on a set of software-specific words and compared it with an existing WordNet-based technique (WordNet-res) on the task of returning the top-k most similar words. Human judges are used to evaluate the effectiveness of the two techniques. We find that WordNet-res returns no result for 55% of the queries. For the remaining queries, WordNet-res returns significantly poorer results.
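The general idea of comparing words through their weighted co-occurrences can be illustrated with a minimal Python sketch. This is not the WordSim-SE-DB implementation: the abstract does not specify the weighting scheme, the similarity measure, or the three types of context words, so the sketch assumes a fixed co-occurrence window, positive pointwise mutual information (PPMI) as the weight, and cosine similarity. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Count how often each word appears within `window` tokens of each context word."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def ppmi_weight(vectors):
    """Reweight raw counts with positive PMI (an assumed weighting, not the paper's)."""
    total = sum(sum(ctx.values()) for ctx in vectors.values())
    word_totals = {w: sum(ctx.values()) for w, ctx in vectors.items()}
    context_totals = Counter()
    for ctx in vectors.values():
        context_totals.update(ctx)
    weighted = {}
    for w, ctx in vectors.items():
        weighted[w] = {}
        for c, n in ctx.items():
            pmi = math.log((n * total) / (word_totals[w] * context_totals[c]))
            if pmi > 0:  # keep only positive associations
                weighted[w][c] = pmi
    return weighted


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(word, weighted, k=3):
    """Return the k words whose weighted context vectors are closest to `word`."""
    scores = [(other, cosine(weighted[word], vec))
              for other, vec in weighted.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy corpus standing in for tokenized StackOverflow posts (hypothetical data).
corpus = [
    "the method throws an exception".split(),
    "the function throws an exception".split(),
    "call the method with an argument".split(),
    "call the function with an argument".split(),
    "the exception has a message".split(),
]
weighted = ppmi_weight(cooccurrence_vectors(corpus))
print(top_k_similar("method", weighted, k=3))
```

In this toy corpus, "method" and "function" appear in the same contexts, so they are ranked as most similar to each other even though they never co-occur directly. That is the property that makes co-occurrence-based similarity useful when exact word matching fails.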

Keywords

stack overflow synonyms

Domains

Computer Science [cs]

Contributor: Julia Lawall

https://inria.hal.science/hal-01086077

Submitted on: Friday, 21 November 2014, 17:48:04

Last modified on: Tuesday, 3 October 2023, 17:18:04

Dates and versions

hal-01086077, version 1 (21-11-2014)

Identifiers

HAL Id: hal-01086077, version 1
DOI: 10.1109/CSMR-WCRE.2014.6747213

Cite

Yuan Tian, David Lo, Julia Lawall. Automated construction of a software-specific word similarity database. 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering, CSMR-WCRE, Feb 2014, Antwerp, Belgium. pp.44-53, ⟨10.1109/CSMR-WCRE.2014.6747213⟩. ⟨hal-01086077⟩

Collections

UPMC CNRS INRIA LIP6 INRIA2 SORBONNE-UNIVERSITE SU-SCIENCES
