Improving Language-Dependent Named Entity Detection



Introduction
The recognition of named entities is an important starting point for many tasks in the area of natural language processing. Named Entity Recognition (NER) refers to methods that identify names of entities such as people, locations, organizations and products [1,2]. It is typically broken down into the two subtasks of entity detection (or "spotting") and entity classification. In many application scenarios, however, it is not only of interest which types of entities are contained in a text, but also how the entities can be semantically linked to a knowledge base. The task of correctly disambiguating the recognized named entities in a text and linking them to a knowledge base with an external definition and description is referred to as Named Entity Linking (NEL) [3]. The overall goal is to make sense of data in the context of an application domain [4]. The whole pipeline (including the aforementioned tasks of NER and NEL) is strongly dependent on the knowledge base used to train the named entity extraction algorithm [5]. Most approaches for linking entities leverage Wikipedia (wikipedia.org), DBpedia (dbpedia.org), Freebase (freebase.com) or YAGO (yago-knowledge.org) as the knowledge base. Although Wikipedia is widely used and is the largest online encyclopedia with millions of articles, it may not be sufficient for more specific domains and contexts. For example, in the German Wikipedia, only some large Austrian organizations are represented, names of persons are rare, etc. Moreover, the English Wikipedia does not hold this specific information either. Among others, Piccinno and Ferragina [6] recognized that recent research tends to focus its attention on the NEL step of the pipeline by trying to improve and implement new disambiguation algorithms. However, ignoring the issues raised by entity recognition leads to the introduction of many false positives, which provoke a significant loss in the overall performance of the system. It would therefore be better to first try to improve the quality of the NER spotter.
Another problem area relates to differences in the language itself. It has been acknowledged that linguistically motivated and thus language-aware spotting methods are more accurate than language-independent methods [7]. The German language differs in many respects, for example in the use of upper- and lowercase, compound nouns, or hyphens to concatenate nouns. However, improvements in a certain language usually come at the expense of ease of adaptation to new languages. In addition, the established NER/NEL challenges and tasks of the scientific community, like the OKE challenges [8], the NEEL challenge series [9], or the ERD challenges [10], are in the English language, and therefore language-dependent improvements are often not in the focus of the research.
Moreover, the results from different tools need to be comparable against certain quality measures (cf. Section 4.1) based on the same dataset. Frameworks addressing the continuous evaluation of annotation tools, such as GERBIL [11,12], can be used for comparison, but the evaluation datasets provided by GERBIL as "gold standards" are likewise only available for the English language.
The objective of this paper therefore is to (i) develop an approach for language-aware spotting and (ii) to evaluate the proposed spotting approach for the German language.
After an analysis of the state of the art in spotting methods in general (Section 2), the paper focuses on possibilities to optimize the spotter for a certain language in Section 3. In Section 4, evaluation measures are discussed, followed by an analysis of available datasets for evaluation. Additionally, we show how a German dataset was developed and used for evaluation purposes. Section 5 presents the results of the experiments, and final conclusions are drawn in Section 6.

State of the Art in Entity Detection (Spotting)
As mentioned above, entity detection ("spotting") is an important task in the area of NEL; several authors emphasize the importance of correct entity spotting in order to avoid errors in later stages of the entity linking task [6,13]. Several approaches to the spotting task can be identified in the literature:

NER tagger. Some tools and approaches rely on existing implementations of NER taggers such as the Stanford NER tagger or OpenNLP Named Entity Recognition in order to spot surface forms of entities [14][15][16][17][18]. The Stanford NER tagger is an implementation of linear chain Conditional Random Field (CRF) sequence models; the OpenNLP NER is based on a Maximum Entropy model.

POS tags and rules. A number of authors use part-of-speech (POS) taggers and/or several rules in order to identify named entities [19][20][21][22][23][24][25][26][27][28][29][30][31][32][33][34]. The rules range from simple ones, such as a "capitalized letter" rule (if a word contains a capitalized letter, the word will be treated as a spot), stop word lists, or an "At Least One Noun Selector" rule, to complex, combined rules.

Dictionary-based techniques. The majority of approaches leverage techniques based on dictionaries [6,19,31,[35][36][37][38][39][40][41][42][43][44][45]. The structure of Wikipedia provides useful features for generating dictionaries:
─ Entity pages: Each page in Wikipedia contains a title (e.g. "Barack Obama") that is very likely the most common name for an entity.
─ Redirect pages: Wikipedia contains redirect pages for each alternative name of an entity page. E.g. "Obama" is a redirect page to "Barack Obama".
─ Disambiguation pages: Disambiguation pages in Wikipedia are used to resolve conflicts with ambiguous article titles. E.g. "Enterprise" may refer to a company, to aircraft, to Star Trek, and many more. Disambiguation pages are very useful for extracting aliases and abbreviations.
─ Bold phrases: Bold phrases in the first paragraph of a Wikipedia entry can contain useful information such as abbreviations, aliases or nicknames. E.g. the bold phrase in the page "Barack Obama" contains the full name ("Barack Hussein Obama II").
─ Hyperlinks in Wikipedia pages: Pages in Wikipedia usually contain hyperlinks to other pages; the anchor texts of these hyperlinks may provide synonyms and other name variations.

Methods based on search engines. Some authors try to use web search engines such as Google to identify candidate entities [46][47][48][49].
The majority of the papers use dictionary approaches. Nevertheless, the above-mentioned approaches are usually combined; e.g. [6] combines OpenNLP NER with a dictionary approach based on Wikipedia, utilizing several features such as anchor texts, redirect pages, etc. The authors usually provide measures (recall, precision, F1) of the effectiveness of their approaches; unfortunately, these measures cannot be directly compared, because usually different datasets are used and the approaches are optimized towards these datasets.
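To make the dictionary-based family of approaches concrete, the following minimal sketch matches n-grams of a text against a surface-form dictionary. The dictionary is hard-coded here for illustration; in practice it would be extracted from Wikipedia titles, redirects and anchor texts as described above. Matching is case-insensitive for simplicity (Section 3 argues for case sensitivity in German).

```python
# Illustrative surface-form dictionary; in a real system these entries
# would come from Wikipedia titles, redirect pages and anchor texts.
surface_to_entities = {
    "barack obama": ["Barack_Obama"],
    "obama": ["Barack_Obama"],  # from a redirect page
    "enterprise": ["Enterprise_(company)", "USS_Enterprise"],
}

def spot(text, max_ngram=6):
    """Return (start_token, end_token, candidates) for each dictionary hit,
    preferring the longest match starting at each token position."""
    tokens = text.lower().split()
    spots = []
    for i in range(len(tokens)):
        for n in range(min(max_ngram, len(tokens) - i), 0, -1):
            shingle = " ".join(tokens[i:i + n])
            if shingle in surface_to_entities:
                spots.append((i, i + n, surface_to_entities[shingle]))
                break  # longest match at position i wins
    return spots

print(spot("Obama met the Enterprise board"))
```

This is only a sketch of the technique; real spotters additionally resolve overlapping matches and attach statistics such as commonness to each candidate.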

TOMO Approach to Optimize Spotter for the German Language
This section details our approach to optimizing a spotter ("TOMO") for the German language, with a focus on the spotting phase within the entity linking pipeline. The base system used is Dexter [36,37], an open-source framework (available at https://github.com/dexter/dexter) that implements a dictionary spotter using Wikipedia content. Fig. 2 shows the approach, comprising the construction process and the annotation process using a dictionary.

Fig. 2: Basic TOMO Architecture
The annotation process involves a spotter and a disambiguator, with an annotated text as output. The spotter detects a list of candidate mentions in the input text and retrieves for each mention a list of candidate entities [36]. When spotting a text, individual fragments or words from the text ("shingles") of up to six words ("n-grams") are compared with the dictionary. Before the dictionary can be used for NER, it needs to be filled with known entities first. Therefore, each Wikipedia article is processed, using the title of the article as well as all internal links (anchors within Wikipedia) as spots for the dictionary. In addition, the measures of mention frequency (mf), link frequency (lf) and document frequency (df) are calculated and stored as well. Both in the construction of the dictionary and in the annotation of a text based on this dictionary, the text fragments (shingles and known entities) go through a cleaning pipeline with a series of replacements. The cleaning pipeline in pseudocode is as follows:

    foreach (article in knowledgebase)
        listOfSpots = preprocess(getTitle(article))
        listOfSpots = preprocess(getAnchors(article))
        calculateMeasures(listOfSpots)

    preprocess(textfragment)
        clean(textfragment)
        filter(textfragment)
        map(textfragment)

A "cleaner" performs a simple transformation of a text fragment (e.g. transform a text to lowercase, remove symbols, remove quotes, unescape JavaScript, clean parentheses, etc.). A "filter" allows the removal of a given text fragment if it does not respect a filter constraint (e.g. delete text fragments that are below the threshold for commonness, have fewer than three characters, or consist only of numbers or symbols). A "mapper" returns several different versions of the spot (e.g. a "quotes mapper" generates from [dave "baby" cortez] the spots [dave "baby" cortez], [baby], [dave cortez]) [56].
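The clean/filter/map chain can be sketched in Python as follows. The concrete rules (whitespace and parenthesis cleaning, a three-character filter, a quotes mapper) are illustrative stand-ins for the kinds of steps described above, not Dexter's actual implementation.

```python
import re

def clean(fragment):
    # cleaner: remove parenthesised additions and collapse whitespace
    fragment = re.sub(r"\s*\(.*?\)\s*", " ", fragment)
    return re.sub(r"\s+", " ", fragment).strip()

def keep(fragment, min_len=3):
    # filter: drop fragments that violate a constraint
    return len(fragment) >= min_len and not fragment.isdigit()

def map_variants(fragment):
    # mapper: emit several versions of one spot, here a "quotes mapper"
    variants = {fragment}
    m = re.match(r'(.+?)\s+"(.+?)"\s+(.+)', fragment)
    if m:
        variants.add(m.group(2))                    # the nickname alone
        variants.add(f"{m.group(1)} {m.group(3)}")  # name without nickname
    return variants

def preprocess(fragment):
    fragment = clean(fragment)
    return map_variants(fragment) if keep(fragment) else set()

print(preprocess('dave "baby" cortez'))
```

The quotes-mapper example reproduces the behavior described in the text: one surface form yields three dictionary spots.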
Moreover, for simplification purposes, many tools use lowercase filters. Full-text search indices such as Lucene also imply such lowercase behavior by default, which makes sense for many tasks (e.g. search engine querying, microblogging analysis). In our setting, lowercase simplification is responsible for introducing several spotting errors. Consider the sentence "the performance is worse": the word "worse" translates to "schlechter" in German, and the spotter identifies this word as a candidate entry for the Wikipedia page "Carl Schlechter". In the German language, only nouns and proper names are written with a capitalized initial letter.

Language-aware preprocessing pipeline
In a setting where typing errors are relatively rare (e.g. in press releases or formal documents), the application of a case-sensitive setting is therefore a reliable and straightforward approach to increase the precision of the spotter for the German language [57].
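The effect can be illustrated with a toy dictionary (the entry is invented for illustration): with case-insensitive matching, the lowercase adjective "schlechter" (worse) wrongly matches the surname entry for Carl Schlechter; with case-sensitive matching it does not.

```python
# Toy dictionary: "Schlechter" is a surname, "schlechter" an adjective.
dictionary = {"Schlechter": "Carl_Schlechter"}

def spot_tokens(tokens, case_sensitive=True):
    """Return the tokens that match a dictionary entry."""
    if case_sensitive:
        return [t for t in tokens if t in dictionary]
    lowered = {k.lower() for k in dictionary}
    return [t for t in tokens if t.lower() in lowered]

sentence = "die Leistung ist schlechter".split()
print(spot_tokens(sentence, case_sensitive=False))  # false positive
print(spot_tokens(sentence, case_sensitive=True))   # no spot
```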
Another important aspect of a language-aware approach is the correct usage of the code page. For the English language, the US-ASCII code page is the preferred setting, as it uses less space than other code pages. In the German language, however, many named entities contain non-US-ASCII characters, like umlauts or the German eszett. Using US-ASCII filters, these characters are replaced by their ASCII representation (the umlaut ä gets replaced by an "a", etc.). This sometimes changes the whole meaning of the word, as the folded forms are themselves words used in the German language, and this gets even worse in combination with lowercase filtering. Consider the sentence "we made this", with its German translation "Wir haben das gemacht": the word "made" translates to "gemacht" in German, and this is disambiguated to "Gemächt" (the male genitalia). The UTF-8 code page can be used as a solution to this problem.
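The folding behavior can be reproduced with Python's standard library. This is a generic ASCII-folding sketch, not the filter of any particular tool, but it shows how distinct German words collapse onto one key once folding and lowercasing are combined.

```python
import unicodedata

def ascii_fold(text):
    # Decompose characters (NFKD), then drop anything outside US-ASCII.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()

# "Gemächt" folds and lowercases to "gemacht" (made) -- a different word.
print(ascii_fold("Gemächt").lower())  # gemacht
# The eszett has no ASCII decomposition and is dropped entirely here,
# another way naive folding damages German surface forms.
print(ascii_fold("Straße"))           # Strae
```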
Additionally, some minor issues may occur due to differences in the language of the Wikipedia syntax itself. For instance, it is possible to link images within Wikipedia with the common English terms "File:" or "Image:", but the German Wikipedia additionally allows the deprecated terms "Datei:" or "Bild:" as well. Such filters therefore also need to be aware of differences in the German language in order to improve spotting.

Evaluation Measures and Datasets
In this section, we discuss the evaluation of spotting named entities in the German language.This includes which measures, tools and datasets to use.

Measures and Benchmarking
To ensure comparability across different NER and NEL system evaluations, the most common measures are precision, recall, F1 and accuracy.

Precision. Precision considers all spots that are generated by the system and determines how correct they are compared to a gold standard. Consequently, the precision of a spotting system is calculated as the fraction of correctly spotted entity mentions compared to all spotted mentions generated by a particular system:

    precision = |correctly spotted mentions| / |all mentions spotted by the system|    (1)

Recall. Recall is a measure that describes how many of the spots of a gold standard are correctly identified by a system. It is the fraction of correctly spotted entity mentions by a particular system compared to all entity mentions that should be spotted according to a selected gold standard:

    recall = |correctly spotted mentions| / |all mentions in the gold standard|    (2)

F1. To generate a single measure for a system from recall and precision, the measure F1 was developed. It is defined as the harmonic mean of precision and recall as shown in equation 3:

    F1 = 2 · precision · recall / (precision + recall)    (3)
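The three measures can be computed in a few lines when spots are represented as character offsets and compared by exact match (the offsets below are invented for illustration):

```python
def prf1(predicted, gold):
    """Precision, recall and F1 for spots given as (start, end) offsets."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # correctly spotted mentions
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 12), (20, 26), (40, 52)]
pred = [(0, 12), (20, 26), (30, 35)]  # two correct spots, one false positive
print(prf1(pred, gold))
```

Note that benchmark platforms also support weaker matching modes (e.g. partial overlap); exact match is the strictest variant.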

GERBIL.
As minor differences between implementations of these measures exist, we use the web-based benchmarking system GERBIL (gerbil.aksw.org) to evaluate these measures for our system and compare them with others. GERBIL is an entity annotation system that provides a web-based platform for the comparison of annotators [11]. Currently it incorporates 13 annotators and 32 datasets for evaluating the performance of systems. The evaluation is done using uniform measuring approaches and well-established measures like the aforementioned recall, precision and F1 [12]. Consequently, GERBIL can be used for benchmarking different annotators. External tools can be added to the GERBIL platform by providing a URL to a REST interface of the tool. Besides the integrated datasets, GERBIL allows for the use of user-specified datasets. As GERBIL is based on the NLP Interchange Format (NIF), user-specified datasets also have to be uploaded in this format. Additionally, GERBIL provides Java classes for connecting datasets and annotator APIs to NIF. Due to these features, GERBIL is also used by challenges (e.g. the OKE challenge) as a platform for evaluating the performance of contestants.

Dataset
To test an entity linking system, a gold standard dataset must be provided. This dataset has to include all sentences to analyze, the spots to be linked, and links to a knowledge base for correct disambiguation. Systems are then ranked by comparing the above-mentioned evaluation measures (recall, precision and F1) they score in relation to this dataset. There are already a number of English corpora to test entity recognition and entity linking systems. Some of them emerged from challenges that compare the results of multiple algorithms and systems to assess the performance of different approaches. For example, the datasets of the OKE challenge as part of the European Semantic Web Conference 2016 (2016.eswc-conferences.org), the NEEL challenge of the Microposts Workshop at the World Wide Web Conference 2016 (microposts2016.seas.upenn.edu), or the ERD challenge at the SIGIR 2014 Workshop ERD'14 (sigir.org/sigir2014) are publicly available.

Requirements and Review of Existing Datasets.
To test the performance of our approach to spotting entities in the German language, we had to select or develop a dataset. For this task, we defined the following requirements:

─ The dataset has to be available in German to test the performance of the spotter for German texts.
─ The dataset should be testable via the GERBIL web service. Thus, it should either already be available in GERBIL or be encoded in NIF format.
─ The dataset should be widely used, specifically by new systems, to be able to compare our results with leading systems and approaches.
─ The dataset should be independent of a certain domain (e.g. not only articles about economics).
─ The content of the dataset should be comprised of natural language in encyclopedia entries or news. Specific content like tweets or queries was not of interest, since such datasets often have just very few spots, with an average of fewer than 2.0 entities per document.
─ The dataset should include co-references to evaluate the performance improvements of future enhancements of our system.
We examined existing datasets and their suitability with regard to our requirements. Table 1 shows the results of this literature review, which showed that nearly all available datasets are for the English language. The three German datasets found were not appropriate for our requirements because they were too domain-specific (News-100, LinkedTV), contain only the "classic" named entities (persons, locations, etc.), had no co-references defined, and/or are not publicly accessible (GerNED). Since none of these datasets fitted our requirements, we decided to develop a new dataset to evaluate the spotter against German texts.
Development of a German gold standard dataset. We chose to develop a new German dataset based on the evaluation dataset of the OKE challenge 2016 ("OKE 2016 Task 1 evaluation dataset") for several reasons. Since the content of this dataset originated from Wikipedia articles, it covers a wide range of topics. It also consists of natural language and not of tweets or search queries. Furthermore, the documents are long enough to contain multiple spots (6.18 entities per document on average), and they include co-references as well. Additionally, the English version of the dataset is encoded in NIF format and is already integrated in GERBIL. Finally, with 55 documents and 340 entities, we considered this dataset to be of an appropriate size.
To develop the new dataset from this English original, we conducted a multi-step approach that consisted of the following tasks:

1. Identify all documents and included spots in the NIF file of the OKE 2016 Task 1 evaluation dataset.
2. Translate all documents in this dataset using Google Translate (translate.google.com).
3. Have native speakers adjust the initial Google translation by improving German grammar, word order, etc.
4. Identify all English spots of the dataset in the German translation.
5. Identify the corresponding entities in the German knowledge base (de.wikipedia.org).
6. Link the spots to the identified knowledge base entities using links in an HTML file.
7. Transform the HTML file to NIF using a converter.

This process was not straightforward, and a number of problems occurred that were mainly based on ambiguities in steps 5 and 6:

─ Because the English Wikipedia is more than twice as large as the German Wikipedia, some spots had no representation in the German knowledge base. This was mainly the case with persons (e.g. Andrew McCollum, James Alexander Hendler) and organizations (e.g. Kirkcaldy High School, American Association for Artificial Intelligence).
─ Literal translation by Google Translate led to surface forms that were wrong or unusual (e.g. "artificial intelligence researcher" was translated to "künstlicher Intelligenzforscher").
─ Translation by Google Translate led to sentence structure and grammar that was sometimes unusual or incorrect for German sentences.
─ In some cases, it was not clear which term was the correct German translation of the English term in the specific context of a sentence (e.g. "independent contractor" was translated to "unabhängiger Auftragnehmer" by Google Translate, but "freier Mitarbeiter" was considered to be the correct translation for the context of the sentence).
─ In a few cases, it was not clear to which entity in the German knowledge base an entity should be linked (e.g. the English term "treasury" can be translated, based on traditional British or American interpretations of the word, as "Finanzministerium" or "Schatzamt", but is now also used in its English form for a department of corporations).

In order to cope with these uncertainties, three researchers (German native speakers) independently identified the corresponding German surface form for each spot based on the translated text. For every spot that led to different surface forms or links, the different solutions from the three authors were discussed and a majority decision was made by voting.
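The majority decision among the three annotators can be sketched as follows (annotator choices are invented for illustration; ties fall back to discussion, as described above):

```python
from collections import Counter

def majority(choices):
    """Return the majority choice, or None when no majority exists
    (a tie is resolved by discussion among the annotators)."""
    counts = Counter(choices).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no majority
    return counts[0][0]

votes = ["freier Mitarbeiter", "freier Mitarbeiter", "unabhängiger Auftragnehmer"]
print(majority(votes))  # freier Mitarbeiter
```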
As a result, some spots were not available in the German knowledge base, and therefore the resulting dataset has fewer spots than the original. Since not all systems currently support co-references, we developed two versions of the dataset: one with co-references and one without (15 documents incorporated a total of 24 co-references). The resulting corpus can be downloaded at https://github.com/HCSolutionsGesmbH/OKE-Challenge-German.

Experiments and Results
Based on the discussion in Section 4, we built different test cases to evaluate the differences between a language-independent (n/a) and an explicit German-language (de) setting. In addition, we considered case sensitivity as a test case for our experiments and built models based on a case-sensitive (cs) and a case-insensitive (cis) setting. These model characteristics led to the four different test cases shown in Fig. 3.

Fig. 3: Test cases
Using the German case-sensitive model, we experimented with the commonness threshold. We cut out 0%, 5%, 10%, 15% and 20% of the spots with the lowest commonness and evaluated the resulting F1 scores. Results showed that using all spots (i.e. not cutting out any) led to the highest F1 score; this setting resulted in the highest recall without lowering the precision too much. Consequently, we tested all four test cases using this setting in GERBIL with our developed dataset, which is based on the German Wikipedia dump from 2017/05/01. Table 2 shows the resulting scores for recall, precision and F1. The annotators TagMe 2, xLisa-NER and xLisa-NGRAM of GERBIL did not produce any results (the GERBIL experiment reported: "The annotator caused too many single errors.") and could not be evaluated. FRED produced the highest recall: as FRED aims at producing formal structure graphs from natural language text and is based on a dictionary comprising different knowledge bases, including WordNet, it is geared towards high recall.
On the other hand, FRED automatically translates the input text beforehand, which may lead to a decrease in precision [45]. FOX combines the results of several state-of-the-art NER tools using a decision-tree-based algorithm and performed best on the precision measure [58,59]. In addition, the tool is capable of automatically detecting German-language text input. Results also show that the TOMO approach using the German language setting in combination with the case-sensitive model achieved the highest F1 score among all tested annotators.
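The commonness cut-off experiment described above can be sketched as follows (surface forms and commonness values are invented): the lowest-ranked fraction of spots is dropped, and the remainder is evaluated.

```python
def cut_lowest(spots, fraction):
    """spots: list of (surface_form, commonness) pairs.
    Drop the given fraction of spots with the lowest commonness."""
    ranked = sorted(spots, key=lambda s: s[1])
    k = int(len(ranked) * fraction)
    return ranked[k:]

spots = [("Wien", 0.9), ("schlechter", 0.01), ("Obama", 0.8), ("Bank", 0.3)]
# Cutting 25% removes the rare, likely-spurious spot "schlechter".
print([s for s, _ in cut_lowest(spots, 0.25)])
# A fraction of 0.0 keeps all spots, the setting that scored best above.
print(len(cut_lowest(spots, 0.0)))
```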

Conclusions and Future Work
The paper discusses an approach for language-aware spotting and evaluates the proposed spotting approach for the German language. The results indicate that language-dependent features do improve the overall quality of the spotter. This matters because errors introduced in the spotting phase affect the disambiguation step and can hardly be corrected later. A limitation of this work is that the performance metrics of TOMO vs. other systems are only partially comparable, because the other annotators were either developed only for the English language or do not take language specifics into account. However, we were able to show that language-dependent features improve spotting quality. With the availability of a dataset in both German and English, it is possible to directly compare the performance of systems across languages.
When the authors of this paper developed the German corpus, many discussions arose about which surface forms should be linked to the knowledge base. For example, this text (taken from the OKE 2016 Task 1 evaluation dataset) contains several links (shown as underlined words): "Ray Kurzweil grew up in the New York City borough of Queens. He was born to secular Jewish parents who had emigrated from Austria just before the onset of World War II." It is not quite clear why "parents" is linked to an entity, while some text fragments that are probably more in need of explanation, such as "Jewish" or "World War II", are not spots. Wikipedia contains a separate page that provides guidelines for linking. These guidelines suggest, for example, not to link everyday words, but to link to other articles that will help the reader to understand the context more fully [60]. However, every gold standard obviously represents a certain way of thinking, and performing well on a certain gold standard just means the system replicates that way of thinking very well.
Further research should include the discussion and development of guidelines or rules regarding which terms should be annotated in a gold standard dataset, in order to align the different evaluation datasets. Furthermore, the population of a cross-language and cross-domain gold standard for evaluating annotation systems for different purposes would be of value for the community.

Table 1 .
Comparison of gold standard datasets

Table 2 .
Evaluation of spotting results