Building Time-Affordable Cultural Ontologies Using an Emic Approach

Abstract. Recently, studies about culturally-aware systems have arisen to address digitized culture. Among these systems, enculturated systems, which are driven by cultural knowledge, embed culture in their design. To deal with the specifics of cultural groups, the development of machine-readable cultural knowledge representations can provide substantial help. In this research we present a process to build time-affordable, emic, conceptually-sound and machine-readable cultural representations. These representations originate from Cognitive Anthropology. They follow a three-step methodology: ethnographic sampling, individuals' personal knowledge elicitation and cultural consensus analysis. We use lexico-semantic relation extraction as a means to automatically elicit knowledge structures. Their formalisation is achieved through Ontology Engineering. We conducted experiments to build three cultural ontologies in order to assess the whole process. It turned out that, with the lexico-semantic relation extraction technique, the best representations we can obtain are consensually limited, incomplete and contain some errors. However, many clues indicate that these problems could be solved by using higher-quality elicitation techniques.


Introduction
Interest in cultural awareness grows as globalisation is a vector of increasing cultural diversity. Since the s, with the rapidly expanding web, culture has been digitized, and computer systems are now the entities most exposed to its diversity. Culture shapes users' behaviors and thus impacts the performance of many systems and applications. That is why these systems have to develop cultural awareness.
Blanchard et al. [ ] define culturally-aware systems as "any system where culture-related information has had some impact on its design, runtime or internal processes, structures, and/or objectives". They present three types of systems: enculturated systems, runtime cultural adaptation systems and cultural data management systems. Enculturated systems are systems whose design meets the cultural requirements of given cultures [ ]. Runtime cultural adaptation systems aim to artificially reproduce cultural intelligence through two steps: understanding and adaptation. In other words, by identifying one's culture, a culturally-intelligent system can provide the right enculturation, as presented by Rehm [ ].
The enculturation of a system is constrained by the cultural knowledge available to the system or its designer. That is why machine-readable representations providing understanding about cultures could effectively support the development of these systems.
Two approaches can be used to produce representations of cultures. The etic approach aims to find cultural universals; it is an outsider view of culture. In contrast, the emic approach tries to identify the specifics of a culture, such as its concepts and behaviors; insight is gained from the inside. Currently, cultural knowledge representations used to support the development of enculturated systems are etic-based. Their main appeal is that they are ready-to-use representations easily applicable to any culture [ , ]. However, these representations are coarse-grained and limit the understanding of the cultures they describe [ ]. Therefore, finer-grained emic-based representations are more relevant to develop enculturated systems.
While emic-based representations solve the problem associated with the lack of granularity, their creation is time-consuming. Most of the methodologies used in practice by ethnographers require intensive human intervention (from the ethnographers or the participants) in the process of eliciting knowledge. Therefore, this process is hardly scalable, and thus not practicable to deal with the diversity of cultures. As such, the process supporting the construction of emic-based cultural representations must be largely automatic.
In this paper we present a process, applicable to any cultural domain, to build time-affordable, emic, conceptually-sound and machine-readable cultural knowledge representations. To construct these representations we followed a methodology coming from Cognitive Anthropology. It is composed of three steps leading to the acquisition of culturally-relevant information: ethnographic sampling, individuals' personal knowledge elicitation and cultural consensus analysis. The time-affordable elicitation of knowledge and its formalisation are similar to what already exists in other ontology engineering works such as SPRAT [ ] or DYNAMO [ ] (https://www.irit.fr/dynamo/). We follow Hearst's [ ] method to automatically extract hypernym/hyponym relations from texts. As for the formalisation of the representations, we rely on the Resource Description Framework (RDF) formal language. Therefore, this research is about the emic and automatic generation of cultural ontologies from texts. Our plan is as follows. We begin by introducing the methodology, starting with the creation of the cultural knowledge representations and ending with their formalisation. Then, we present our process and the associated design choices. We end by extensively experimenting with our process on the public safety domain, with police forces coming from Australia, the USA and England. We conclude this study with the encouraging results obtained.

Emic-based Cultural Knowledge Representations
Ethnography is the process of collecting, recording and searching for patterns to describe the culture of a people. In other words, ethnography is about discovering cultural knowledge, leading to the production of cultural knowledge representations. "New ethnography", ethnoscience or Cognitive Anthropology are founded on the premise that culture is a "conceptual mode underlying human behavior" [ ]. The cognitive theory of culture situates culture in the mind, as a system of learnt and shared knowledge [ , ].
This theory shaped a number of methodologies to produce cultural representations which are intrinsically emic. "Ethnographers must discover the organizing principles of a culture - the semantic world of the natives - while avoiding the imposition of their own semantic categories on what they perceive" [ ].
To our knowledge, there is no clearly defined methodology to create cultural representations. Most of the ones developed in the literature are based on the ethnographers' experiences. However, these methodologies share three main steps: ethnographic sampling, individuals' personal knowledge elicitation and cultural consensus analysis [ -].

Ethnographic Sampling
The ethnographic sampling step is based on the idea that cultural knowledge is socially-constructed. It aims to capture a representative number of individuals likely to share the same culture and thus similar knowledge. This task is generally achieved through the identification of a community, a set of individuals with long-term, strong, direct, intense, frequent and positive relations [ ].
Once the ethnographic sample is determined, the knowledge of each participant needs to be elicited.

Individuals' Personal Knowledge Elicitation
Individuals' personal knowledge is organised in structures made of concepts and lexico-semantic relations [ , ]. They constitute the core of any conceptualisation. As such, individuals' knowledge elicitation is mainly about acquiring concepts and lexico-semantic relations.
After eliciting the personal knowledge structures of each individual constituting the sample, their distribution has to be analysed to determine their cultural dimension.

Cultural Consensus Analysis
The three steps of the methodology lead to the production of cultural knowledge representations. However, as such, they cannot be used for the development of enculturated systems, as computer systems are not yet able to make sense of them. To be understandable, they have to be formalised.

Formal Cultural Knowledge Representations
The cultural representations are composed of knowledge structures. The formalisation of such structures is studied in the field of Knowledge Engineering, more precisely in its Ontology Engineering subfield. Therefore, methodologies to build ontologies could be used to formalise the cultural representations.

Ontologies
Gruber defined an ontology as "an explicit specification of a conceptualisation" [ ]. The term 'explicit' in Gruber's definition means that the knowledge must be specified unambiguously, constraining its interpretation. The principal components of an ontology are labels, concepts, relations and axioms. Axioms are rules associated with the relations in order to embed the logic necessary for reasoning.
Borst [ ] added to the former definition that the specification had to be formal and the conceptualisation shared. Indeed, it is necessary that the conceptualisation results from a consensual agreement to ascertain that the embedded knowledge is coherent and consistent within a specific context. This task is called an 'ontological commitment'. This aspect is ensured by the shared dimension of the cultural representations. The formalisation of the specification is needed for interoperability, re-usability and, especially, for enculturated systems to read cultural representations.
There are different levels of formalism depending on the language used to express the ontology, ranging from informal, mostly written in natural languages, to formal, based on machine-readable languages. Formal languages like RDF (Resource Description Framework) or OWL (Web Ontology Language) support the semantic web. RDF is a language based on entities (resource, property, value) which constitute triples of the form (subject, predicate, object). Resources are concepts described by a Uniform Resource Identifier (URI), which makes sense since ontologies are non-ambiguous specifications. Properties can be attributes or any other kind of relations, most likely semantic ones. Values are literals pointing either to a symbol or to another resource. The common syntax used to serialise RDF is XML, called RDF/XML. Ontologies written in RDF can be queried by machines through the SPARQL Protocol and RDF Query Language (SPARQL).
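As a minimal illustration (the example.org namespace is hypothetical), a hypernym/hyponym pair such as ('animal', 'dog') can be expressed in RDF/XML as two classes linked by rdfs:subClassOf:

```xml
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <!-- the hypernym becomes a class... -->
  <rdfs:Class rdf:about="http://example.org/culture#Animal"/>
  <!-- ...and the hyponym a sub-class of it -->
  <rdfs:Class rdf:about="http://example.org/culture#Dog">
    <rdfs:subClassOf rdf:resource="http://example.org/culture#Animal"/>
  </rdfs:Class>
</rdf:RDF>
```

The URIs guarantee that each concept is identified unambiguously, in line with Gruber's requirement of an explicit specification.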

METHONTOLOGY
Methodologies to create ontologies are mostly based on experience [ ]. METHONTOLOGY is a proven framework describing the general steps to build an ontology [ ]. The common steps are specification, conceptualisation, formalisation, implementation and evaluation.
The specification consists in planning the production and exploitation of an ontology. At a minimum, it defines its primary purpose, level, granularity and scope. These specifications are mainly guidelines for the conceptualisation. Typically, the conceptualisation step is carried out by a group of domain experts. The goal is to discover the significant concepts and relations related to a domain [ ]. The formalisation step expresses the conceptualisation with formal languages. It is often manually supervised by knowledge engineers or supported by software such as Protégé. Mapping techniques can also be used to automatically transpose informal into formal knowledge [ ]. The implementation step addresses the technical and practical aspects associated with the usage of an ontology by a computer system. The evaluation step validates each step against the specifications.
Following the METHONTOLOGY, we are able to produce formal cultural ontologies by considering cultural knowledge representations as conceptualisations. Finally, these ontologies are readable by computer systems and can provide a significant amount of understanding about the cultures they represent.

Building Time-Affordable, Formal and Emic-based Cultural Representations
The design of our process was driven by METHONTOLOGY, whose conceptualisation step consists in the methodology coming from Cognitive Anthropology. Among the other choices required to build the process, we decided to use lexico-semantic relation extraction to make the elicitation as automatic as possible.

Selecting Individuals based on Shared Social Criteria
Typically, cognitive anthropologists select their sample through shared socially-related criteria such as genders, religions, jobs or areas (working places [ ], towns [ ] or regions [ ]). While the strength of this method comes from its ease of use and speed, its weakness is that it cannot fully guarantee that the selected individuals actually represent a community. Effective but costly techniques to identify communities can be found in the social sciences, such as the community detection algorithms coming from social network analysis.
In this study, samplings were created by following the traditional technique, as a number of studies have proved its efficiency.

Eliciting Automatically Individuals' Knowledge from Texts
Automatically extracting individuals' knowledge structures from texts is an indirect elicitation technique [ ]. It is composed of two tasks. The first one consists in collecting a sufficient amount of textual data for a given individual. The second one aims to retrieve that individual's knowledge (i.e. significant concepts and/or relations) by analyzing the data.

Collecting Web Data
Ethnographic data are mainly textual and most of the time collected through interviews or observations. Besides being costly in time, recording data through these means also biases the data to some extent. The safest and fastest technique to collect data is to gather already existing raw data. Nowadays, the web provides a large amount of freely available textual data about many individuals. In our process, the data were retrieved directly from websites. Textual data collection was achieved with HTTRACK (http://www.httrack.com/), a tool that can mirror the content of a website by crawling and downloading its files.
The automation of the data collection came with an additional constraint during the sampling step. Indeed, it became necessary to verify that the individuals composing the sampling disposed of accessible online data.

Textual Data Analysis
The goal of the data analysis is to retrieve the conceptualisation of an individual [ ]. This part of our process consists in acquiring knowledge structures by mining significant concepts and their relations. It required several preprocessing steps: we started by cleaning the data, followed with natural language processing, and ended by annotating the lexico-semantic relations to extract.

Preprocessing
The web nature of the collected data drove the cleaning operations. Web data can come in various file formats (.doc, .odt, etc.). Text extraction from any file was achieved with Apache Tika. We handled language heterogeneity by identifying the language of each document with the LangDetect API [ ]; we only kept English documents. OpenNLP was used to detect sentences. We decided to work at the sentence level rather than the document level mainly to avoid data redundancy, by ensuring that the sentences were unique. For example, documents coming from websites are often distinct from each other while containing duplicate content such as menus, Twitter or Facebook feeds and so on.
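The sentence-level deduplication can be sketched as follows (a simplified stand-in for the actual OpenNLP/GATE pipeline): keeping each sentence only once automatically discards the menu and feed content duplicated across a website's pages.

```python
def unique_sentences(documents):
    """Keep each sentence once across all documents of an individual.

    Duplicate content shared between pages of the same website
    (menus, feeds, footers) is thereby reduced to a single copy.
    """
    seen = set()
    kept = []
    for doc in documents:
        for sentence in doc:
            key = sentence.strip().lower()
            if key and key not in seen:
                seen.add(key)
                kept.append(sentence)
    return kept

docs = [
    ["Home About Contact", "Police seized stolen goods."],
    ["Home About Contact", "A constable is a police officer."],
]
print(unique_sentences(docs))
```

The shared "Home About Contact" menu line survives only once, so it cannot inflate later frequency counts.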
Then, we used the Stanford CoreNLP API to support common natural language processing operations: tokenization, Part-of-Speech (PoS) tagging and lemmatization. Finally, nominals, which constitute the main concepts of conceptualizations, were found using simple pattern matching based on the PoS tags of the tokens.
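The nominal-matching step can be sketched as follows (a deliberately simple pattern: maximal runs of noun tags, whereas the actual implementation operates on CoreNLP annotations):

```python
def extract_nominals(tagged_tokens):
    """Collect maximal runs of noun tokens (Penn Treebank tags
    starting with 'NN') as candidate nominals, e.g. 'police officer'."""
    nominals, current = [], []
    for token, tag in tagged_tokens:
        if tag.startswith("NN"):
            current.append(token.lower())
        elif current:
            nominals.append(" ".join(current))
            current = []
    if current:
        nominals.append(" ".join(current))
    return nominals

sent = [("The", "DT"), ("police", "NN"), ("officer", "NN"),
        ("arrested", "VBD"), ("a", "DT"), ("suspect", "NN")]
print(extract_nominals(sent))  # ['police officer', 'suspect']
```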
After having cleaned and preprocessed the textual data, the results were stored as annotations in a 'serial data store' using GATE (General Architecture for Text Engineering). This last operation was required to easily retrieve and mine the data.

Discovering Important Concepts
Finding significant concepts in content is based on the idea that the number of occurrences of a token and its importance are correlated. Thus, term frequency is often used to weight and rank terms. Other metrics, such as TF/IDF (Term Frequency/Inverse Document Frequency), can achieve similar results.
In our process, the important concepts were selected by coupling the quantification of nominals with a rough filtering on their total occurrences.
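This frequency-based selection can be sketched as follows (toy data; the actual cut-offs used in our experiments differ):

```python
from collections import Counter

def top_nominals(nominals, n=3, min_count=2):
    """Rank nominals by raw frequency and keep the n most frequent,
    discarding those below a rough occurrence threshold."""
    counts = Counter(nominals)
    return [(t, c) for t, c in counts.most_common(n) if c >= min_count]

occurrences = ["crime", "officer", "crime", "dog", "officer", "crime"]
print(top_nominals(occurrences))  # [('crime', 3), ('officer', 2)]
```

Replacing the raw counts with TF/IDF weights would only change the scoring function, not the selection scheme.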

Finding Significant Relations
In this study, we use the most popular method to find lexico-semantic relations. Introduced by Hearst [ ], it relies on handwritten syntactic patterns indicative of semantic relations. For example, in the sentence 'A dog is an animal', the syntactic pattern 'is a' indicates that there is a hypernym/hyponym relation between 'animal' and 'dog'. Therefore, hypernym/hyponym relations can be discovered through a simple mapping, using the expression 'Y is a X', with Y and X two nominals. Thereafter, many researchers confirmed the relevance of Hearst's methodology by applying it to other lexico-semantic relations [ -].
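A single Hearst-style pattern can be sketched as a plain regular expression (standing in for the PoS-based JAPE patterns actually used; real patterns match nominals, not arbitrary words):

```python
import re

# One Hearst-style pattern: "Y is a/an X" -> X is a hypernym of Y.
IS_A = re.compile(r"\b(\w+) is an? (\w+)\b", re.IGNORECASE)

def extract_hypernyms(sentence):
    """Return (hypernym, hyponym) pairs found in a sentence."""
    return [(m.group(2).lower(), m.group(1).lower())
            for m in IS_A.finditer(sentence)]

print(extract_hypernyms("A dog is an animal."))  # [('animal', 'dog')]
```

The low recall of such patterns, discussed in the evaluation, comes from the fact that a corpus only rarely states a relation in exactly this surface form.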
Like Wang et al. [ ], the implementation of the lexico-semantic relation extraction was achieved through the Java Annotation Patterns Engine (JAPE), which is specific to GATE. The syntactic patterns we used are summarized in the table. The final set of extracted lexico-semantic relations is constituted by filtering them according to the significance of their pairs of concepts.

Syntactic patterns
At the level of the individuals, we are able to elicit their personal knowledge. However, we cannot yet determine which part of it is cultural. To this end, we have to analyze the 'sharedness' of these distributed knowledge structures.

Aggregating Concepts and Lexico-Semantic Relations
To analyze the cultural consensus of the sample, the elicited personal knowledge (concepts and lexico-semantic relations) of each individual was aggregated. This led to a mixed representation composed of knowledge ranging from personal to cultural (similarly to Vuillot et al. [ ]). To obtain a valid cultural representation, it is necessary to evaluate the knowledge and filter it based on its distribution. At this stage we are able to create a cultural representation from an ethnographic sample. However, these representations cannot yet be implemented into enculturated systems and thus are still unusable: they have to be formalised.
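The agreement-based filtering can be sketched as follows (toy relation sets; min_agreement stands for whichever consensus threshold is chosen):

```python
from collections import Counter

def cultural_consensus(individual_relations, min_agreement):
    """Aggregate the relation sets elicited for each individual and keep
    only the relations shared by at least min_agreement individuals."""
    counts = Counter()
    for relations in individual_relations:
        counts.update(set(relations))  # count each individual at most once
    return {rel for rel, c in counts.items() if c >= min_agreement}

sample = [
    {("crime", "hypernym", "hate crime"), ("partner", "hypernym", "northumbria police")},
    {("crime", "hypernym", "hate crime")},
    {("crime", "hypernym", "hate crime"), ("resource", "hypernym", "goods")},
]
print(cultural_consensus(sample, min_agreement=2))
```

Relations specific to a single individual, such as (partner, hypernym, northumbria police), are filtered out, while widely shared ones survive as cultural knowledge.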

Ontologizing Concepts and Lexico-Semantic Relations
In our process, we used the "ontologizing" technique [ ]. After the consensus analysis, we mapped the concepts and hypernym/hyponym relations into RDF classes and RDFS sub-classes. The formalisation constitutes the last step of our process, which is summarized in the figure. It starts by selecting individuals based on shared social criteria. Then, web data about each individual of the sample are collected. These data are analysed through text-mining techniques to automatically elicit their respective personal knowledge (embodied in the conceptual structures). By quantifying the sharedness of individuals' personal knowledge, we are able to determine the cultural consensus. The latter analysis enables the production of a cultural representation. Finally, by ontologizing the conceptual structures, a formal cultural ontology is created. Having described the whole process to produce formal time-affordable cultural representations, the next section consists of experiments to assess its performance.
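The ontologizing mapping can be sketched as follows (the example.org namespace is hypothetical, and the triples are shown as plain tuples rather than serialised RDF/XML):

```python
BASE = "http://example.org/culture#"  # hypothetical namespace

def to_uri(label):
    """Turn a concept label into a URI under the (hypothetical) base namespace."""
    return BASE + label.strip().lower().replace(" ", "_")

def ontologize(relations):
    """Map (hypernym, 'hypernym', hyponym) tuples to RDF triples:
    the hyponym becomes an rdfs:subClassOf sub-class of the hypernym."""
    sub_class_of = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
    return [(to_uri(hypo), sub_class_of, to_uri(hyper))
            for hyper, _, hypo in relations]

triples = ontologize([("crime", "hypernym", "hate crime")])
print(triples[0])
```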

Experiments
The public safety domain was chosen for our experiments for two main reasons: the amount of available data and the current social context. After a description of the settings, we present and discuss the results associated with the three formal cultural representations we tried to produce.

Settings
We constituted three samples with culturally different police forces coming respectively from Australia, the United States and England (see the table). Considering agencies as individuals may not be the best choice to carry out our experiments. However, this decision was driven by the necessity of being able to collect large amounts of textual data about a single domain for a substantial number of 'individuals'. Table . Samples with their respective number of individuals.

Number of Individuals Australian Police Forces American State Police Forces English Police Forces
While collecting data from the web, the content of some websites could not be retrieved, due to robot protection or other factors. Therefore, we excluded these police forces from our samples.
After having retrieved the data, we preprocessed them. We cleaned the textual data and kept well-formed sentences with a length between and characters. We removed police forces having fewer than , sentences left; this threshold was used to discard the individuals possessing too few data. The table provides updated information about our samples. Appendix A provides the detailed list of individuals we had at the initial stage of the process and their respective samples. Details about the remaining police forces and the associated numbers of sentences are available in Appendix B.
Then, we quantified the nominals and extracted the lexico-semantic relations for each individual. For each sample, the nominals were ranked by their average position. We arbitrarily kept the top nominals and filtered the hypernym/hyponym relation candidates accordingly in order to create the various domain conceptualizations. At this point, we were able to produce cultural representations for the Australian, American and English police forces.

Evaluation
The evaluation of our experimental results relied on a semi-automatically constituted gold standard. Three gold standards were constituted with labeled lexico-semantic relations, one for each sample. Because all the police forces belong to Western culture, we were able to use WordNet [ ], which possesses a similar cultural bias, to automatically obtain assessments of the elicited lexico-semantic relations. Our raw results show an average precision of %. According to Cederberg and Widdows, the discrepancy in precision is mainly due to the difference of quality between the datasets. In fact, Hearst used Grolier's Encyclopedia, Maynard et al. used Wikipedia, and they themselves used the British National Corpus. In contrast, we are using sources of poorer quality, as our data came directly from website pages. We believe that this explains our lower initial precision.
We observed the potential cultural representations by increasing the number of agreements required. We expected highly consensual representations to have a higher precision but a lower relation coverage compared to mixed ones. Our hypothesis was that, to obtain the best cultural representations, it is necessary to properly manage this trade-off between precision and loss. We computed the loss at each agreement level. Based on these observations, we conclude that the main problem is the high loss. The loss can be explained by three factors. The first one concerns the high number of relations specific to individuals, such as (partner, hypernym, northumbria police); it does not constitute a problem as we are not interested in those. The second factor corresponds to the cultural domain: many extracted relations are related to but do not strictly belong to the public safety domain, like (resource, hypernym, goods). Similarly to the first factor, this loss does not matter. The third factor concerns the scarcity of the syntactic patterns enabling the extraction of the lexico-semantic relations. Their low recall means that the discovery of a relation in a corpus is a matter of luck. This last factor is truly problematic.
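One plausible way to quantify this precision/loss trade-off is sketched below (the loss definition here is illustrative, not the exact metric used in our experiments):

```python
def precision_loss(relation_counts, gold, threshold):
    """Precision and loss of the representation obtained by keeping
    relations shared by at least `threshold` individuals.

    relation_counts: {relation: number of individuals agreeing on it}
    gold: set of relations judged correct
    """
    kept = {r for r, c in relation_counts.items() if c >= threshold}
    precision = len(kept & gold) / len(kept) if kept else 0.0
    loss = 1 - len(kept & gold) / len(gold)  # share of gold relations dropped
    return precision, loss

counts = {"r1": 3, "r2": 2, "r3": 1, "r4": 1}
gold = {"r1", "r2", "r3"}
print(precision_loss(counts, gold, 2))  # high threshold: precise but lossy
print(precision_loss(counts, gold, 1))  # low threshold: complete but noisier
```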
This issue is directly linked to the knowledge elicitation technique used in our study. Indeed, lexico-semantic relation extraction relying on syntactic patterns can provide neither the quantity nor the quality required to properly support our process to produce cultural representations. In fact, no existing hypernym/hyponym relation mining technique using large corpora might be able to achieve this task, so we were expecting these results.
Nevertheless, with little effort we were able to produce a relevant partial cultural ontology for the English Police Forces, composed of hypernym/hyponym relations. We used Gephi (https://gephi.org/) to visualize the end result. In the figure, we focused on the concept 'crime'. We observe common hypernym/hyponym relations as well as an interesting contextual relation between 'hate crime' and 'issue'. Such relations are really meaningful in a cultural context. In fact, the focus on hate crimes by English police forces comes from the enforcement of the Equality Act. It also becomes obvious that many relations are missing, but we believe that this representation provides a coherent foundation to support further improvements.

Conclusion
Recall that our goal was to build time-affordable, emic, conceptually-sound and machine-readable cultural representations. We introduced a methodology coming from Cognitive Anthropology to build emic-based cultural conceptualisations. In addition, we explained their formalisation through Ontology Engineering. Then, we presented a process to produce the representations mostly automatically. Using lexico-semantic relation extraction, the best representations we can obtain with this technique are consensually limited, incomplete and contain some errors. However, in the future, these problems could be solved by using higher-quality elicitation techniques.
To date, culturally-intelligent systems are developed using etic-based cultural representations. While facilitating cross-cultural mediation, these coarse-grained representations are not fit for the development of systems requiring a deep understanding of cultural aspects [ ]. We believe that the production of fine-grained cultural ontologies, obtained through an emic approach, is a first step towards a new generation of artificial cultural awareness supporting these systems.