Big Data in the Public Sector. Linking Cities to Sensors

In the public sector, big data holds many promises for improving policy outcomes in terms of service delivery and decision-making and is starting to gain increased attention from governments. Cities are collecting large amounts of data from traditional sources such as registries and surveys and from non-traditional sources such as the Internet of Things, and are considered an important field of experimentation to generate public value with big data. The establishment of a city data infrastructure can drive such a development. This paper describes two key challenges for such an infrastructure, platform federation and data quality, and how these challenges are addressed in the ongoing research project CPaaS.io.


Introduction
The digitization of the economy and society becomes apparent in the many applications and devices that use and produce data. As businesses and individuals use available technological innovations to improve business and facilitate the demands of everyday life, governments around the world are struggling with how to best put these advancements to use in the public sector and create public value. The European Commission, for example, has acknowledged that "data has become an essential resource for economic growth, job creation and societal progress" [1] and is working on a policy and a framework for the free flow of data to reap the potential benefits and address challenges both in the technical as well as in the societal and legal fields. One of the primary difficulties lies in the diversity and the speed of the technological developments: the deployment of sensors delivers a multitude of new data sets, though sometimes with unreliable data; big data and machine learning are deployed for data analysis; and linked data and open government data approaches are used to make data more accessible to a wider clientele. The societal challenges that digitization will bring about manifest themselves first in the metropolitan, urban environment; hence the so-called "smart city" is an ideal field for experimentation to better understand and learn about the opportunities and potential pitfalls. While the term "smart city" is certainly hyped and many different activities are carried out under this label, two points are interesting when looking at cities that are generally regarded as pioneers in the field, e.g., Amsterdam, Barcelona, or Vienna. They all use a public-private partnership (PPP) model in order to bring together actors from different sectors and with different expertise and interests [2] [3], and the establishment of a platform for information exchange and data access is seen as a key enabler for an effective implementation of a smart city programme [4].
With the main goal of developing such a platform for smart city innovation, we launched the CPaaS.io project in 2016. In this 30-month research and innovation action between Europe and Japan, data from various sources are made accessible via a cloud-based platform to application developers and service providers. Data sets from open government data portals and other administrative or publicly available data can be linked with Internet of Things (IoT) data, e.g., data from sensors deployed in the communal infrastructure or from sensors worn by participants in a city event. The project is also developing several application use cases in the domains of event management, water management, and public transportation, and pilots these in partnering cities like Amsterdam, Sapporo and Tokyo. The aim of this paper is to discuss the challenges of implementing a smart city platform that relate to data management, and specifically to data quality management across various data sources, including IoT data. The use of this type of data is relatively new to governments and, irrespective of the usage context, the question of how to validate IoT data is still quite open. In order to be adopted, a city data infrastructure needs to provide information on data quality in the form of metadata and, as will be shown, linked data provides several advantages in that respect. At the current stage of the project, we provide generic considerations on the named challenges, based on selected research in the field. Thus, we do not account for potentially differing requirements depending on e.g. the size or smart city maturity (cf. [2]) of a city.
The remainder of this paper is structured as follows: in the following section we summarize the state of adoption of big data in the public sector, with a special focus on the context of smart cities and the usage of the Internet of Things. Section 3 provides an overview of two challenges that need to be addressed for the successful deployment of such a platform: improving data quality and facilitating platform federation. Section 4 then describes linked data as a solution approach to tackle these issues and describes the state of the art in the field, while Section 5 goes into more detail on how the solution is implemented in the context of the CPaaS.io project. Section 6 finally contains the conclusions and outlines future work.

Big Data in the Public Sector

Big Data Opportunities for the Public Sector and State of Adoption
Big data is about generating value through collecting and analysing information to extract knowledge and insight (cf. [5] [6]). Governments are increasingly aware that big data offers value potentials for the public sector [7]. As scholars point out, however, implementation so far tends to be limited ([7] [8]) or, as Desouza & Jacob put it, there is "some tension between the promise of Big Data and reality" [9]. Accordingly, big data in the public sector has only recently started to raise academic interest (cf. [10]), but is expected to gain more attention within big data research [11]. A first set of studies and reports looks at the public sector as a data producer for big data applications in other sectors (e.g. [12] [13]). Governments generate and collect large amounts of data through their everyday operations, and the public sector is thus one of the most data-intensive sectors. Since public sector sources comply with high quality standards, they are considered an essential resource for the data-driven economy, which is reflected in the many open government data (OGD) initiatives that seek to make this data available for re-use [14]. A second stream of research focuses explicitly on governments as big data users. This work includes cross-case studies on existing cases of big data implementation ([15] [16]), general considerations on the opportunities and challenges of big data adoption by the public sector (e.g. [17] [7] [18] [9]), considerations on the preconditions for using big data [8] and/or specific fields of application, such as policy-making (e.g. [19], also [15] [16]). Based on available research, potential benefits of big data adoption in the public sector can be categorized as follows [7]:
A first set of opportunities relates to improving the knowledge base. As in other sectors, data analysis is used for generating new insights. Big data analytics can be applied to various domains of public administration (cf. [18] [17] [16] [15]) and promises to improve all stages of policy-making [16]: Better and faster insights derived from big data analysis (e.g. through machine learning) may help to better react to unintended effects of a policy decision [19]. They may also help to detect mistakes, fraud or security threats earlier [20]. Policy-makers can also use big data technologies to conduct policy impact assessments or gain a better understanding of citizen interests and opinions through the analysis of new data sources, e.g. social media, helping them to prioritize policy issues ([7] [15]).
A second set of opportunities relates to improvements in effectiveness. Data analysis may be used to tailor service provisioning towards the needs of different citizen groups, increasing their satisfaction [18]. Better insights can also contribute to solving social problems related to public transportation, healthcare provision or energy production [8]. By providing its data as open data, the public sector may facilitate the innovation of products and services by third parties.
A third set of opportunities relates to improvements in efficiency. Big data can be used to achieve greater internal transparency and to improve data sharing across administrative organizations. Available estimates suggest that the public sector could generate considerable revenues through better exploitation of data ([12] [13]). Leveraging new data sources may also positively impact data generation by public administrations, e.g., when producing official statistics [15].

Smart Cities as Big Data Application Domain in the Public Sector
Depending on the application domain and the type of data generated, big data analysis in the public sector is closely related to the concept of smart cities [18] (for a definition see [21]). Cities are considered a distinct domain in which the use of ICT in general [22] and big data in particular are expected to generate impact ([23] [17] [7] [24] [25]). Thought leaders in the field expect that innovative examples of big data usage are more likely to be found at the city or regional level, since it is easier to get policy makers involved in small-scale initiatives and in new forms of collaboration and data usage [16]. Also, a mapping of smart cities in Europe reveals that there are more small than large smart cities, while larger cities have more resources and tend to be more ambitious in scope and more mature regarding implementation [2]. As several authors stress, the Internet of Things is an important data source in the smart city context: "The public sector is increasingly characterized by applications that rely on sensor measurements of physical phenomena such as traffic volumes, environmental pollution, usage levels of waste containers, location of municipal vehicles, or detection of abnormal behaviour" [7]. The analysis of such IoT data sources in combination with other data has the potential to improve urban management and the quality of life of city inhabitants: "Data from different sources need to be integrated and analyzed for smart urban planning, smart transportation, smart sanitation, smart crime prevention, etc" [17]. As Scuotto et al. point out, however, "the relationship between IoT and smart cities is still largely unexplored" [26], which requires more research, e.g. on typical technological challenges to be tackled (cf. [23]).

Technological Challenges of Implementing a Smart City Platform
While governments are considered catalysts for boosting a data-driven economy and growth through opening up their data, big data adoption in the public sector is also confronted with a range of constraints and challenges [7]. These are related to governance (e.g. agreements for integrating data sources across organizations, data-driven culture), implementation (e.g. organizational maturity in terms of IT facilities and data systems, required skills) and risk management (privacy, security) (cf. [18] [7] [17] [19] [8]). As Munné points out [7], it is important that the public sector gains "adoption momentum", moving from marketing around big data to real experience, to derive lessons learned on which applications are valuable and how to deploy them: "This requires the development of a standard set of big data solutions for the sector." The CPaaS.io project provides such a solution for the smart city context and supports experimentation and capability building. One of several challenges to be addressed relates to the federation of existing platforms. Another typical challenge relates to data management. A linked data approach is suited to address both the challenge of ensuring system interoperability and that of ensuring data interoperability (cf. [27]).

Federation of Smart City Platforms
To exploit the full potential of a big data strategy, it is not enough that cities just implement a big data platform on their own. Unfortunately, this approach is still common today, leading to data silo solutions lacking interoperability (cf. [17]). Cities, though, are not standalone entities: they are embedded in a region, in a country, and they often cooperate with other cities, today also on a global scale. Cities thus need to strive for interoperability of their platforms and the possibility to federate instances. This will enable data analysis across regions from which all participating cities can profit, for example by better understanding traffic patterns or by providing better services to an increasingly mobile population. Standards can help to achieve this, but often are not enough, as the adoption especially of data standards on a global scale is slow due to historic, legal and cultural differences.

Governance and Management of Data
In the age of big data, datasets become increasingly "complex", which requires adequate capabilities for managing the data [9]. With the growing need to integrate data from multiple sources, data quality management becomes both more important and more challenging [28]. In the context of developing city data infrastructures, data management and in particular the management of data quality, i.e., ensuring that data is fit for use and free of defects [29], are important aspects (cf. [30] [28]) and part of an organization's overall data governance [31] (see Fig. 1).

Fig. 1. Data governance and related concepts (adapted from [31])
Smart city platforms are used for making decisions and providing services based on the results of querying various datasets. This entails that applications need to be trusted and accepted, and data quality plays a major role in that respect (cf. [32] [33] [34]). For a city data infrastructure aimed at integrating IoT data, managing data quality is particularly crucial, as sensors are an inherently unreliable data source. Sensors can become decalibrated, delivering inaccurate data readings, or they can fail or lose connectivity completely. Resolution, sensitivity, timeliness and provenance are other factors affecting the validity of IoT data. As a requirement, data quality is well understood [35] and there are many methodologies to conduct data quality management (cf. [36]) as well as models and frameworks for assessing the quality of specific types of data, such as linked data [32], IoT data [33], open government data [37] or more generic big data [28]. What constitutes "good data quality", however, depends on the context of use and is thus very much application-dependent. A city data infrastructure aiming to support a multitude of possible applications must provide sufficient metadata about the data quality, while it is left to the application to decide if the data is good enough to be used. This requirement is also grasped by the emerging "Smart Data" paradigm, according to which successful big data implementation has "a clear meaning (semantics), measurable data quality, and security (including data privacy standards)" [35]. This entails making data more accessible through adding metadata for structuring and integration across separate data silos and for storing information on data quality, as well as benefitting from already available open and linked data.
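The division of labour described above, in which the platform publishes quality metadata and each application decides for itself whether the data is good enough, can be illustrated with a minimal Python sketch. The quality dimensions, field names and thresholds are illustrative assumptions and are not taken from the CPaaS.io platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityMetadata:
    """Quality annotations a platform might publish alongside a sensor data set.

    All field names are hypothetical; they are not taken from CPaaS.io.
    """
    completeness: float      # share of expected readings received, 0.0 to 1.0
    accuracy_error: float    # estimated measurement error, in the sensor's unit
    age_seconds: float       # time since the most recent reading (timeliness)
    provenance_known: bool   # whether the origin of the data is documented


@dataclass(frozen=True)
class QualityRequirement:
    """Thresholds that one particular application considers good enough."""
    min_completeness: float
    max_accuracy_error: float
    max_age_seconds: float
    require_provenance: bool = False


def is_fit_for_use(meta: QualityMetadata, req: QualityRequirement) -> bool:
    """Return True if the published metadata satisfies the application's thresholds."""
    return (
        meta.completeness >= req.min_completeness
        and meta.accuracy_error <= req.max_accuracy_error
        and meta.age_seconds <= req.max_age_seconds
        and (meta.provenance_known or not req.require_provenance)
    )


# Hypothetical metadata for one data set and two applications with different needs.
water_level = QualityMetadata(completeness=0.92, accuracy_error=0.05,
                              age_seconds=600, provenance_known=True)
trend_dashboard = QualityRequirement(min_completeness=0.8, max_accuracy_error=0.1,
                                     max_age_seconds=3600)
flood_alerting = QualityRequirement(min_completeness=0.99, max_accuracy_error=0.02,
                                    max_age_seconds=60, require_provenance=True)

dashboard_ok = is_fit_for_use(water_level, trend_dashboard)
alerting_ok = is_fit_for_use(water_level, flood_alerting)
```

In this sketch the same data set is accepted by the dashboard and rejected by the alerting application, which is the sense in which data quality is application-dependent.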

Linked Data as Solution Mechanism
In the context of big data, linked data is both a specific type of data source and an approach for facilitating data integration and re-usage through providing clear meaning. This is essential, since only through understanding the context sensitive meaning of data can one assess whether data can be combined to generate value [27]. As Shiri points out: "the formalized, structured and organized nature of linked data and its specific applications, such as linked controlled vocabularies and knowledge organization systems, have the potential to provide a solid semantic foundation for the classification, representation, visualization and the organized presentation of big data" [38].
As the cross-case study on big data adoption in policy-making shows [16], public administrations use a variety of data sources from administrative data, official statistics, surveys, sensors and social media. The data used may be either open or restricted. These siloed data sources are typically accessed over platform-proprietary APIs.
To gain new insight, it is vital to fuse data from these different sources. This can be done by transforming the data into a more generic form, which is more accessible, and by providing standardized APIs on top of it. Such a generic API needs to provide a common way to exchange information between these sources and help the API consumer to understand the semantics and the meaning of the information. The W3C semantic web and linked data technology stack [24] aims at solving these problems. The RDF data model provides well-known schemas and ontologies as lingua franca, HTTP as transport layer, URIs as decentralized identifiers and multilingualism at its core. This makes it the data model of choice for bridging between data silos (cf. [27]).
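The idea of bridging silos through shared URIs can be sketched in plain Python. This is a deliberately simplified illustration, not an RDF implementation: literals are not distinguished from URIs, and the two data sources, their field names and the `http://example.org/` identifiers are hypothetical. Only the schema.org namespace refers to a real vocabulary.

```python
from typing import Iterable, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


SCHEMA = "http://schema.org/"      # real vocabulary namespace
EX = "http://example.org/id/"      # hypothetical namespace for this sketch


def from_registry(rows: Iterable[dict]) -> set:
    """Map rows of a (hypothetical) administrative registry to triples."""
    return {Triple(EX + r["station_id"], SCHEMA + "name", r["name"]) for r in rows}


def from_sensor_feed(readings: Iterable[dict]) -> set:
    """Map readings of a (hypothetical) IoT feed to triples."""
    return {Triple(EX + r["station"], EX + "passengerCount", str(r["count"]))
            for r in readings}


def match(graph: set, s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None) -> list:
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(t for t in graph
                  if (s is None or t.subject == s)
                  and (p is None or t.predicate == p)
                  and (o is None or t.obj == o))


# Because both sources identify the station by the same URI, a set union fuses them.
graph = (from_registry([{"station_id": "st1", "name": "Central Station"}])
         | from_sensor_feed([{"station": "st1", "count": 412}]))
about_st1 = match(graph, s=EX + "st1")
```

The fusion step is a plain set union; no source-specific join logic is needed once both silos agree on identifiers and vocabulary.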
In the past few years, a lot of effort went into publishing best practices. In 2016, W3C released the "Data on the Web Best Practices" recommendation [38]: after roughly 10 years of the open data movement [39], the document summarizes best practices and recommendations on how to publish open data, especially regarding what needs to be taken into consideration to ensure that the published data is of maximum value for the public. Most of the recommendations are related to machine readability and discoverability of open data.
In the domain of schemas and ontologies, several search engine giants launched schema.org [39], an initiative to "create and support a common set of schemas for structured data markup on web pages". Meanwhile, schema.org seems to use "a simple RDF-like graph data model" and exposes its schema as embedded RDF. Over the past years schema.org had a huge impact; many sites started to include structured information within their websites, and the support of first RDFa and later JSON-LD made people use semantic web technologies without being really aware of it. This increases visibility and perception of the semantic web as a whole. Developments that are still work in progress revolve around constraint languages; examples are the Shapes Constraint Language (SHACL) and Shape Expressions (ShEx). RDF is a graph data model and by design it is possible to express any relationship between a subject and an object. In real-world applications, it is often necessary to define structural constraints and validate RDF instance data against those. This can be done with both of the languages.
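What validating instance data against structural constraints means can be shown with a small Python sketch. It is neither SHACL nor ShEx syntax, only a plain-Python analogue of a cardinality constraint on a class of resources; the `http://example.org/` namespace, the `Sensor` class and the `unit` property are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace


@dataclass(frozen=True)
class PropertyConstraint:
    """Cardinality bounds for one predicate (a loose analogue of min/max count)."""
    predicate: str
    min_count: int = 0
    max_count: Optional[int] = None


def validate(triples, target_class, constraints):
    """Check every instance of target_class against the constraints.

    Returns a list of (subject, predicate, actual_count) violations.
    """
    instances = sorted({s for s, p, o in triples
                        if p == RDF_TYPE and o == target_class})
    counts = Counter((s, p) for s, p, _ in triples)
    violations = []
    for subject in instances:
        for c in constraints:
            n = counts[(subject, c.predicate)]
            too_few = n < c.min_count
            too_many = c.max_count is not None and n > c.max_count
            if too_few or too_many:
                violations.append((subject, c.predicate, n))
    return violations


data = [
    (EX + "sensor1", RDF_TYPE, EX + "Sensor"),
    (EX + "sensor1", EX + "unit", "Celsius"),
    (EX + "sensor2", RDF_TYPE, EX + "Sensor"),  # no unit given: violates the shape
]
sensor_shape = [PropertyConstraint(EX + "unit", min_count=1, max_count=1)]
violations = validate(data, EX + "Sensor", sensor_shape)
```

Here the shape requires exactly one unit per sensor, so only the second sensor is reported.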
In the domain of IoT related ontologies there are even more options available and under discussion. Several groups are pushing their own concepts and ontologies, among others: Spatial Data on the Web Working Group's SSN Ontology, IoT + schema.org, Web of Things (WoT) Interest Group and the EU H2020's FIESTA-IoT project. It is too early to tell yet which of the proposed constraint languages and IoT ontologies will see wider adoption in the next years.

Implementation in the CPaaS.io Project
The smart city platform as developed by the CPaaS.io project is based on a common reference architecture, but is, for pragmatic reasons (mainly in order to have instances up and running quickly in the two main regions of the project), implemented on top of existing frameworks (see Fig. 2).

Fig. 2. Simplified CPaaS.io implementation architecture
The implementation in Europe is based on FIWARE [41], and the one in Japan on the u2 architecture [42]. The disadvantage of having two different platform implementations within the project is that data federation across instances becomes more challenging. However, in real life it cannot be expected that all cities will use the same platform implementation anyway, so the two implementations within the project serve as a test regarding the real-world viability of the platform. For example, in the domain of public transportation, it is important that innovations developed in one city can easily be transferred to another city.
The CPaaS.io Platform aims at fusing data from different sources, in particular the FIWARE platform in Europe and the u2 platform in Japan. From a data consumer perspective, it should not matter where the data is stored; CPaaS.io will facilitate discovery and access to information in these data storages via generic APIs. For that reason, CPaaS.io will use linked data and RDF to facilitate integration of data from platforms like FIWARE and u2. Neither of the two platforms currently supports RDF and linked data out of the box. CPaaS.io will integrate a semantic layer that enables mapping existing data to RDF. This semantic integration layer can be implemented in different phases and levels.
Initially the semantic layer will simply expose metadata as linked data, using common vocabularies and best practices as described in [38]. This enables users to query information about available data within the FIWARE and u2 platform as linked data. Access to this metadata layer will be done by providing a SPARQL endpoint that can be queried.
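A metadata record of the kind described above might look as follows. This sketch assumes the W3C DCAT and Data Quality Vocabulary (DQV) as the "common vocabularies"; the paper does not state which vocabularies CPaaS.io actually uses, and the data set URI, platform URI and metric URI are hypothetical. The SPARQL query is given only as a string to show what a client could send to the endpoint.

```python
import json


def dataset_metadata(dataset_uri, title, platform_uri, completeness):
    """Build a JSON-LD description of one data set, including a quality measurement."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
            "dqv": "http://www.w3.org/ns/dqv#",
        },
        "@id": dataset_uri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:publisher": {"@id": platform_uri},
        "dqv:hasQualityMeasurement": {
            "@type": "dqv:QualityMeasurement",
            # Hypothetical metric URI; a real deployment would define its own metrics.
            "dqv:isMeasurementOf": {"@id": "http://example.org/metrics/completeness"},
            "dqv:value": completeness,
        },
    }


# Query a client could send to the metadata SPARQL endpoint to list data sets.
LIST_DATASETS_QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?dataset ?title WHERE {
  ?dataset a dcat:Dataset ;
           dct:title ?title .
}
"""

doc = dataset_metadata(
    "http://example.org/datasets/water-level",
    "Water level sensors",
    "http://example.org/platforms/fiware-instance",
    0.92,
)
serialized = json.dumps(doc, indent=2)
```

Attaching the quality measurement to the data set description is what allows an application to inspect quality before it requests any of the underlying data.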
In a second step, data residing in FIWARE or u2 is mapped to RDF by extending the respective data model of each platform. In the case of FIWARE, this can be done by using the new NGSIv2 data model that supports JSON-LD representations. By providing appropriate tools and user interfaces, FIWARE users can thus map existing data to RDF representations. The ucode data model of u2 is close to RDF as it stores information in a triple-like data model. The semantic integration layer only needs to map internal ucode IDs to publicly used and dereferenceable URIs, preferably HTTP URIs to allow linked data usage, as has already been done for the Tokyo Metro real-time data system.
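The ID mapping for u2 could be as simple as the following Python sketch. It assumes that a ucode is a 128-bit identifier written as 32 hexadecimal digits, optionally grouped with hyphens or spaces, and that triple elements which are not ucodes are literal values. The base URI and the sample identifiers are hypothetical and do not reflect the URI scheme used by any real u2 deployment.

```python
import re

# Hypothetical base URI; a real deployment would use a domain it controls.
BASE_URI = "http://example.org/ucode/"
_UCODE_RE = re.compile(r"^[0-9A-F]{32}$")


def _normalize(value: str) -> str:
    """Strip grouping characters and upper-case the hexadecimal digits."""
    return value.replace("-", "").replace(" ", "").upper()


def is_ucode(value: str) -> bool:
    """Return True if value looks like a 128-bit ucode in hexadecimal notation."""
    return bool(_UCODE_RE.match(_normalize(value)))


def ucode_to_uri(ucode: str, base: str = BASE_URI) -> str:
    """Map a ucode to an HTTP URI under the given base."""
    if not is_ucode(ucode):
        raise ValueError(f"not a 128-bit hexadecimal ucode: {ucode!r}")
    return base + _normalize(ucode)


def map_triple(triple, base: str = BASE_URI):
    """Map a ucode triple to URIs; objects that are not ucodes stay literal values."""
    s, p, o = triple
    mapped_object = ucode_to_uri(o, base) if is_ucode(o) else o
    return (ucode_to_uri(s, base), ucode_to_uri(p, base), mapped_object)


# Sample identifiers invented for this sketch.
station = "00001c00-0000-0000-0000-000000000001"
has_name = "00001c00-0000-0000-0000-0000000000a0"
mapped = map_triple((station, has_name, "Ginza"))
```

For the resulting URIs to be dereferenceable, a web server at the chosen base would additionally have to return a description of the identified resource; that part is outside this sketch.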
To be able to query this kind of data, a SPARQL endpoint will proxy requests to the platform. CPaaS.io will provide a virtual-graph feature similar to what RDF graph databases provide to access relational data, using W3C standards like R2RML. Users will thus be able to run SPARQL queries on data residing in FIWARE or u2. In a final step, FIWARE and u2 will each implement their own SPARQL endpoint.
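The virtual-graph idea, answering graph queries by translating native platform data on demand rather than copying it into a triple store, can be sketched as follows. This is not R2RML or SPARQL: the class, the attribute-to-predicate mapping, the URIs and the stand-in for the platform API are all hypothetical, and only single triple patterns are supported.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Triple = Tuple[str, str, object]


class VirtualGraph:
    """Answers triple patterns by translating platform entities on demand.

    Nothing is stored as triples; every query calls the platform again.
    """

    def __init__(self, fetch_entities: Callable[[], List[Dict]],
                 predicate_map: Dict[str, str], base: str):
        self._fetch = fetch_entities
        self._predicates = predicate_map   # attribute name -> predicate URI
        self._base = base

    def _triples(self) -> Iterator[Triple]:
        for entity in self._fetch():
            subject = self._base + entity["id"]
            for attr, predicate in self._predicates.items():
                if attr in entity:
                    yield (subject, predicate, entity[attr])

    def match(self, s=None, p=None, o=None) -> List[Triple]:
        """Return all triples matching the pattern; None acts as a wildcard."""
        return [t for t in self._triples()
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]


def fetch_entities() -> List[Dict]:
    """Stand-in for a call to the platform's native API (hypothetical data)."""
    return [{"id": "room1", "temperature": 21.5},
            {"id": "room2", "temperature": 19.0}]


TEMPERATURE = "http://example.org/temperature"
graph = VirtualGraph(fetch_entities, {"temperature": TEMPERATURE},
                     "http://example.org/entity/")
result = graph.match(p=TEMPERATURE)
```

A proxying SPARQL endpoint would do the same kind of translation for full queries, which is why the mapping definition, rather than the storage, is the part that needs to be standardized.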

Conclusions and Outlook
While the public sector can be considered one of the most data-intensive sectors, actual use of big data in this sector is still rather limited. With the increased deployment of Internet of Things technologies and the international competition for cities to become smart, however, this is likely going to change. A lot of experimentation is still going on in this area to understand both the technologies as well as the applications that create real public value. To reap the potential benefits, cities will need an open city data infrastructure, where third parties can access the relevant city data, including data coming from the Internet of Things, and provide additional services on top. The platform that the CPaaS.io project is developing could serve as the basis for such an infrastructure if the two crucial issues that we highlighted in this paper are addressed: the ability to federate platform instances, and data quality. Linked data can serve as a possible mechanism to address both. The semantics behind linked data allow combining differently structured data from technically different platforms. And linked data can be used to annotate data sets with quality parameters so that an application using that data can decide if the data quality is good enough for the intended purpose. It is thus a fruitful approach for reaching the "smart data" paradigm.
Using linked data requires adequate vocabularies both for data integration into CPaaS.io and for re-usage by use case applications as well as for data dimensions and measures, accounting for the different types of data used in the project. Standards for such vocabularies and for validating data are still emerging; at this point in time none of these is well accepted yet. In the further course of the project, we will have to define which of the emerging vocabulary standards are suitable for the project and its use cases, and where we need to define our own. Furthermore, we plan to validate the applicability and the value of linked data, as well as the platform as a whole, in real-world use case implementations in European and Japanese cities. Both the relationship between IoT and smart cities and the adoption of big data in the public sector in general require more research based on real applications. The CPaaS.io project will contribute to gaining new insights in these emerging research fields.