Open Data Landscape: A Global Perspective and a Focus on China

Governments are producing significant public data that, if made open, is expected to create enormous social and commercial value as well as improve civil governance. Unleashing the true power of open public data requires a much better understanding of its ecosystem than is currently available. This paper surveys the global open data landscape using the Open Data Barometer (ODB) ranking system and its three sub-indexes: readiness, implementation and impact. These indexes are compared and analyzed on the basis of the income levels of the ODB-ranked countries. Finally, using air quality open data, data availability in developing countries like China is compared with that of countries with better practices, such as the UK and the US. The comparison helps in understanding the current situation and the barriers to opening data in China.


Introduction
Open data is data that can be used, re-used and distributed amongst the people without any legal, technological or social restriction [1]. It is, therefore, becoming a philosophy where data is accessible to the public for free. This approach induces a sense of accountability and transparency by building a bridge between the people and the government/organizations. Furthermore, it seeks to move beyond transparency, towards a problem solving platform in which open data can become a stepping stone to: drive more effective decision-making and efficient service delivery; spur economic activity; and empower citizens to take an active role in improving their own communities [2].
Researchers have looked at open data from many different perspectives. Some focus on the relevant initiatives in individual countries [3,4] or cities [5]. Others have revealed the political, social and economic impact of open data [6,7]. Another interesting topic concerns the business aspect of open data. Hartmann et al. [8] looked at the types of business models among companies relying on data as their key business resource, and discussed capturing value through data-driven business models. Magalhaes and Manley [9] examined 500 US firms that use Open Government Data (OGD) and classified them into three categories of business models: enablers, facilitators and integrators. Success stories from many open data practitioners, both companies and governments, have also been presented [2,10]. In addition, integrating OGD into the Web of Linked Data has been investigated [11], where Linked Data describes a method of publishing structured data such that it becomes more useful through semantic queries.

Research Motivation and Approach
The motivation behind this research is to tackle urban challenges from an open data perspective. According to the UN [12], the percentage of the population residing in urban areas is expected to increase from 30% in 1950 to 66% in 2050. Among the four groups of countries with different income levels [13], the countries that have experienced the fastest pace of urbanization since 1950 are upper-middle-income countries such as Brazil, China and Mexico. Recently, the Chinese government released a plan for the integration of the Beijing-Tianjin-Hebei (Jing-Jin-Ji) regions, which together form the largest mega-region, covering 216,000 sq. km and affecting more than 100 million people [14]. This unprecedented urbanization poses ever-increasing sustainable development challenges to cities and the newly urbanized population.
Our research on Open Data for Sustainable Urbanization (ODSU) aims to tackle the urban challenges of these developing countries with a focus on China. In particular, we want to understand whether and how the open data ecosystem can play a role in a more sustainable and efficient urbanization process. This research is planned in three phases. In the first phase, the global open data landscape is surveyed using secondary data. On the global scale, countries are categorized into four income levels; at the country scale, we focus particularly on China, the US and the UK. In the ongoing second phase, we are collecting and analyzing an extensive list of open data urban applications in the world's major cities, identifying best practices and evaluating their impact on urbanization efficiency. In the third phase, we plan to propose and implement appropriate open data use cases in China's Jing-Jin-Ji area as a pilot study. This paper covers only the first phase of the planned research: it surveys the global landscape of open data using the Open Data Barometer (ODB) ranking index and its associated sub-indexes, namely readiness, implementation and impact [15]. An analysis is performed to understand the relationship between these indexes and the income levels of the ODB-ranked countries. Finally, to assess the current open data situation in China, a comparison is provided with the trendsetters in the ODB ranking index, i.e. the UK and the US.

Choice of ODB
[...] [18], to name a few global and widely used benchmarks. However, each of these benchmarks serves a different purpose and focus. Susha et al. [19] suggest that the ODB provides a more comprehensive perspective, since it not only includes measures at various stages (readiness, implementation and impact) but also highlights the importance of involving major stakeholders and the challenges throughout the open data process.
According to the authors [19], the ODB offers an insightful analysis of the entire chain (readiness, implementation, and impacts) and is a goal-oriented measure that can be used to realize how to modify implementation so as to accomplish a particular impact (economic, social, or political). However, the authors also suggest that most open data benchmarks (except for ODRA of the World Bank) produce results that are generic and ambiguous and the ranks of countries should not be expected to convey a strictly numeric position of a country but rather an approximation of reality. They specifically consider ODB more argumentative when it comes to open data diffusion.
For this particular research ODB is selected since it offers a snapshot of open data diffusion worldwide. Moreover, the research objective is to see the role of open data when dealing with unprecedented urbanization challenges particularly in developing countries like China. ODB, in addition to readiness, also offers perspective on the implementation and impact stages, which are considered useful when it comes to understanding the urban applications in developed countries and to develop guidelines for countries like China.
Furthermore, UK and US have been identified for comparison with China in this research since China is an example of the upper-middle-income countries with the fastest urbanization rate while US and UK are two high-income countries with high urbanization rate. Also, these two countries have been consistently highlighted as trendsetters by all major open data indexes. UK is ranked highest by the ODB [15], the ODI [17], the PSI Scoreboard [20] and identified as one of the trendsetters by Open Data Economy [18]. The US, in addition to being ranked 2nd by ODB and 8th by ODI, ranks highest in terms of data availability and data portal usability [18].

Open Data Overview
The Open Data Barometer (ODB) ranking system is part of the World Wide Web Foundation's work on common assessment methods for open data [15]. The weightage of each sub-index is given in Table 1. Using the ODB scores, heat maps are developed and presented (Fig. 1-3), where each country's colour depicts whether its sub-index is high, moderate or low. The heat maps divide the total number of countries equally into three layers to allow proper comparison. The lightest layer represents the countries in the bottom third of the index ranking, the moderate layer those in the middle third, and the darkest layer those in the top third. As seen on the heat map (Fig. 1), North America, a large part of Europe, Australia, Japan and South Korea have strong readiness sub-indexes. This means that these countries have strong government open data initiatives along with entrepreneur, business and citizen participation. Plotting the implementation sub-index scores of all 86 countries on a heat map shows that a few more countries fall in the high range, including Russia, Chile and Brazil (Fig. 2). These countries have a high implementation sub-index even though they do not have a high readiness sub-index, which leads us to question the dependency between these two sub-indexes. Finally, for the impact sub-index (Fig. 3), the only countries with a high score are those in Europe, the US, Canada and New Zealand. A general observation is that many countries have a lower impact sub-index than their other two sub-indexes, which calls into question the impact that open data has on the political, economic and social standing of these countries. Another observation from the heat maps is that the sub-indexes need not show complete dependency on one another.
The following examples reinforce this interpretation:
- Brazil has a moderate readiness sub-index, a high implementation sub-index and a low impact sub-index;
- Australia has strong readiness and implementation sub-indexes but a weaker impact sub-index;
- China and India, along with some other Asian countries, have decent readiness and implementation sub-indexes but low impact sub-indexes;
- Russia, Ecuador and Chile have moderate readiness
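The three-layer split behind the heat maps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' tooling: the function name `tercile_layers` and the sample scores are hypothetical, and group sizes that are not divisible by three are resolved purely by rank position (86 countries would give layers of 29, 29 and 28, from lowest to highest).

```python
def tercile_layers(scores):
    """Assign each country to the bottom, middle or top third by score rank."""
    labels = ("low", "moderate", "high")
    ranked = sorted(scores, key=scores.get)  # ascending: weakest first
    n = len(ranked)
    # Integer arithmetic maps rank position i to layer 0, 1 or 2.
    return {country: labels[i * 3 // n] for i, country in enumerate(ranked)}


# Hypothetical sub-index scores for six placeholder countries.
sample_scores = {"A": 12, "B": 95, "C": 47, "D": 30, "E": 71, "F": 5}
layers = tercile_layers(sample_scores)
```

Because the layers are defined by rank rather than by fixed score thresholds, a country's colour reflects its position relative to the other ranked countries, not an absolute level of openness.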

Open Data overview based on Income
The 86 countries listed in the ODB ranking system are divided into four income categories as per the World Bank [13]: low, lower-middle, upper-middle and high. Heat maps of the ODB scores of all countries in the four categories are developed and presented (Fig. 4-7). The low-income group consists of 16 of the 86 ODB-ranked countries. The average ODB score of this group is 11.69. As can be seen from the heat map (Fig. 4), most countries in this category have a low ODB score, with a few exceptions in Africa and the Indian subcontinent. The lower-middle-income group (Fig. 5) consists of 14 countries with an average ODB score of 17.66. As seen from the heat maps (Fig. 4-5), the low and lower-middle income categories consist mostly of countries from Asia and Africa.
The upper-middle-income layer comprises 21 countries with an average ODB score of 28.57. From the heat map (Fig. 6), it can be seen that the countries of this group in Asia and Africa generally have a lower ODB score than those in South America and North America. The high-income group consists of 35 countries with an average ODB score of 57.14. An interesting observation is that not all countries in the EU are dark blue on the map. Even though the EU in general has a high open data standing [15], these practices are not standardized across the region. Furthermore, the high ODB scores of countries such as the US and UK skew the overall data of this income class. Figure 8 compares the lowest, average and highest ODB scores for each income group. The graph shows that the most significant rise is from the upper-middle layer to the high layer: the lowest, average and highest bars increase by around 50%. Another interesting trend is observed when the low to lower-middle and lower-middle to upper-middle layers are compared. Here the average and highest values increase by around 30%, which is also a significant figure. Therefore, on a general note, it can be concluded that as income increases, the ODB scores of the countries in that category are also likely to increase. Because this graph covers only the overall ODB score, it is worth examining the three sub-indexes separately to understand the sharp rise from the upper-middle to the high income class.
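The tier-to-tier changes in the average ODB score can be recomputed from the four group averages reported above. The text does not state which formula was used, so the sketch below (with the hypothetical helper `tier_changes`) computes the difference under two conventions: relative to the lower tier and relative to the higher tier. As an observation rather than a confirmed reading, the second convention gives about 34%, 38% and 50% for the three jumps, which is close to the "around 30%" and "around 50%" figures quoted, whereas the first gives about 51%, 62% and 100%.

```python
# Average ODB scores per income group, as reported in the text.
avg_odb = {
    "low": 11.69,
    "lower-middle": 17.66,
    "upper-middle": 28.57,
    "high": 57.14,
}


def tier_changes(averages):
    """Return (from, to, % of lower value, % of higher value) per tier jump."""
    tiers = list(averages)  # dicts preserve insertion order
    changes = []
    for lower, higher in zip(tiers, tiers[1:]):
        a, b = averages[lower], averages[higher]
        diff = b - a
        changes.append((lower, higher, 100 * diff / a, 100 * diff / b))
    return changes


for lower, higher, pct_of_lower, pct_of_higher in tier_changes(avg_odb):
    print(f"{lower} -> {higher}: "
          f"{pct_of_lower:.1f}% of lower, {pct_of_higher:.1f}% of higher")
```

Under either convention the jump from upper-middle to high income is the largest of the three, which is the qualitative point the comparison relies on.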
Next, the three sub-indexes are analyzed for individual trends. The first, the readiness sub-index (Fig. 9), increases uniformly across all income classes for the lowest, average and highest values. As can be seen from the graph in Figure 9, most changes are in the range of 33-47%. The average rise across income levels is also considerably more uniform than in the overall ODB index analysis. Therefore, it can be concluded that the sharp rise from the upper-middle to the high class in the overall ODB scores does not depend on the readiness sub-index. The graph for the implementation sub-index (Fig. 10) shows that the lowest, average and highest values increase across all income categories, but not uniformly. In fact, the rise from the upper-middle class to the high class appears very similar to the increase seen in the overall ODB analysis earlier. Since the implementation sub-index accounts for 50% of the entire ODB index, it can be concluded that this jump plays an important role in the upsurge seen in the overall ODB analysis. Moreover, Figure 10 shows that the rise in the lower income categories does not follow the same trend as the overall ODB score and the readiness sub-index (both of which increase steadily across all levels). In fact, the percentage increase of the implementation sub-index actually decreases for the lower-middle category. The graph also shows that some countries, despite their reasonably high income, have not implemented open data as efficiently as one would expect. In the high-income group, 9 countries have an implementation sub-index below 40, which is about 25% of the countries in the class.
The impact sub-index (Fig. 11), on the other hand, is very different from the other two sub-indexes. Every income class has at least one country with a zero sub-index. This fact, along with the minimal averages, leads one to believe that this is by far the weakest index. Although countries show reasonable readiness and implementation sub-indexes, their impact sub-index is below par, suggesting that more open data initiatives are needed in the economic, social and political sectors. The only similarity with the other two sub-indexes is the surge from the upper-middle income class to the high income class. Although the impact sub-index accounts for only 25% of a country's total ODB score, its rise plays a considerable role in the jump seen in the overall ODB analysis because the percentage increase is extremely large. Therefore, it can be concluded that the implementation and impact sub-indexes are responsible for the increase in the overall ODB scores from the upper-middle income class to the high income class.
To understand the overall landscape of the ODB-ranked countries along with the three sub-indexes, a line graph (Fig. 12) is developed. Here, the overall ODB country rank, 86 in total, is plotted on the x-axis, whereas the y-axis shows the sub-index score. Hence, each point on the x-axis represents a country's ODB rank, and the three corresponding coloured points give its respective sub-index scores. The graph shows that countries have higher ODB ranks due to better performing sub-indexes. The line graph also reveals certain anomalies. For example, Chile has an overall ODB rank of 15, with a readiness sub-index of 69, an implementation sub-index of 73 and an impact sub-index of 8. Here the implementation sub-index is higher than the readiness sub-index.
In fact, the impact sub-index is very low for a country that is part of the high-income layer and has a reasonably high ODB rank. Open data has not had a noticeable impact on government efficiency, social policies and the economy, mostly due to the lack of government initiatives and low entrepreneurial activity in the country.
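To illustrate why a country can rank well despite a very weak impact sub-index, the sketch below combines Chile's three sub-index scores using the weights mentioned in the text (implementation 50%, impact 25%); the 25% readiness weight is inferred as the remainder and is an assumption here. The function name `weighted_composite` is hypothetical, and the plain weighted sum is a simplification: it is not claimed to reproduce the published ODB score, whose calculation may involve additional normalization and scaling steps.

```python
# Sub-index weights: implementation and impact are stated in the text;
# the readiness weight is inferred as the remainder (assumption).
WEIGHTS = {"readiness": 0.25, "implementation": 0.50, "impact": 0.25}


def weighted_composite(sub_scores, weights=WEIGHTS):
    """Plain weighted sum of sub-index scores (illustrative simplification)."""
    return sum(weights[name] * sub_scores[name] for name in weights)


# Chile's sub-index scores as quoted in the text.
chile = {"readiness": 69, "implementation": 73, "impact": 8}

composite = weighted_composite(chile)
contributions = {name: WEIGHTS[name] * chile[name] for name in WEIGHTS}
print(composite)       # weighted sum of the three sub-indexes
print(contributions)   # points contributed by each sub-index
```

In this simplified view, implementation alone contributes 36.5 of the 55.75 composite points, while impact contributes only 2, which is consistent with the observation that a strong implementation score can mask a near-absent impact.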
A similar example is Brazil, with an overall ODB rank of 21, a moderate readiness sub-index, a high implementation sub-index, but a very low impact sub-index compared with other countries in a similar range of ODB rank. Even though the openness in Brazil's 2013 ODB results is considerable for categories such as census, government spending and international trade, none of these categories actually adheres to the full open data standards, giving it an overall moderate readiness factor. It has also been observed that although the Brazilian government has a number of open data policies in place, the policies do not pay much attention to the actual user perspective or to overcoming the impediments to the use of open data [21]. This means that, in order to improve the impact sub-index, the policies must be refined so that only fully open data is released, benefiting civil society at large.
We also looked at the percentage difference between lower and higher income tiers in terms of income versus ODB values. The World Bank classification of the four income categories is based on GNI per capita [13]. Therefore, Table 2 lists the percentage differences of the average values of GNI, GNI per capita and ODB. In the bottom tier jump (low to lower-middle) and the middle tier jump (lower-middle to upper-middle), the ODB value increase falls significantly behind the GNI increase. For example, at the low to lower-middle jump, the average ODB percentage increase is only half of the average GNI per capita increase, and nearly one third of the average GNI increase. At the top tier, however, the numbers are much more consistent. In the jump from the upper-middle to the high class, the average ODB percentage increase is nearly the same as that of the average GNI, and is over 60% of the average GNI per capita increase. These numbers suggest that higher income countries generally invest more effort in open data, so that their open data status is more likely to match their income level. A more solid conclusion would require future work that looks in more detail at the breadth and depth of various income and ODB parameters.

UK, US and China Comparison
After reviewing the global open data landscape, this paper compares highly ranked ODB countries, the UK and the US, with China in more depth. China has been chosen for this analysis as a representative of the developing world. Taking into consideration its size, population and gross domestic product, it should indeed be releasing vast amounts of data, contributing to society and making use of areas such as machine learning and business intelligence. However, government initiatives for open data pose a major challenge to this contribution. This section analyses the current open data situation of the country and the barriers that it needs to overcome.

ODB sub-indexes and their relationship
As observed previously, there is a substantial jump in the average ODB scores from the upper-middle income category to the high income category. In the previous section, it was concluded that this is mainly due to the implementation and impact sub-indexes. In this section, the overall ODB scores and the readiness, implementation and impact sub-indexes of the UK, the US and China are compared (Fig. 13). The US and UK can be considered examples of the high income layer, and China is an example of the upper-middle income layer. The UK and US are ranked first and second respectively in the ODB rankings of 2014 [15]. They rank so high largely because of legislation in their respective countries [22]. In recent years, these governments have launched a number of initiatives targeting the health, energy, climate, education, finance, public safety and development sectors, thereby strengthening open data [22].
China, on the other hand, ranks 46th in the ODB ranking. Although its readiness sub-index is roughly half that of the other two countries, it lags much further behind in the implementation and impact sub-indexes, which together account for 75% of the overall ODB weightage. Therefore, China needs to work on factors such as making datasets fully open data compliant in the fields of innovation, social policy and accountability, implementing strong open data legislation, and maximizing impact in fields of political, social and economic importance.
Footnote 1: 11 countries without updated GNI values are not considered.

National Data Portal
Both the US [23] and the UK [24] have created national data portals, data.gov and data.gov.uk, which have released 14,008 and 22,385 datasets respectively. The most common machine-readable formats for the US and UK are XML and CSV, while the popular non-machine-readable formats for the two are HTML and PDF respectively. In addition, the US offers a significant number of datasets in zip format compared to the UK.
Unlike the US and UK, China does not have a national data portal yet. However, certain open data is available through different agencies. One example is the National Bureau of Statistics of China (NBSC), which offers both a Chinese and an English version. The Chinese version is organized into monthly, quarterly, annual, regional, international and census data. The English version consists of only four categories, i.e. monthly, quarterly, annual and regional data. All data in the English version is in machine-readable format. However, the same does not apply to the Chinese version, as it is only available in HTML format.

Dataset Example -Air Quality Index
In this section, we look at an example dataset common to all three countries: air quality metrics, which measure daily air pollutant levels. The data is from the US Environment Protection Agency [25] [...] China's air quality metrics, on the other hand, have been reported using the Air Pollution Index (API) since 2000. Figure 14 shows that in 2013 the number of reporting cities as well as the number of API datasets fell dramatically due to the transition from API to the Air Quality Index (AQI). However, there is an obvious improvement afterwards: the number of cities with available data rose from 120 with API data in 2012 to 335 with AQI data in 2014.
Following the ODB measurement methodology and weightage mechanism [15], we calculated the ODB score of the AQI data for the three countries. Out of a full score of 100, the UK obtains 95, with the only limitation being the absence of linked data URLs. The US and China receive scores of 80 and 50 respectively; the main reasons that pull China's score down relate to machine-readability and the ability to download data. Since the Chinese dataset is already provided in HTML, it is technically not difficult to address both of these aspects. Another reason for the reduced score is the lack of an explicit link to an open license, which should be even easier to address.
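The dataset-level scoring can be sketched as a weighted checklist. The criterion names and weights below are assumptions modelled on the ODB dataset checklist as best recalled, not values taken from this paper, and the set of criteria assumed to be unmet by the Chinese dataset (including linked data URLs, which the text does not mention for China) is likewise an assumption. Under these assumptions the sketch is consistent with the scores of 95 and 50 reported above; the US case is omitted because the text does not say which criteria it fails.

```python
# Assumed checklist weights (sum to 100); names are illustrative.
CHECKLIST_WEIGHTS = {
    "exists": 5,
    "available_online": 10,
    "machine_readable": 15,
    "bulk_download": 15,
    "free_of_charge": 15,
    "open_license": 15,
    "up_to_date": 10,
    "sustainable": 5,
    "easy_to_find": 5,
    "linked_data_urls": 5,
}


def dataset_score(unmet, weights=CHECKLIST_WEIGHTS):
    """Sum the weights of all criteria that are not in the `unmet` set."""
    unknown = set(unmet) - set(weights)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    return sum(w for name, w in weights.items() if name not in unmet)


# UK: the text names linked data URLs as the only limitation.
uk_score = dataset_score({"linked_data_urls"})

# China: machine-readability, download and open license are named in the
# text; treating linked data URLs as unmet too is an assumption.
china_score = dataset_score(
    {"machine_readable", "bulk_download", "open_license", "linked_data_urls"}
)
print(uk_score, china_score)
```

Viewed this way, the three criteria named for China account for 45 of the 50 missing points in this sketch, which is why addressing machine-readability, downloads and licensing would close most of the gap.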

Conclusion
We presented a heat-map analysis of the global open data landscape based on the ODB ranking system. We looked at the overall ODB index and its associated readiness, implementation and impact sub-indexes for countries of the low, lower-middle, upper-middle and high income levels. Our results show that in many countries the three sub-indexes do not exhibit dependency on each other. The impact sub-index is most often the weakest of the three, and in quite a few cases extremely low compared to the readiness and implementation sub-indexes. This observation shows that, on one hand, governments around the world are establishing more and more open data initiatives and citizens are engaging in an increasing number of open data activities; on the other hand, tangible political, social and economic benefits from open data remain to be seen. This may be because we are still in the early stage of the open data life cycle, where harvests are yet to be reaped. However, the extremely large gaps between the impact and the other two sub-indexes in some countries may warrant a thoughtful review of the existing open data initiatives to align investments more effectively with the expected results of open data.
In addition to the global open data perspective, we also provided a comparison of China with the leading open data advocates, the UK and the US. We found that although China lags behind the two other countries in all three sub-indexes, the gap in the implementation and impact sub-indexes is much larger than the gap in the readiness sub-index. This shows that the Chinese government has made good progress in facilitating open data initiatives on the policy and regulatory front, but more needs to be done, especially on putting those policies into execution, which is crucial for a positive impact of open data. The follow-up example of air quality data further confirmed China's clear progress in making data ready, while the data provided is not yet optimal for implementation. In summary, we believe that a number of natural steps can be taken to boost China's open data status to the next level: for example, establishing national and regional data portals would help interested parties find the right data, and making more machine-readable data available would dramatically improve its usability and value.