1.4 THE LINKED OPEN DATA CLOUD
The LOD Cloud9 is a diagram that depicts the Linked Data datasets publicly available online. The diagram is updated regularly and is maintained by the Insight Center for Data Analytics,10 one of the biggest data science research centers in Europe.
Anyone can submit a dataset to the cloud, but it will only be accepted and added if it complies with the LOD Cloud principles, which are a slightly different version of the LD principles described in the section above. In order to be published, a dataset must respect the following rules.
1. The dataset must use resolvable http:// (or https://) URIs.
2. They must resolve, with or without content negotiation, to RDF data in one of the popular RDF formats (RDFa, RDF/XML, Turtle, N-Triples).
3. The dataset must contain at least 1000 triples.
4. The dataset must be connected via RDF links to a dataset that is already in the diagram. This means that either your dataset must use URIs from the other dataset, or vice versa. The maintainers arbitrarily require at least 50 links.
5. Access to the entire dataset must be possible via RDF crawling, via an RDF dump, or via a SPARQL endpoint.
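As an illustration only (not the official submission checker), the numeric thresholds in rules 3 and 4 can be sketched with a small script that scans a dataset serialized as N-Triples; the `check_dataset` helper and all URIs below are hypothetical examples:

```python
# Illustrative sketch, NOT the official LOD Cloud validator: count the
# triples in an N-Triples serialization (rule 3 requires >= 1000) and the
# RDF links whose object URI lives in another dataset's namespace
# (rule 4 requires >= 50 such links).

def check_dataset(ntriples: str, external_ns: str,
                  min_triples: int = 1000, min_links: int = 50):
    lines = [ln for ln in ntriples.splitlines() if ln.strip().endswith(".")]
    triples = len(lines)
    # a triple counts as an RDF link when its object is a URI
    # from the external dataset's namespace
    links = sum(1 for ln in lines
                if ("<" + external_ns) in ln.split(None, 2)[2])
    return {
        "triples": triples,
        "links": links,
        "rule3_ok": triples >= min_triples,
        "rule4_ok": links >= min_links,
    }

# Toy example: two triples, one owl:sameAs link into DBpedia
sample = """\
<http://example.org/book/1> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Book_One> .
<http://example.org/book/1> <http://purl.org/dc/terms/title> "Book One" .
"""
report = check_dataset(sample, "http://dbpedia.org/resource/")
print(report)  # 2 triples, 1 link: far below both thresholds
```

A real dataset would of course be parsed with a proper RDF library rather than line splitting, but the thresholds work the same way.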
Moreover, the maintainers of the LOD cloud developed an ad-hoc rating system for evaluating the quality of the published datasets. Although all the datasets respect the five rules described above, it is not assured that every dataset has the same characteristics. Generally, the evaluation metric takes into account several pieces of metadata associated with the dataset, such as the presence or absence of a SPARQL endpoint, information about the author, the presence or absence of metadata (and, eventually, the kind of metadata provided), and so forth. At the end of the process, each dataset is assigned a number of stars ranging from 1–5: the higher the number of stars, the higher the quality of the dataset.
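The exact metric is not reproduced here, but the general idea — scoring a dataset by which quality signals its metadata exposes — can be sketched as follows. The signal names and the one-star-per-signal weighting are illustrative assumptions, not the maintainers' actual formula:

```python
# Hypothetical illustration of metadata-based quality scoring; the real
# LOD cloud rating is more involved than one star per signal.

QUALITY_SIGNALS = ("sparql_endpoint", "author_info",
                   "license", "metadata", "dump")

def stars(dataset_meta: dict) -> int:
    # one star per quality signal present, clamped to the 1-5 range
    score = sum(1 for key in QUALITY_SIGNALS if dataset_meta.get(key))
    return max(1, min(5, score))

print(stars({"sparql_endpoint": True, "author_info": True}))  # -> 2
print(stars({}))                                              # -> 1
```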
The Linked Data Cloud was initially created in May 2007; at that time, it was composed of only 12 datasets. The LOD cloud contained the following.
• DBpedia, which is a Linked Data version of Wikipedia.
• Geonames, which contains a Linked Data version of geographical data.
• DBLP, which contains a Linked Data version of academic data.
• Project Gutenberg and RDF Book Mashup, which contain RDF data about books.
• Revyu, which contains reviews in the form of LD.
• MusicBrainz, DBtune, and Jamendo, which contain RDF data about the music domain.
• FOAF (acronym of Friend of a Friend), which is an ontology containing LD that describes information about people, their relations, their activities, and, more generally, social network data.
• World Factbook and U.S. census data, which contain government data in the form of RDF triples.
The cloud shows which datasets are related to which other datasets, and gives a qualitative indication of the number of properties connecting one dataset to another. A thin line indicates that two datasets are connected by a low number of properties, while a thick line represents a high number of relations connecting those datasets.
As time passed, more and more institutions started to publish their data according to the Linked Data Principles described in Section 1.3, and the cloud grew immensely. After only half a year the number of published datasets had already doubled, and by September 2011 the cloud had reached the remarkable number of 295 datasets. There is no data about the size of the cloud in 2012 and 2013, but in 2014 the cloud counted up to 570 datasets. Again, no data is available for 2015 and 2016, but from 2017 onward the LOD cloud has been updated regularly and there is plenty of information. The first record of 2017, dated January 26th, reported that the number of datasets had increased to 1,146 (double the number of datasets present in the cloud in 2014). During the following years, the race to publish Linked Data slowed down: the update of March 29, 2019 showed that the number of datasets present in the Linked Data Cloud is 1,239. Figure 1.5 represents the current Linked Data Cloud. Despite the time elapsed and the increasing number of datasets, DBpedia is still the biggest and most representative dataset of the LOD Cloud.
Figure 1.5: Linked Open Data Cloud (March 29, 2019).
As can be seen in Figure 1.5, the cloud is depicted as a partially connected graph. Each node of the graph represents a dataset, and a link between two nodes indicates that some kind of property connects elements of the two datasets. To help users navigate the LOD cloud, given that each dataset differs from the others both in size and in the domain it covers, the maintainers of the LOD cloud decided to enrich the graph with a visual notation. The number of triples contained in a dataset determines the size of its node, while the domain determines the color of the node. Moreover, to provide further aid during navigation, the cloud is subsequently divided into distinct subsections, one per domain.
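The size encoding can be sketched with a toy scaling function; the logarithmic scale, the constants, and the color palette below are all made-up assumptions for illustration, not the diagram's real parameters:

```python
import math

# Sketch of the visual encoding described above, with made-up constants:
# node size grows with the log of the triple count, and the fill color is
# looked up from the dataset's domain (palette colors are illustrative).

DOMAIN_COLORS = {"geography": "#29c", "linguistics": "#c92",
                 "life sciences": "#2c9", "media": "#92c"}

def node_radius(triple_count: int, base: float = 4.0,
                scale: float = 2.0) -> float:
    # logarithmic scaling keeps giant datasets like DBpedia from
    # dwarfing everything else in the diagram
    return base + scale * math.log10(max(triple_count, 1))

print(round(node_radius(9_500_000_000), 1))  # a DBpedia-sized node -> 24.0
print(round(node_radius(1_000), 1))          # a minimal 1000-triple dataset -> 10.0
```

With a linear scale instead, DBpedia's node would be nearly ten million times larger than the smallest admissible dataset's, which is why a log scale is the natural choice here.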
1.5 WEB OF DATA IN NUMBERS
The Linked Open Data Cloud is probably the best representation of the Web of Data. However, Figure 1.5 does not reflect the volume of data it contains. Each node of the graph, despite its small size, contains an incredible amount of data. For example, DBpedia alone contains more than 9.5 billion triples. Clearly, DBpedia is one of the biggest datasets around, but it is not the only one that reaches an incredibly high number of triples: Geonames, LinkedGeoData, and BabelNet are only a few other examples of huge datasets. Along with several other metadata, the LOD cloud records the number of triples composing every single dataset. Unfortunately, the triple count is not present for all the datasets, but analyzing in detail the datasets whose triple count is given, it turns out that the mean number of triples per dataset is approximately 176 million. This amounts to a total of 202 billion triples counted over 1,151 datasets!
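The back-of-the-envelope arithmetic behind those totals checks out:

```python
# Sanity check of the figures quoted above: 1,151 datasets at a mean of
# roughly 176 million triples each should total about 202 billion triples.

datasets = 1151
mean_triples = 176_000_000

total = datasets * mean_triples
print(f"{total:,}")            # -> 202,576,000,000
print(total / 1_000_000_000)   # roughly 202.6 billion triples
```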
The amount of Linked Data has risen in recent years also because the efforts of governments to be more transparent and responsive to citizens' demands have been increasing [Attard et al., 2015], which, in most cases, resulted in the publication of (linked) Open Data. Many online data portals exist and play a fundamental role in the expansion of the Web of Data. Portals like DataHub,11 the EU Open Data Portal,12 the European Data Portal,13 Data.Gov,14 and the Asia-Pacific SDG Data Portal15 act as repositories for all kinds of datasets (agriculture, economy, education, environment, government, justice, transport, …) from different countries, so that everyone can freely access those data. The only limitations on the usage of the data are defined by the licenses under which the data have been published, but generally they are not particularly restrictive. Thanks to those portals, the amount of data accessible through the Web is enormous: gathering together all the datasets those portals contain, it is easy to exceed the threshold of one million datasets. However, despite the incredibly high number of datasets, their size is limited; some datasets can be quite large, but most of them occupy only a few kilobytes.
There is no clear information about the volume of data already present on the Web, but it is easy to see that the number and the size of the datasets can only increase over time, reaching exabytes of information. Since that information hides a real treasure, in monetary terms, during the last period several data