This page provides various explanations that are useful when working with Wikidata and its peculiar schema.
A good SPARQL tutorial specifically for Wikidata and its particular schema can be found here: https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial
Ranking of entities
For efficiency reasons, both numbers described below (the number of sitelinks and the number of triples per entity) should be precomputed and then added to the .nt file, for example with new predicates such as <num-sitelinks> or <num-triples>.
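The precomputation could look roughly like the following Python sketch. It assumes the input is a plain N-Triples file with one triple per line and that sitelinks appear as triples with the predicate <http://schema.org/about> (the expanded form of schema:about). The function names `count_ranking_signals` and `as_triples` are made up for this illustration, and the predicate IRIs <num-sitelinks> and <num-triples> are simply the placeholder names used above.

```python
from collections import Counter

SCHEMA_ABOUT = "<http://schema.org/about>"
XSD_INTEGER = "<http://www.w3.org/2001/XMLSchema#integer>"


def count_ranking_signals(nt_lines):
    """Return (num_triples, num_sitelinks) counters keyed by IRI.

    num_triples counts the triples in which an IRI occurs as subject.
    num_sitelinks counts the schema:about triples pointing at an IRI,
    mirroring the sitelink query on this page (assumption: any other
    schema:about triples in the data would have to be filtered out).
    """
    num_triples = Counter()
    num_sitelinks = Counter()
    for line in nt_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate contain no whitespace in N-Triples, so
        # two splits isolate them; the rest is the object plus " .".
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # not a complete triple
        subject, predicate, rest = parts
        obj = rest.rstrip()
        if obj.endswith("."):
            obj = obj[:-1].rstrip()
        num_triples[subject] += 1
        if predicate == SCHEMA_ABOUT:
            num_sitelinks[obj] += 1
    return num_triples, num_sitelinks


def as_triples(counts, predicate):
    """Yield one N-Triples line per entity carrying its precomputed count."""
    for entity, n in sorted(counts.items()):
        yield f'{entity} {predicate} "{n}"^^{XSD_INTEGER} .'
```

The lines produced by `as_triples(num_sitelinks, "<num-sitelinks>")` and `as_triples(num_triples, "<num-triples>")` can then be appended to the .nt file, so that a ranking query only needs to read a single precomputed value per entity.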
Ranking by number of sitelinks
A sitelink is a Wikipedia page that is about a certain Wikidata entity. The number of sitelinks is a good proxy for the popularity of an entity. For example, the query below returns all German cities and their populations, ordered by the number of sitelinks. If you run the query on WDQS (https://goo.gl/QhjjuL), note the strong correlation between rank and population size in the top results.
A practical challenge with this approach is that the schema:about triples are not part of the truthy excerpt of the Wikidata dataset. That is, one has to download the whole dataset, which is many times larger than the truthy version. It's not an actual problem, just more work.
    SELECT ?label ?population (COUNT(?sitelink) AS ?popularity) WHERE {
      ?city wdt:P31 wd:Q515 .          # ?city "instance of" "city"
      ?city wdt:P17 wd:Q183 .          # ?city "country" "Germany"
      ?city wdt:P1082 ?population .    # ?city "population" ?population
      ?city rdfs:label ?label .
      ?sitelink schema:about ?city .
      FILTER(LANG(?label) = "en")
    }
    GROUP BY ?label ?population
    ORDER BY DESC(?popularity)
    LIMIT 100
Ranking by number of triples
A simpler way to rank entities, which also works with the truthy data, is to order them by the number of triples in which they occur as subject. This works well for some queries (like the following), but less well for others, because quite a few relatively "unimportant" entities have a lot of "phony" triples (TODO: give an example).
    SELECT ?label ?population (COUNT(?o) AS ?popularity) WHERE {
      ?city wdt:P31 wd:Q515 .          # ?city "instance of" "city"
      ?city wdt:P17 wd:Q183 .          # ?city "country" "Germany"
      ?city wdt:P1082 ?population .    # ?city "population" ?population
      ?city rdfs:label ?label .
      ?city ?p ?o .                    # every triple with ?city as subject
      FILTER(LANG(?label) = "en")
    }
    GROUP BY ?label ?population
    ORDER BY DESC(?popularity)
    LIMIT 100
Obtaining names for entities and predicates
The Wikidata Query Service (WDQS) has an automatic mechanism for obtaining the single best name for entity variables in the query. It does not work for predicate names, however. The mechanism is invoked by adding this line to the SPARQL query:
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
To obtain a single name of an entity "manually" (that is, without the mechanism above), one can add the following two lines to a query (Q90 is Paris). Note that without the language filter, one gets one name for each language for which a name has been specified for that entity in Wikidata (for popular entities, these are usually very many):
    wd:Q90 rdfs:label ?label .
    FILTER(LANG(?label) = "en")
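The manual label lookup above can also be issued from a script. The following Python sketch wraps the two lines into a complete query and sends it to the public WDQS endpoint at https://query.wikidata.org/sparql. The helper names `build_label_query` and `fetch_labels` and the User-Agent string are made up for this illustration, and the sketch relies on WDQS predefining the wd: and rdfs: prefixes.

```python
import json
import re
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_label_query(entity_id, lang="en"):
    """Build a SPARQL query for the label of one entity in one language."""
    # Only accept plain IDs such as Q90 or P47 to keep the query well-formed.
    if not re.fullmatch(r"[QP]\d+", entity_id):
        raise ValueError(f"not a Wikidata ID: {entity_id!r}")
    if not re.fullmatch(r"[A-Za-z-]+", lang):
        raise ValueError(f"not a language code: {lang!r}")
    return (
        "SELECT ?label WHERE {\n"
        f"  wd:{entity_id} rdfs:label ?label .\n"
        f'  FILTER(LANG(?label) = "{lang}")\n'
        "}"
    )


def fetch_labels(entity_id, lang="en"):
    """Send the query to WDQS and return the label strings (needs network)."""
    params = urllib.parse.urlencode(
        {"query": build_label_query(entity_id, lang), "format": "json"}
    )
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        headers={"User-Agent": "label-lookup-example/0.1"},  # hypothetical name
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [b["label"]["value"] for b in data["results"]["bindings"]]
```

Calling `fetch_labels("Q90")` should return the English name of Paris as a one-element list, provided the endpoint is reachable.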
Obtaining the name for a predicate is a bit more complicated. Here is how to obtain all predicates and their English name for an entity (again Q90), sorted by the number of triples with that predicate and that entity as subject:
    SELECT ?p ?label (COUNT(?o) AS ?count) WHERE {
      wd:Q90 ?p ?o .
      ?x wikibase:claim ?p .       # ?x is the property entity behind ?p
      ?x rdfs:label ?label .
      FILTER(LANG(?label) = "en")
    }
    GROUP BY ?p ?label
    ORDER BY DESC(?count)
    LIMIT 100
There are a number of other wikibase: predicates that relate the different types of predicates to each other, shown here for property P47:
    wd:P47 wikibase:claim             p:P47
    wd:P47 wikibase:directClaim       wdt:P47
    wd:P47 wikibase:novalue           wdno:P47
    wd:P47 wikibase:qualifier         pq:P47
    wd:P47 wikibase:qualifierValue    pqv:P47
    wd:P47 wikibase:reference         pr:P47
    wd:P47 wikibase:referenceValue    prv:P47
    wd:P47 wikibase:statementProperty ps:P47
    wd:P47 wikibase:statementValue    psv:P47
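When working directly with an .nt file, the prefixes are expanded to full IRIs, so it can help to generate all variants of a property mechanically. The following Python sketch does this for the nine wikibase: predicates listed above. The helper name `predicate_variants` is made up for this illustration, and the namespace IRIs are the ones commonly used for these prefixes on WDQS; check them against your copy of the data.

```python
# Namespace IRIs for the property prefixes (assumed to match the dump in use).
PREFIXES = {
    "p": "http://www.wikidata.org/prop/",
    "wdt": "http://www.wikidata.org/prop/direct/",
    "wdno": "http://www.wikidata.org/prop/novalue/",
    "pq": "http://www.wikidata.org/prop/qualifier/",
    "pqv": "http://www.wikidata.org/prop/qualifier/value/",
    "pr": "http://www.wikidata.org/prop/reference/",
    "prv": "http://www.wikidata.org/prop/reference/value/",
    "ps": "http://www.wikidata.org/prop/statement/",
    "psv": "http://www.wikidata.org/prop/statement/value/",
}

# Which wikibase: predicate links a property entity to which prefixed form.
VARIANTS = [
    ("claim", "p"),
    ("directClaim", "wdt"),
    ("novalue", "wdno"),
    ("qualifier", "pq"),
    ("qualifierValue", "pqv"),
    ("reference", "pr"),
    ("referenceValue", "prv"),
    ("statementProperty", "ps"),
    ("statementValue", "psv"),
]


def predicate_variants(property_id):
    """Map each wikibase: link name to the full IRI for the given property."""
    return {
        link: f"<{PREFIXES[prefix]}{property_id}>"
        for link, prefix in VARIANTS
    }
```

For example, `predicate_variants("P47")["directClaim"]` gives the full IRI that corresponds to wdt:P47.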