This page provides various explanations that are useful when working with Wikidata and its peculiar schema.
A good SPARQL tutorial specifically for Wikidata and its particular schema can be found here: https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial
Standard prefixes can be found here: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Prefixes_used
Ranking entities
Ranking by the number of sitelinks
A sitelink is a Wikipedia page that is about a certain Wikidata entity. The number of sitelinks is a good proxy for the popularity of an entity. For example, the following query returns all German cities and their population, ordered by the number of sitelinks. If you run the query on the Wikidata Query Service (https://goo.gl/QhjjuL; this only works with a relatively small LIMIT), note how strongly the number of sitelinks correlates with population size in the top results.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT ?label ?population ?sitelinks WHERE {
  ?city wdt:P31 wd:Q515 .               # ?city "instance of" "city"
  ?city wdt:P17 wd:Q183 .               # ?city "country" "Germany"
  ?city wdt:P1082 ?population .         # ?city "population" ?population
  ?city rdfs:label ?label .
  ?city ^schema:about/wikibase:sitelinks ?sitelinks .
  FILTER (LANG(?label) = "en")
}
ORDER BY DESC(?sitelinks)
An alternative measure is the number of statements of a particular entity. This often gives unintuitive results, because many entities have a large number of "artificial" triples (for example, all kinds of IDs) without being particularly popular.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT ?label ?population ?statements WHERE {
  ?city wdt:P31 wd:Q515 .               # ?city "instance of" "city"
  ?city wdt:P17 wd:Q183 .               # ?city "country" "Germany"
  ?city wdt:P1082 ?population .         # ?city "population" ?population
  ?city rdfs:label ?label .
  ?city ^schema:about/wikibase:statements ?statements .
  FILTER (LANG(?label) = "en")
}
ORDER BY DESC(?statements)
Note that the schema:about, wikibase:sitelinks and wikibase:statements predicates require the full Wikidata dataset. They are not part of the truthy version. Also note that the dot before the FILTER clause is not necessary according to the SPARQL standard, but QLever currently (07.02.2020) still requires it.
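As a rough sketch of the remark about LIMIT above, the following Python helper appends a LIMIT clause to a query and encodes it as a GET request URL. The endpoint shown is the public Wikidata Query Service endpoint; the helper name build_query_url, the example query and the chosen limit are assumptions made for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode

# Public WDQS SPARQL endpoint; replace with your own endpoint (e.g. QLever) if needed.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query_url(query, limit, endpoint=WDQS_ENDPOINT):
    """Append a LIMIT clause to `query` and return a GET URL for `endpoint`.

    Hypothetical helper for illustration; it only builds the URL.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    limited_query = f"{query.rstrip()}\nLIMIT {limit}"
    return endpoint + "?" + urlencode({"query": limited_query, "format": "json"})


# Example query (assumption): German cities, without any ranking.
example_query = """PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?city WHERE { ?city wdt:P31 wd:Q515 . ?city wdt:P17 wd:Q183 }"""

example_url = build_query_url(example_query, 20)
```

The resulting URL can be opened with any HTTP client; a small limit keeps the request within the time budget of the public service.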
Ranking by number of triples
A simpler way to rank entities, which also works with the truthy data, is to order them by the number of triples in which they occur as subject. This works well for some queries (like the following), but not so well for others, because quite a lot of relatively "unimportant" entities have many "phony" triples (TODO: give an example).
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT ?label ?population (COUNT(?label) AS ?popularity) WHERE {
  ?city wdt:P31 wd:Q515 .
  ?city wdt:P17 wd:Q183 .
  ?city wdt:P1082 ?population .
  ?city rdfs:label ?label .
  ?city ?p ?o .
  FILTER (LANG(?label) = "en")
}
GROUP BY ?label ?population
ORDER BY DESC(?popularity)
LIMIT 100
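The same count can also be computed offline, directly from the .nt file. The following is a minimal sketch, assuming one triple per line with the subject as the first whitespace-separated token; the function name count_triples_per_subject and the sample triples are made up for illustration.

```python
from collections import Counter


def count_triples_per_subject(nt_lines):
    """Count, for each subject, the number of triples in which it is the subject."""
    counts = Counter()
    for line in nt_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        subject = line.split(None, 1)[0]  # subjects contain no whitespace
        counts[subject] += 1
    return counts


# Made-up sample in N-Triples syntax (not real Wikidata counts).
sample_nt = [
    "<http://www.wikidata.org/entity/Q64> <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q515> .",
    "<http://www.wikidata.org/entity/Q64> <http://www.wikidata.org/prop/direct/P17> <http://www.wikidata.org/entity/Q183> .",
    '<http://www.wikidata.org/entity/Q64> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@en .',
    "<http://www.wikidata.org/entity/Q1055> <http://www.wikidata.org/prop/direct/P17> <http://www.wikidata.org/entity/Q183> .",
]

triple_counts = count_triples_per_subject(sample_nt)
ranking = [subject for subject, _ in triple_counts.most_common()]
```

Here ranking lists the subjects from most to fewest triples, which corresponds to the ORDER BY DESC(?popularity) in the query above.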
Obtaining names
Names for entities
The Wikidata Query Service (WDQS) has an automatic mechanism for obtaining the single best name for entity variables in the query. It does not work for predicate names, however. The mechanism is invoked by adding this line to the SPARQL query:
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
To obtain a single name of an entity "manually" (that is, without the mechanism above), one can add the following two lines to a query (Q90 is Paris). Note that without the language filter, one gets one name for each language for which a name has been specified for that entity in Wikidata (for popular entities, these are usually very many):
wd:Q90 rdfs:label ?label .
FILTER (LANG(?label) = "en")
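To illustrate what the language filter does, here is a small Python sketch that applies the same test to label literals in N-Triples syntax. The comparison is an exact match on the language tag, so a tag such as en-gb does not pass a filter for en. The function name labels_in_language and the sample literals are assumptions for illustration.

```python
import re

# A language-tagged literal such as "Paris"@en (escaped quotes are allowed).
LITERAL = re.compile(
    r'^"(?P<text>(?:[^"\\]|\\.)*)"@(?P<lang>[A-Za-z]+(?:-[A-Za-z0-9]+)*)$'
)


def labels_in_language(literals, lang):
    """Keep the text of those literals whose language tag equals `lang` exactly."""
    result = []
    for literal in literals:
        match = LITERAL.match(literal)
        if match and match.group("lang") == lang:
            result.append(match.group("text"))
    return result


# Made-up sample of labels for one entity.
sample_labels = ['"Paris"@en', '"Paris"@de', '"Parigi"@it', '"Paris"@en-gb']
english_labels = labels_in_language(sample_labels, "en")
```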
Names for predicates
Obtaining the name for a predicate is a bit more complicated. Here is how to obtain all predicates and their English name for an entity (again Q90), sorted by the number of triples with that predicate and that entity as subject:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?p (COUNT(?o) as ?count) (SAMPLE(?label) as ?pname) WHERE {
  wd:Q90 ?p ?o .
  ?x wikibase:claim ?p .
  ?x rdfs:label ?label .
  FILTER (LANG(?label) = "en")
}
GROUP BY ?p
ORDER BY DESC(?count)
LIMIT 100
There are a number of other wikibase: predicates that relate the different types of predicates to each other:
wd:P47 wikibase:claim p:P47
wd:P47 wikibase:directClaim wdt:P47
wd:P47 wikibase:novalue wdno:P47
wd:P47 wikibase:qualifier pq:P47
wd:P47 wikibase:qualifierValue pqv:P47
wd:P47 wikibase:reference pr:P47
wd:P47 wikibase:referenceValue prv:P47
wd:P47 wikibase:statementProperty ps:P47
wd:P47 wikibase:statementValue psv:P47
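The list above can be written down as a lookup table. The following Python sketch maps each wikibase: predicate to the prefix of the predicate it points to, together with the namespace IRI of that prefix as given in the standard Wikibase prefix list. The helper names predicate_variants and expand are hypothetical.

```python
PROP = "http://www.wikidata.org/prop/"

# wikibase: link predicate -> (prefix, namespace IRI of that prefix)
WIKIBASE_LINKS = {
    "wikibase:claim": ("p", PROP),
    "wikibase:directClaim": ("wdt", PROP + "direct/"),
    "wikibase:novalue": ("wdno", PROP + "novalue/"),
    "wikibase:qualifier": ("pq", PROP + "qualifier/"),
    "wikibase:qualifierValue": ("pqv", PROP + "qualifier/value/"),
    "wikibase:reference": ("pr", PROP + "reference/"),
    "wikibase:referenceValue": ("prv", PROP + "reference/value/"),
    "wikibase:statementProperty": ("ps", PROP + "statement/"),
    "wikibase:statementValue": ("psv", PROP + "statement/value/"),
}


def predicate_variants(property_id):
    """Return all prefixed predicate forms of a property ID such as 'P47'."""
    return {
        link: f"{prefix}:{property_id}"
        for link, (prefix, _) in WIKIBASE_LINKS.items()
    }


def expand(link, property_id):
    """Return the full IRI of the predicate that `link` points to."""
    _, namespace = WIKIBASE_LINKS[link]
    return namespace + property_id


variants_p47 = predicate_variants("P47")
```

For example, expand("wikibase:directClaim", "P47") gives the full IRI behind wdt:P47.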
Aliases of an entity
In the .nt file, aliases are expressed as follows:
<http://www.wikidata.org/entity/Q567> <http://www.w3.org/2004/02/skos/core#altLabel> "Angela Dorothea Merkel"@de
There are also some other predicates that provide alternative names, see this question on Stack Overflow: https://stackoverflow.com/questions/40593452/how-to-retrieve-aliases-from-wikidata. To get all aliases of an entity according to that list, write something like this:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?alias WHERE {
  wd:Q567 (wdt:P735|wdt:P734)?/(wdt:P1813|wdt:P1559|wdt:P1477|wdt:P1449|wdt:P742|skos:altLabel|rdfs:label) ?alias .
  FILTER (LANG(?alias) = "en")
}
The predicates wdt:P735 and wdt:P734 yield the given name and family name, respectively. Note that the objects of these predicates are entities, not literals.
The ? after the (...) means that sequences of length 0 or 1 of these predicates are considered (that is, either one of the predicates is used or none). With * one would get all sequences of length >= 0. For the query above, this would give the same result. With + one would get all sequences of length >= 1. For the query above, this would give fewer results, because the names attached directly to wd:Q567 (the sequence of length 0) would be missing.
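The difference between the three path operators can be sketched on a tiny graph in Python. The edges stand for the alternative (wdt:P735|wdt:P734); the two target node names are placeholders, not real Wikidata IDs, and the function names are chosen for illustration only.

```python
def step(nodes, edges):
    """All nodes reachable from `nodes` by following exactly one edge."""
    return {target for node in nodes for target in edges.get(node, ())}


def closure(nodes, edges):
    """All nodes reachable from `nodes` by following zero or more edges."""
    result = set(nodes)
    frontier = set(nodes)
    while frontier:
        frontier = step(frontier, edges) - result  # only expand new nodes
        result |= frontier
    return result


def zero_or_one(start, edges):   # corresponds to path?
    return {start} | step({start}, edges)


def zero_or_more(start, edges):  # corresponds to path*
    return closure({start}, edges)


def one_or_more(start, edges):   # corresponds to path+
    return closure(step({start}, edges), edges)


# Placeholder graph: wd:Q567 has a given-name item and a family-name item.
name_edges = {"wd:Q567": ["given_name_item", "family_name_item"]}

with_question_mark = zero_or_one("wd:Q567", name_edges)
with_star = zero_or_more("wd:Q567", name_edges)
with_plus = one_or_more("wd:Q567", name_edges)
```

On this graph, ? and * reach the same three nodes, while + no longer contains wd:Q567 itself, so its own labels and aliases would be lost.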
The predicates wdt:P1813, wdt:P1559, wdt:P1477, wdt:P1449, wdt:P742, skos:altLabel and rdfs:label yield short name, name in native language, birth name, nickname, pseudonym, aliases and unique (per language) label, respectively. The objects for these predicates are literals. Note that the literals from skos:altLabel are in general not a superset of the literals from rdfs:label.
For the sake of completeness, one could include wdt:P297 and wdt:P298, which yield the two- and three-letter ISO codes for countries, or wdt:P1160, which yields the ISO 4 abbreviation. There are probably some others which could be relevant as well in some cases.
Note that skos stands for simple knowledge organization system. The simple is obviously a joke, given that the full URI is http://www.w3.org/2004/02/skos/core#altLabel .
Aliases of a predicate
The following query gives names and aliases for any predicate that is connected (via one of the wikibase:... predicates above) to a wd:... entity.
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX p: <http://www.wikidata.org/prop/>
SELECT ?name_or_alias WHERE {
  ?predicate_entity ?p p:P40 .
  ?predicate_entity rdfs:label|skos:altLabel ?name_or_alias .
  FILTER (LANG(?name_or_alias) = "en") .
}
Statements and value-qualifier tuples for an entity
For example, all (population, point in time) pairs for Paris (Q90):
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?value ?time WHERE {
  wd:Q90 p:P1082 ?statement .     # P1082 = population
  ?statement ps:P1082 ?value .    # P1082 = population
  ?statement pq:P585 ?time        # P585 = point in time
}
ORDER BY DESC(?time)
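The query above joins three triple patterns through the intermediate statement node. A minimal Python sketch of that join on an in-memory triple list may make the structure clearer; the statement node names, population values and dates are made up, and the function name value_qualifier_pairs is hypothetical.

```python
def value_qualifier_pairs(triples, entity, prop, qualifier):
    """Return (value, qualifier value) pairs for `entity` and `prop`, newest first.

    Statements lacking the qualifier yield no pair, as in the SPARQL join.
    """
    statements = [o for s, p, o in triples if s == entity and p == f"p:{prop}"]
    pairs = []
    for statement in statements:
        values = [o for s, p, o in triples if s == statement and p == f"ps:{prop}"]
        times = [o for s, p, o in triples if s == statement and p == f"pq:{qualifier}"]
        pairs.extend((value, time) for value in values for time in times)
    # ISO dates sort correctly as strings
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up data: three population statements, one without a point in time.
sample_triples = [
    ("wd:Q90", "p:P1082", "statement_1"),
    ("statement_1", "ps:P1082", 2000000),
    ("statement_1", "pq:P585", "2010-01-01"),
    ("wd:Q90", "p:P1082", "statement_2"),
    ("statement_2", "ps:P1082", 2100000),
    ("statement_2", "pq:P585", "2015-01-01"),
    ("wd:Q90", "p:P1082", "statement_3"),
    ("statement_3", "ps:P1082", 1900000),
]

population_history = value_qualifier_pairs(sample_triples, "wd:Q90", "P1082", "P585")
```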
Problems with the Wikidata schema
Query: City, country, population
1. Many cities in Wikidata are not "instance of" (wdt:P31) "city" (wd:Q515), but "big city" (wd:Q4905452), which in turn is a "subclass of" (wdt:P279) city. Many of the example queries in Wikidata therefore use the following triple to specify that a variable is a city:
?city wdt:P31/wdt:P279* wd:Q515
2. Unlike Freebase, Wikidata has a property "country" (wdt:P17) to relate items (in particular: cities) to their countries. There is also "contains administrative territorial entity" (wdt:P150), but it is much less complete: the first query below has a result with 10,560,170 rows, the second query has a result with only 21,344 rows.
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?city ?country WHERE {
  ?city wdt:P17 ?country .
}
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?city ?country WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515 .
  ?country wdt:P31 wd:Q6256 .
  ?city wdt:P17 ?country
}
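The pattern wdt:P31/wdt:P279* from point 1 can be read as: take the target class and all of its transitive subclasses, then collect everything that is an instance of any of them. The following Python sketch spells this out on made-up data; the class IDs are the ones given in the text above, while the city names, the intermediate class and the function name instances_of are placeholders.

```python
def instances_of(target_class, instance_of, subclass_of):
    """Emulate `?item wdt:P31/wdt:P279* target_class` on in-memory pairs.

    `instance_of` holds (item, class) pairs, `subclass_of` holds (subclass, superclass) pairs.
    """
    classes = {target_class}
    frontier = {target_class}
    while frontier:
        # one step down the subclass hierarchy, ignoring classes already seen
        frontier = {sub for sub, sup in subclass_of if sup in frontier} - classes
        classes |= frontier
    return {item for item, cls in instance_of if cls in classes}


# Made-up data: "big city" is a subclass of "city" (IDs as given in the text).
sample_subclass_of = [
    ("wd:Q4905452", "wd:Q515"),
    ("hypothetical_subclass", "wd:Q4905452"),
]
sample_instance_of = [
    ("city_a", "wd:Q515"),
    ("city_b", "wd:Q4905452"),
    ("city_c", "hypothetical_subclass"),
    ("country_x", "wd:Q6256"),
]

cities_with_subclasses = instances_of("wd:Q515", sample_instance_of, sample_subclass_of)
cities_direct_only = {item for item, cls in sample_instance_of if cls == "wd:Q515"}
```

Comparing the two results shows which items a plain wdt:P31 wd:Q515 pattern would miss.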