DBpedia SPARQL query returns multiple and duplicate records
Problem description
I am quite new to SPARQL and am getting confused by the many syntax variants that exist for it. I am struggling to fetch unique data from DBpedia using the following query:
SELECT DISTINCT ?Museum, ?name, ?abstract, ?thumbnail, ?latitude,
                ?longitude, ?photoCollection, ?website, ?homepage, ?wikilink
WHERE {
  ?Museum a dbpedia-owl:Museum ;
          dbpprop:name ?name ;
          dbpedia-owl:abstract ?abstract ;
          dbpedia-owl:thumbnail ?thumbnail ;
          geo:lat ?latitude ;
          geo:long ?longitude ;
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ;
          foaf:homepage ?homepage ;
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN"))
  FILTER (langMatches(lang(?name),"EN"))
}
LIMIT 20
As anyone can see, the entries for Geffrye_Museum and Institute_for_Museum_Research are repeated in the results, because Institute_for_Museum_Research has two different values for its name and Geffrye_Museum has two longitude values. In both cases I want the second value to be discarded: for Geffrye_Museum the longitude value -0.0762194 must be ignored, and for Institute_for_Museum_Research the name value "Institut für Museumsforschung"@en must be ignored.
Note that I am already filtering the fields I want; this is simply an abundance of data in DBpedia that I want to handle at the query level. So how can I make DBpedia return only the first value when there are multiple values for the same column?
Recommended answer
Let's look at one case first. For the Geffrye Museum, the duplicate results occur because multiple longitudes are present in the data, as the following query demonstrates:
SELECT ?museum ?latitude ?longitude
WHERE {
  VALUES ?museum { dbpedia:Geffrye_Museum }
  ?museum a dbpedia-owl:Museum ;
          geo:lat ?latitude ;
          geo:long ?longitude .
}
GROUP BY ?museum ?latitude ?longitude
This produces:
museum                                      latitude  longitude
http://dbpedia.org/resource/Geffrye_Museum  51.5317   -0.07663
http://dbpedia.org/resource/Geffrye_Museum  51.5317   -0.0762194
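To see why a single extra value is enough to duplicate a row, it can help to model the matching outside SPARQL. The following is a minimal Python sketch, not how DBpedia or any SPARQL engine is implemented: the dictionary layout and the `solutions` helper are illustrative assumptions, while the coordinate values are the ones from the table above. Each result row corresponds to one combination of property values, so the row count is the product of the number of values per property.

```python
from itertools import product

# Values taken from the result table above; the dict layout is an
# illustrative stand-in for the RDF data, not how DBpedia stores it.
geffrye = {
    "museum": ["http://dbpedia.org/resource/Geffrye_Museum"],
    "latitude": [51.5317],
    "longitude": [-0.07663, -0.0762194],
}


def solutions(resource):
    """Return one row per combination of property values."""
    keys = list(resource)
    return [dict(zip(keys, combo)) for combo in product(*resource.values())]


rows = solutions(geffrye)
for row in rows:
    print(row["museum"], row["latitude"], row["longitude"])
```

The two rows differ in their longitude, which is why DISTINCT cannot collapse them. Under the same model, a second property with two values would multiply the count again, giving four rows for one museum.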
Fortunately, this is easy enough to remedy. As discussed in this question, you can group the results by their characteristic values, and then sample, minimize, maximize, etc., over the remaining values to get precisely what you want. For instance, if you want the greatest longitude, you can use (MAX(?longitude) as ?longitude) in your SELECT, as in the following query, which produces a single result.
SELECT ?museum ?latitude (MAX(?longitude) as ?longitude)
WHERE {
  VALUES ?museum { dbpedia:Geffrye_Museum }
  ?museum a dbpedia-owl:Museum ;
          geo:lat ?latitude ;
          geo:long ?longitude .
}
GROUP BY ?museum ?latitude
Of course, grouping by ?latitude and maximizing over ?longitude presumes that you already know which property has the duplicate values. It's probably a better idea to group by ?museum alone and use aggregate projections to pull out the other values, as in:
SELECT ?museum (MAX(?latitude) as ?latitude) (MAX(?longitude) as ?longitude)
WHERE {
  VALUES ?museum { dbpedia:Geffrye_Museum }
  ?museum a dbpedia-owl:Museum ;
          geo:lat ?latitude ;
          geo:long ?longitude .
}
GROUP BY ?museum
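The effect of grouping by the museum alone and aggregating every other column can be sketched in plain Python. This is only an in-memory illustration of the idea, assuming the two Geffrye Museum rows shown earlier; `group_and_aggregate` is a hypothetical helper, not part of any SPARQL library.

```python
from collections import defaultdict

# Rows as returned by the ungrouped query (values from the table above).
rows = [
    ("http://dbpedia.org/resource/Geffrye_Museum", 51.5317, -0.07663),
    ("http://dbpedia.org/resource/Geffrye_Museum", 51.5317, -0.0762194),
]


def group_and_aggregate(rows, aggregate=max):
    """Group rows by museum and reduce each remaining column to one value."""
    groups = defaultdict(list)
    for museum, latitude, longitude in rows:
        groups[museum].append((latitude, longitude))
    return {
        museum: (
            aggregate(lat for lat, _ in values),
            aggregate(lon for _, lon in values),
        )
        for museum, values in groups.items()
    }


by_max = group_and_aggregate(rows, max)
by_min = group_and_aggregate(rows, min)
print(by_max)
print(by_min)
```

Either aggregate leaves exactly one row per museum. With these two values, the maximum keeps -0.0762194 and the minimum keeps -0.07663, so which aggregate is appropriate depends on which value you want to survive.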
Applying this approach to all the variables produces something like this:
SELECT DISTINCT ?Museum
  (SAMPLE(?name) as ?name)
  (SAMPLE(?abstract) as ?abstract)
  (SAMPLE(?thumbnail) as ?thumbnail)
  (MAX(?latitude) as ?latitude)
  (MAX(?longitude) as ?longitude)
  (SAMPLE(?photoCollection) as ?photoCollection)
  (SAMPLE(?website) as ?website)
  (SAMPLE(?homepage) as ?homepage)
  (SAMPLE(?wikilink) as ?wikilink)
WHERE {
  ?Museum a dbpedia-owl:Museum ;
          dbpprop:name ?name ;
          dbpedia-owl:abstract ?abstract ;
          dbpedia-owl:thumbnail ?thumbnail ;
          geo:lat ?latitude ;
          geo:long ?longitude ;
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ;
          foaf:homepage ?homepage ;
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN"))
  FILTER (langMatches(lang(?name),"EN"))
}
GROUP BY ?Museum
LIMIT 20
It might seem a bit awkward to have to use an aggregate projection on every variable, but it works. Alternatively, you can do the aggregation in a subquery first, which cleans up the variable projections at the cost of a subquery. (The subquery doesn't necessarily have a negative impact on performance; in fact, it could have the opposite effect. The query itself is a bit harder to read, though.)
SELECT * WHERE {
  # Select museums and a single latitude and longitude for them.
  {
    SELECT ?Museum (MAX(?longitude) as ?longitude) (MAX(?latitude) as ?latitude) WHERE {
      ?Museum a dbpedia-owl:Museum ;
              geo:lat ?latitude ;
              geo:long ?longitude .
    }
    GROUP BY ?Museum
  }
  # Get the rest of the properties of the museum.
  ?Museum dbpprop:name ?name ;
          dbpedia-owl:abstract ?abstract ;
          dbpedia-owl:thumbnail ?thumbnail ;
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ;
          foaf:homepage ?homepage ;
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN"))
  FILTER (langMatches(lang(?name),"EN"))
}
LIMIT 20
Finally, since you need to normalize over names as well as geographic coordinates, your final query would be something like the following. In your question, you only said that you wanted to keep the "first result," but no particular order is imposed on the results, so there is no unique "first result." With the data at hand, you can use (MIN(?name) as ?name) and you'll get the name you wanted for the Institute for Museum Research, but if you have a particular constraint in mind, you'll need to figure out how to make it more specific.
SELECT * WHERE {
  # Select museums and a single latitude, longitude, and name for them.
  {
    SELECT ?Museum
           (MIN(?name) as ?name)
           (MAX(?longitude) as ?longitude)
           (MAX(?latitude) as ?latitude)
    WHERE {
      ?Museum a dbpedia-owl:Museum ;
              dbpprop:name ?name ;
              geo:lat ?latitude ;
              geo:long ?longitude .
      FILTER (langMatches(lang(?name),"EN"))
    }
    GROUP BY ?Museum
  }
  # Get the rest of the properties of the museum.
  ?Museum dbpprop:name ?name ;
          dbpedia-owl:abstract ?abstract ;
          dbpedia-owl:thumbnail ?thumbnail ;
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ;
          foaf:homepage ?homepage ;
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN"))
}
LIMIT 20
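The remark that there is no unique "first result" can be made concrete with a small Python sketch. It assumes only the two Geffrye Museum longitudes from the question, listed in the two orders in which an endpoint might happen to return them; the `first` helper is hypothetical and stands in for any "take whatever comes first" strategy.

```python
# The two longitudes for the Geffrye Museum, in both possible arrival orders.
order_a = [-0.07663, -0.0762194]
order_b = [-0.0762194, -0.07663]


def first(values):
    """Pick whatever value happens to arrive first (order-dependent)."""
    return values[0]


picks_first = {first(order_a), first(order_b)}
picks_min = {min(order_a), min(order_b)}
picks_max = {max(order_a), max(order_b)}

print(len(picks_first))                # number of distinct "first" answers
print(len(picks_min), len(picks_max))  # number of distinct MIN / MAX answers
```

Taking the first value gives a different answer for each ordering, while the minimum and maximum do not depend on the order at all. SAMPLE is specified to return an arbitrary value from the group, so it is closer to the order-dependent pick; if a specific value must survive, an aggregate such as MIN or MAX is the safer choice.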