DBpedia SPARQL query returns multiple and duplicate records

Question

I am quite new to SPARQL and also becoming confused by the manifold syntax standards existing for it. I am struggling to fetch unique data from DBpedia using the following query:

SELECT DISTINCT ?Museum, ?name, ?abstract, ?thumbnail, ?latitude,
   ?longitude, ?photoCollection, ?website, ?homepage, ?wikilink
WHERE { 
  ?Museum a dbpedia-owl:Museum ; 
          dbpprop:name ?name ; 
          dbpedia-owl:abstract ?abstract ; 
          dbpedia-owl:thumbnail ?thumbnail ; 
          geo:lat ?latitude ;  
          geo:long ?longitude ; 
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ; 
          foaf:homepage ?homepage ; 
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN")) 
  FILTER (langMatches(lang(?name),"EN"))
}
LIMIT 20

As anyone can see, the entries for Geffrye_Museum and Institute_for_Museum_Research are repeated in the results, because Institute_for_Museum_Research has two different values for its name and Geffrye_Museum has two longitude values. In both of these duplicate cases, I want the second value to be discarded; i.e., for Geffrye_Museum the longitude value -0.0762194 must be ignored, and for Institute_for_Museum_Research the name value "Institut für Museumsforschung"@en must be ignored.

Note that I am already filtering the fields I want; this is simply an abundance of data in DBpedia that I want to handle at the query level itself. So how can I make DBpedia return only the first value when there are multiple values for the same column?

Answer

Let's look at one case first. In the case of the Geffrye the duplicate results occur because multiple longitudes are present in the data, as the following query demonstrates:

SELECT ?museum ?latitude ?longitude
WHERE { 
  VALUES ?museum { dbpedia:Geffrye_Museum }
  ?museum a dbpedia-owl:Museum ; 
          geo:lat ?latitude ;  
          geo:long ?longitude .
}
GROUP BY ?museum ?latitude ?longitude

This produces:

museum                                     latitude longitude
http://dbpedia.org/resource/Geffrye_Museum 51.5317  -0.07663
http://dbpedia.org/resource/Geffrye_Museum 51.5317  -0.0762194
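
To see why DISTINCT cannot collapse these rows, it may help to model the matching step outside SPARQL. The following is a minimal Python sketch, not DBpedia's actual evaluation strategy: it assumes a hypothetical, simplified set of property values (the single name is made up for the illustration; the two longitudes are the ones in the table above) and builds one row per combination of values, which is roughly what a basic graph pattern match returns.

```python
from itertools import product

# Hypothetical, simplified property values for one museum. The two longitudes
# come from the result table above; the single name is an assumption.
properties = {
    "name": ["Geffrye Museum"],
    "latitude": [51.5317],
    "longitude": [-0.07663, -0.0762194],
}


def solution_rows(props):
    """Return one row per combination of property values."""
    keys = list(props)
    return [dict(zip(keys, combo)) for combo in product(*(props[k] for k in keys))]


rows = solution_rows(properties)
distinct_rows = {tuple(row.items()) for row in rows}

for row in rows:
    print(row)
```

The two rows differ in their longitude, so they are genuinely distinct rows rather than duplicates, which is why DISTINCT leaves both in place.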

Fortunately, this is easy enough to remedy. As discussed in this question, you can group the results by their characteristic values, and then sample, minimize, maximize, etc., over the remaining values to get precisely what you want. For instance, if you want the greatest longitude, you can use (MAX(?longitude) as ?longitude) in your SELECT, as in the following query, which produces a single row.

SELECT ?museum ?latitude (MAX(?longitude) as ?longitude)
WHERE { 
  VALUES ?museum { dbpedia:Geffrye_Museum }
  ?museum a dbpedia-owl:Museum ; 
          geo:lat ?latitude ;  
          geo:long ?longitude .
}
GROUP BY ?museum ?latitude

Of course, it presumes a bit of knowledge to group by ?latitude and to maximize over ?longitude. It's probably a better idea to just group by ?museum and use aggregate projection to pull out the other values, as in:

SELECT ?museum (MAX(?latitude) as ?latitude) (MAX(?longitude) as ?longitude)
WHERE { 
  VALUES ?museum { dbpedia:Geffrye_Museum }
  ?museum a dbpedia-owl:Museum ; 
          geo:lat ?latitude ;  
          geo:long ?longitude .
}
GROUP BY ?museum
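
The effect of GROUP BY ?museum with MAX over the other columns can be sketched in plain Python. This is only an illustration of the grouping idea on the two rows shown earlier, not how the endpoint evaluates the query; the function name group_max is made up for the example.

```python
from collections import defaultdict

# The two result rows for the Geffrye Museum: (museum, latitude, longitude).
rows = [
    ("Geffrye_Museum", 51.5317, -0.07663),
    ("Geffrye_Museum", 51.5317, -0.0762194),
]


def group_max(rows):
    """Group rows by museum and keep the maximum latitude and longitude."""
    grouped = defaultdict(list)
    for museum, lat, lon in rows:
        grouped[museum].append((lat, lon))
    return {
        museum: (max(lat for lat, _ in values), max(lon for _, lon in values))
        for museum, values in grouped.items()
    }


result = group_max(rows)
print(result)
```

With these two values, MAX keeps -0.0762194, which is the longitude the question wanted to discard; MIN would keep -0.07663 instead. Which aggregate is appropriate depends on which value you actually want.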

Taking this approach to all the variables produces something like this:

SELECT DISTINCT ?Museum
  (SAMPLE(?name) as ?name)
  (SAMPLE(?abstract) as ?abstract)
  (SAMPLE(?thumbnail) as ?thumbnail)
  (MAX(?latitude) as ?latitude)
  (MAX(?longitude) as ?longitude)
  (SAMPLE(?photoCollection) as ?photoCollection)
  (SAMPLE(?website) as ?website)
  (SAMPLE(?homepage) as ?homepage)
  (SAMPLE(?wikilink) as ?wikilink)
WHERE { 
  ?Museum a dbpedia-owl:Museum ; 
          dbpprop:name ?name ; 
          dbpedia-owl:abstract ?abstract ; 
          dbpedia-owl:thumbnail ?thumbnail ; 
          geo:lat ?latitude ;  
          geo:long ?longitude ; 
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ; 
          foaf:homepage ?homepage ; 
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN")) 
  FILTER (langMatches(lang(?name),"EN"))
}
GROUP BY ?Museum
LIMIT 20

It might seem a bit awkward to have to use the aggregate projection on all your variables, but it will work. However, you can also do the aggregation in a subquery first, and that will clean the variable projections up, at the cost of a subquery. (The subquery doesn't necessarily have a negative impact on the query; in fact it could be the opposite. The query itself is a bit harder to read, though.)

SELECT * WHERE { 
  # Select museums and a single latitude and longitude for them.
  {
    SELECT ?Museum (MAX(?longitude) as ?longitude) (MAX(?latitude) as ?latitude) WHERE {
      ?Museum a dbpedia-owl:Museum ;
              geo:lat ?latitude ;
              geo:long ?longitude .
    }
    GROUP BY ?Museum
  }
  # Get the rest of the properties of the museum.
  ?Museum dbpprop:name ?name ;
          dbpedia-owl:abstract ?abstract ; 
          dbpedia-owl:thumbnail ?thumbnail ; 
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ; 
          foaf:homepage ?homepage ; 
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN")) 
  FILTER (langMatches(lang(?name),"EN"))
}
LIMIT 20
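
The subquery approach amounts to "aggregate first, then join". The Python sketch below illustrates that shape under stated assumptions: the coordinates for Institute_for_Museum_Research and the name for the Geffrye Museum are hypothetical placeholders, and only the name property is joined, to keep the example short.

```python
# (museum, latitude, longitude); the Institute's coordinates are hypothetical.
coords = [
    ("Geffrye_Museum", 51.5317, -0.07663),
    ("Geffrye_Museum", 51.5317, -0.0762194),
    ("Institute_for_Museum_Research", 52.5, 13.4),
]

# (museum, name); the two Institute names are the ones quoted in the question.
names = [
    ("Geffrye_Museum", "Geffrye Museum"),
    ("Institute_for_Museum_Research", "Institute for Museum Research"),
    ("Institute_for_Museum_Research", "Institut für Museumsforschung"),
]

# Step 1 (the subquery): one (max latitude, max longitude) pair per museum.
aggregated = {}
for museum, lat, lon in coords:
    best_lat, best_lon = aggregated.get(museum, (lat, lon))
    aggregated[museum] = (max(best_lat, lat), max(best_lon, lon))

# Step 2 (the outer pattern): join the remaining properties onto that result.
joined = [
    (museum, name, *aggregated[museum])
    for museum, name in names
    if museum in aggregated
]

for row in joined:
    print(row)
```

The coordinates no longer multiply the rows, but any property left outside the aggregation still does: the museum with two names continues to appear twice after the join.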

Finally, since you need to normalize over names as well as geographic coordinates, your final query would be something like the following. In your question, you only said that you wanted to keep the "first result," but there's no particular order imposed on the results, so there is no unique "first result." With the data at hand, you can use (MIN(?name) as ?name) and you'll get the name you wanted for the Institute for Museum Research, but if you have a particular constraint in mind, you'll need to figure out how to make that more specific.

SELECT * WHERE { 
  # Select museums and a single latitude, longitude, and name for them.
  {
    SELECT ?Museum 
           (MIN(?name) as ?name)
           (MAX(?longitude) as ?longitude)
           (MAX(?latitude) as ?latitude)
    WHERE {
      ?Museum a dbpedia-owl:Museum ;
              dbpprop:name ?name ;
              geo:lat ?latitude ;
              geo:long ?longitude .
      FILTER (langMatches(lang(?name),"EN"))
    }
    GROUP BY ?Museum
  }
  # Get the rest of the properties of the museum.
  ?Museum dbpprop:name ?name ;
          dbpedia-owl:abstract ?abstract ; 
          dbpedia-owl:thumbnail ?thumbnail ; 
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ; 
          foaf:homepage ?homepage ; 
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN")) 
}
LIMIT 20
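
The point about there being no unique "first result" can be made concrete with a small Python sketch. It is an analogy rather than a model of the endpoint: it treats the two longitudes from the earlier table as rows that may arrive in any order and compares "take the first" with an explicit rule such as MIN.

```python
from itertools import permutations

# The two longitude values reported for the Geffrye Museum.
longitudes = [-0.07663, -0.0762194]

# "First" depends on the order in which the rows happen to arrive.
firsts = {order[0] for order in permutations(longitudes)}

# An explicit rule such as MIN gives the same answer for every ordering.
minima = {min(order) for order in permutations(longitudes)}

print(firsts)
print(minima)
```

Picking the first row can yield either value, whereas an aggregate with a defined rule always yields the same one, which is why the queries above use MIN and MAX rather than relying on result order.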
