如何优化返回可选属性的SPARQL查询? [英] How can I optimize a SPARQL query that returns optional properties?

查看:66
本文介绍了如何优化返回可选属性的SPARQL查询?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何优化如下所示的SPARQL查询?

How can I optimize a SPARQL query like the following?

此查询的目的是:

  1. 指定资源(国家资源在countryCode = "US"处)
  2. 获取在资源上定义的可选属性.

不幸的是,在父块之前先评估OPTIONAL块,这会使查询引擎加载所有国家/地区的所有数据.

Unfortunately, the OPTIONAL blocks are being evaluated before the parent block, which causes the query engine to load all data for all countries.

我想要的是类似LEFT OUTER JOIN的行为,但是查询引擎没有以这种方式进行处理.

What I want is something like a LEFT OUTER JOIN behavior, but the query engine is not handling it this way.

我该如何提高查询性能?

What can I do to improve query performance?

SELECT  *
WHERE
  { 
    ?type (rdfs:subClassOf)* gj:Country .
    ?this_0  rdf:type        ?type ;
             gn:countryCode  "US"
    # each of these blocks is executed as a standalone query in the engine
    OPTIONAL
      { ?this_0  gn:countryCode  ?countryCode_1}
    OPTIONAL
      { ?this_0  gn:name  ?name_2}
    OPTIONAL
      { ?this_0 gj:cscId  ?cscId_3} 
  }

我正在MarkLogic 8.4中使用SPARQL REST端点.

I am using the SPARQL REST endpoint in MarkLogic 8.4.

更新:

我尝试使用 optimize=2 选项进行查询,但是并没有显着提高性能:

I have tried querying with the optimize=2 option, but it did not give me a significant performance improvement:

/v1/graphs/sparql?optimize=2

相关: 更新2:

即使我设置了必需的可选属性之一,查询仍然运行缓慢:

Even if I make one of the optional properties required, the query is still running slow:

WHERE
  {
        ?type (rdfs:subClassOf)* gj:Country .
        ?this_0  rdf:type        ?type ;
             gn:countryCode  "US"; gj:cscId ?cscId_3 ;
  }

我是否需要做一些特殊的事情来索引此gj:cscId属性?

Do I need to do something special to index this gj:cscId property?

更新3:

这是查询控制台中的个人资料信息.

Here is the profile information from the Query Console.

更新4:

这是诊断跟踪信息:

2017-04-27 13:30:17.238 Info: [Event:id=SPARQL Value Frequencies] sessionKey=13846462700334370907 namedGraphs=0 values=
2017-04-27 13:30:17.238 Info: <triple-value-statistics count="154569757" unique-subjects="25445373" unique-predicates="104" unique-objects="67520361" xmlns="cts:triple-value-statistics">
2017-04-27 13:30:17.238 Info:   <triple-value-entries>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="181">
2017-04-27 13:30:17.238 Info:       <triple-value>http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="2" unique-predicates="2" unique-objects="2"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="0" unique-subjects="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="179" unique-subjects="179" unique-predicates="4"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="15">
2017-04-27 13:30:17.238 Info:       <triple-value>http://www.w3.org/2000/01/rdf-schema#subClassOf</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="15" unique-subjects="15" unique-objects="5"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="8739716">
2017-04-27 13:30:17.238 Info:       <triple-value>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="8359510" unique-subjects="8341619" unique-objects="14"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="8697064">
2017-04-27 13:30:17.238 Info:       <triple-value>http://www.geonames.org/ontology#countryCode</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="2" unique-predicates="2" unique-objects="2"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="8323137" unique-subjects="8323137" unique-objects="517"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="1" unique-subjects="1" unique-predicates="1"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="2119305">
2017-04-27 13:30:17.238 Info:       <triple-value datatype="http://www.w3.org/2001/XMLSchema#string">US</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="0" unique-subjects="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="2061783" unique-subjects="2061783" unique-predicates="3"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="13946907">
2017-04-27 13:30:17.238 Info:       <triple-value>http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="3" unique-predicates="3" unique-objects="3"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="11739004" unique-subjects="11739004" unique-objects="11739004"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:   </triple-value-entries>
2017-04-27 13:30:17.238 Info: </triple-value-statistics>
2017-04-27 13:30:17.239 Info: [Event:id=SPARQL AST] sessionKey=13846462700334370907
2017-04-27 13:30:17.239 Info:   initialPlan=SPARQLModule[
2017-04-27 13:30:17.239 Info:   Prolog[]
2017-04-27 13:30:17.239 Info:   SPARQLSelect[SPARQLProject[order()
2017-04-27 13:30:17.239 Info:       GraphNode[Var type 0]
2017-04-27 13:30:17.239 Info:       GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info:       GraphNode[Var cscId_3 2]
2017-04-27 13:30:17.239 Info:       SPARQLLeftNestedLoopJoin[order() hash(1==1) scatter(1 = 1)
2017-04-27 13:30:17.239 Info:         SPARQLNestedLoopJoin[order() hash(1==1) scatter(1 = 1)
2017-04-27 13:30:17.239 Info:           SPARQLScatterJoin[order(0,1) hash(0==0) scatter(0 = 0)
2017-04-27 13:30:17.239 Info:             SPARQLZeroOrOne[
2017-04-27 13:30:17.239 Info:               GraphNode[Var type 0]
2017-04-27 13:30:17.239 Info:               GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.239 Info:               SPARQLScatterOneOrMore[
2017-04-27 13:30:17.239 Info:                 GraphNode[Var type 0]
2017-04-27 13:30:17.239 Info:                 GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.239 Info:                 GraphNode[Var ANON7634081659815295853 1]
2017-04-27 13:30:17.239 Info:                 GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.239 Info:                 TriplePattern[order(0,1) PSO
2017-04-27 13:30:17.239 Info:                   GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.239 Info:                   GraphNode[IRI <http://www.w3.org/2000/01/rdf-schema#subClassOf>]
2017-04-27 13:30:17.239 Info:                   GraphNode[Var ANON7634081659815295853 1]]]]
2017-04-27 13:30:17.239 Info:             TriplePattern[order(0,1) OPS
2017-04-27 13:30:17.239 Info:               GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info:               GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>]
2017-04-27 13:30:17.239 Info:               GraphNode[Var type 0]]]
2017-04-27 13:30:17.239 Info:           TriplePattern[order(1) SOP
2017-04-27 13:30:17.239 Info:             GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info:             GraphNode[IRI <http://www.geonames.org/ontology#countryCode>]
2017-04-27 13:30:17.239 Info:             GraphNode[Literal "US"]]]
2017-04-27 13:30:17.239 Info:         TriplePattern[order(1,2) PSO
2017-04-27 13:30:17.239 Info:           GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info:           GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId>]
2017-04-27 13:30:17.239 Info:           GraphNode[Var cscId_3 2]]]]]]
2017-04-27 13:30:17.239 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 optimize=1 r=3 t=1.28811 os=360 is=15 mutations=30 seed=7088858925989728751
2017-04-27 13:30:17.239 Info:   initialCost=(m:5.99223e+11,r:0,io:(52.9404/167736/1.17487e+09),cpu(1):(0/1.77017e+08/1.18652e+12),mem:8185,c:1.03266e+07,crd:[14,2.06178e+06,1.03266e+07])
2017-04-27 13:30:17.320 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=0
2017-04-27 13:30:17.320 Info:   cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.320 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=1
2017-04-27 13:30:17.320 Info:   cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.326 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=2
2017-04-27 13:30:17.326 Info:   cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.326 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907
2017-04-27 13:30:17.326 Info:   bestCost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.326 Info: [Event:id=SPARQL AST] sessionKey=13846462700334370907
2017-04-27 13:30:17.326 Info:   plan=SPARQLModule[
2017-04-27 13:30:17.326 Info:   Prolog[]
2017-04-27 13:30:17.326 Info:   SPARQLSelect[SPARQLProject[order(1,0)
2017-04-27 13:30:17.326 Info:       GraphNode[Var type 0]
2017-04-27 13:30:17.326 Info:       GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info:       GraphNode[Var cscId_3 2]
2017-04-27 13:30:17.326 Info:       SPARQLRightMergeJoin[order(1,0) hash(1==1) scatter()
2017-04-27 13:30:17.326 Info:         TriplePattern[order(1,2) PSO
2017-04-27 13:30:17.326 Info:           GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info:           GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId>]
2017-04-27 13:30:17.326 Info:           GraphNode[Var cscId_3 2]]
2017-04-27 13:30:17.326 Info:         SPARQLHashJoin[order(1,0) hash(0==0) scatter()
2017-04-27 13:30:17.326 Info:           SPARQLZeroOrOne[
2017-04-27 13:30:17.326 Info:             GraphNode[Var type 0]
2017-04-27 13:30:17.326 Info:             GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.326 Info:             SPARQLBloomOneOrMore[
2017-04-27 13:30:17.326 Info:               GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.326 Info:               GraphNode[Var ANON7634081659815295853 1]
2017-04-27 13:30:17.326 Info:               GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.326 Info:               GraphNode[Var type 0]
2017-04-27 13:30:17.326 Info:               TriplePattern[order(0,1) PSO
2017-04-27 13:30:17.326 Info:                 GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.326 Info:                 GraphNode[IRI <http://www.w3.org/2000/01/rdf-schema#subClassOf>]
2017-04-27 13:30:17.326 Info:                 GraphNode[Var ANON7634081659815295853 1]]]]
2017-04-27 13:30:17.326 Info:           SPARQLMergeJoin[order(1,0) hash(1==1) scatter()
2017-04-27 13:30:17.326 Info:             TriplePattern[order(1) OPS
2017-04-27 13:30:17.326 Info:               GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info:               GraphNode[IRI <http://www.geonames.org/ontology#countryCode>]
2017-04-27 13:30:17.326 Info:               GraphNode[Literal "US"]]
2017-04-27 13:30:17.326 Info:             TriplePattern[order(1,0) PSO
2017-04-27 13:30:17.326 Info:               GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info:               GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>]
2017-04-27 13:30:17.326 Info:               GraphNode[Var type 0]]]]]]]]

更新5:

在某些用例中,我发现我可以从查询中消除?type属性路径表达式.在这种情况下,性能提高了两个数量级:

In some use cases, I found that I could eliminate the ?type property path expression from the query. In one such case, performance improved by two orders of magnitude:

WHERE
  { 
    ?this_0  rdf:type        gj:Country ;
             gn:countryCode  "US"
    # each of these blocks is executed as a standalone query in the engine
    OPTIONAL
      { ?this_0  gn:countryCode  ?countryCode_1}
    OPTIONAL
      { ?this_0  gn:name  ?name_2}
    OPTIONAL
      { ?this_0 gj:cscId  ?cscId_3} 
  }

由于此解决方案更改了查询的输出,因此不能解决我们所有的用例.

Since this solution changes the output of the query, it doesn't solve all of our use cases.

似乎问题不在于OPTIONAL本身,而是与使查询计划程序混淆的属性路径表达式有关,因此,OPTIONAL块中的属性被独立查找(这是不起作用的).

It seems that the problem is not with the OPTIONALs per se, but has something to do with the property path expression confusing the query planner, so that the properties in the OPTIONAL blocks are looked up indepedently (which is not performant).

推荐答案

查询优化器依赖于使用统计信息来确定最佳操作顺序.通常,会使用限制性三元模式来使用散点联接来限制进一步的操作.

The query optimizer relies on using statistics to determine the best order for operations. Often there will be a restrictive triple pattern that can be used to restrict further operations using a scatter join.

就您而言,统计数据并未提供如此明显的限制性三重模式.通过在三元值统计信息输出中查看,字符串"US"作为对象出现2061783次-因此并没有严格的限制.

In your case, the statistics don't provide such an obvious restrictive triple pattern. You can see by looking in the triple value statistics output that the string "US" occurs 2061783 times as an object - so that's not terribly restrictive.

gj:Country IRI是限制性的(在对象位置是179倍),但是不幸的是,您需要在传递闭包运算符的右侧使用它.很难预测传递闭包运算符将返回多少结果,因为它很大程度上取决于实际数据.

The gj:Country IRI is restrictive (179 times in the object position), but unfortunately you need to use that on the right hand side of a transitive closure operator. It's very hard to predict how many results a transitive closure operator will return, as it depends a great deal on the actual data.

您会发现,使用如下所示的属性路径将使MarkLogic避免使用零或一运算符,这对性能的提升不大:

You'll find that using a property path like the one below will allow MarkLogic to avoid the zero-or-one operator, which may be a small performance boost:

?this_0 a/rdfs:subClassOf* gj:Country .

如果知道(例如)只有一个gj:Country(国家代码为"US")的国家/地区,则可以在查询的该部分添加一个限制,以向优化程序提示如何处理查询,即:

Going further if you know that (for instance) there is only one gj:Country with the country code of "US", you could add a limit to that part of the query to give the optimizer hints of how to treat the query, ie:

select * {
  {
    select * {
      ?this_0 a/rdfs:subClassOf* gj:Country .
      ?this_0  gn:countryCode  'US' .
    } limit 1
  }
  OPTIONAL { ?this_0 gj:cscId  ?cscId_3 } 
}

这篇关于如何优化返回可选属性的SPARQL查询?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆