SPARQL: speed up a federated query

Problem description

I have my own dataset and I want to perform a federated query in SPARQL. Here is the query:

PREFIX : <http://myURIsNamespace#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select * where { 
    ?bioentity :hasMutatedVersionOf ?gene .
    ?gene :partOf wd:Q430258 .

    SERVICE <https://query.wikidata.org/sparql> { 
        ?gene p:P644 ?statement; 
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .

        ?gene p:P645 ?statement2. 
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start)>21000000 && xsd:integer(?start)<30000000)  
    }

}

I run the query via the GraphDB SPARQL interface, but it is really slow: it takes more than a minute to return 8 records. If I split the query into two parts, each part is ridiculously fast.

Query #1

select * where { 
    ?bioentity :hasMutatedVersionOf ?gene .
    ?gene :partOf wd:Q430258 .          

}

56 records in 0.1s

Query #2

select * where { 
     SERVICE <https://query.wikidata.org/sparql> { 
        ?gene p:P644 ?statement; 
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .

        ?gene p:P645 ?statement2. 
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start)>21000000 && xsd:integer(?start)<30000000)  
    }       

}

158 records in 0.5s

Why is the federation so slow? Is there a way to optimize the performance?

Recommended answer

Short answer

  1. Just place your SERVICE part first, i.e. before ?bioentity :hasMutatedVersionOf ?gene .

  2. Read a good article on the topic (e.g. chapter 5 of this book).
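Applied to the query from the question, the first suggestion might look like the sketch below. It reuses the prefixes and patterns from the question unchanged and only moves the SERVICE block to the top. It assumes that the engine evaluates the group in the written order, which the measurements later in this answer suggest for GraphDB but which is not guaranteed for every engine.

```sparql
PREFIX : <http://myURIsNamespace#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT * WHERE {
    # Remote part first: ideally sent to Wikidata as one request.
    SERVICE <https://query.wikidata.org/sparql> {
        ?gene p:P644 ?statement ;
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .

        ?gene p:P645 ?statement2 .
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start) > 21000000 && xsd:integer(?start) < 30000000)
    }

    # Local part second: joined against the remote results.
    ?bioentity :hasMutatedVersionOf ?gene .
    ?gene :partOf wd:Q430258 .
}
```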

A relevant quote from the article mentioned above:

3.3.2 Query Optimization and Execution

The execution order of query operators significantly influences the overall query evaluation cost. Besides the important query execution time there are also other aspects in the federated scenario which are relevant for the query optimization:

Minimizing communication cost. The number of contacted data sources directly influences the performance of the query execution due to the communication overhead. However, reducing the number of involved data source trades off against completeness of results.

Optimizing execution localization. The standard query interfaces of linked data sources are generally only capable of answering queries on their provided data. Therefore, joins with other data results usually need to be done at the query issuer. If possible at all, a better strategy will move parts of the result merging operations to the data sources, especially if they can be executed in parallel.

Streaming results. Retrieving a complete result when evaluating a query on a large dataset may take a while even with a well optimized execution strategy. Thus one can return results as soon as they become available, which can be optimized by trying to return relevant results first.

Long answer

Example data

PREFIX : <http://myURIsNamespace#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
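PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>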

INSERT { ?gene rdf:type owl:Thing } 
WHERE {
    SERVICE <https://query.wikidata.org/sparql> { 
        ?gene p:P644 ?statement; 
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .
        ?gene p:P645 ?statement2. 
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start)>26000000 && xsd:integer(?start)<30000000)  
    }
}

The total number of inserted triples is 79. Note that 26000000 is used here instead of 21000000.

Query 1

PREFIX : <http://myURIsNamespace#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
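PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>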

SELECT * WHERE {
    ?gene rdf:type owl:Thing .
    SERVICE <https://query.wikidata.org/sparql> { 
        ?gene p:P644 ?statement; 
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .
        ?gene p:P645 ?statement2. 
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start)>20000000 && xsd:integer(?start)<30000000)  
    }
}

Query 2

PREFIX : <http://myURIsNamespace#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
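PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>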

SELECT * WHERE {
    SERVICE <https://query.wikidata.org/sparql> { 
        ?gene p:P644 ?statement; 
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .
        ?gene p:P645 ?statement2. 
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start)>20000000 && xsd:integer(?start)<30000000)  
    }
    ?gene rdf:type owl:Thing
}

Performance

+------------+---------+---------+
|            | Query 1 | Query 2 |
+------------+---------+---------+
| GraphDB    | 30 sec  |  1 sec  |
| Blazegraph |  1 sec  |  1 sec  |
+------------+---------+---------+

GraphDB behaviour

Executing Query 1, GraphDB performs 79 distinct GET requests to Wikidata¹, one per ?gene binding.

Each request carries a query of this kind:

SELECT ?start ?statement ?end ?statement2 WHERE {
        <http://www.wikidata.org/entity/Q18031286> p:P644 ?statement; 
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .
        <http://www.wikidata.org/entity/Q18031286> p:P645 ?statement2. 
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start)>20000000 && xsd:integer(?start)<30000000)
}

Interestingly, on another machine, GraphDB performs GET requests of another kind:

GET /sparql?queryLn="Sparql"&query=<original_query_service_part>&$gene=<http://www.wikidata.org/entity/Q18031286>

In this request, the Sesame protocol is used; such bindings in the URL are not part of the SPARQL 1.1 Protocol.

Perhaps the exact kind of request depends on the value of the internal reuse.vars.in.subselects parameter, whose default value is presumably different on Windows and on Linux.

Blazegraph behaviour

Executing Query 1, Blazegraph performs a single POST request to Wikidata²:

SELECT  ?gene ?statement ?start ?statement2 ?end
WHERE {
        ?gene p:P644 ?statement; 
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .
        ?gene p:P645 ?statement2. 
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start)>20000000 && xsd:integer(?start)<30000000)  

}
VALUES ( ?gene) {
( wd:Q14908148 ) ( wd:Q15320063 ) ( wd:Q17861651 ) ( wd:Q17917753 ) ( wd:Q17928333 )
( wd:Q18024923 ) ( wd:Q18026347 ) ( wd:Q18030710 ) ( wd:Q18031220 ) ( wd:Q18031457 )
( wd:Q18031551 ) ( wd:Q18031832 ) ( wd:Q18032918 ) ( wd:Q18033094 ) ( wd:Q18033798 )
( wd:Q18034311 ) ( wd:Q18035006 ) ( wd:Q18035085 ) ( wd:Q18035609 ) ( wd:Q18036516 )
( wd:Q18036676 ) ( wd:Q18037580 ) ( wd:Q18038385 ) ( wd:Q18038459 ) ( wd:Q18038737 )
( wd:Q18038763 ) ( wd:Q18039997 ) ( wd:Q18040291 ) ( wd:Q18041261 ) ( wd:Q18041415 )
( wd:Q18041558 ) ( wd:Q18045881 ) ( wd:Q18047232 ) ( wd:Q18047373 ) ( wd:Q18047918 )
( wd:Q18047966 ) ( wd:Q18048744 ) ( wd:Q18049145 ) ( wd:Q18049164 ) ( wd:Q18053139 )
( wd:Q18056540 ) ( wd:Q18057411 ) ( wd:Q18060804 ) ( wd:Q18060856 ) ( wd:Q18060876 )
( wd:Q18060905 ) ( wd:Q18060958 ) ( wd:Q20773708 ) ( wd:Q15312971 ) ( wd:Q17860819 )
( wd:Q17917713 ) ( wd:Q18026310 ) ( wd:Q18027015 ) ( wd:Q18031286 ) ( wd:Q18032599 )
( wd:Q18032797 ) ( wd:Q18035169 ) ( wd:Q18035627 ) ( wd:Q18039938 ) ( wd:Q18041207 )
( wd:Q18041512 ) ( wd:Q18041930 ) ( wd:Q18045491 ) ( wd:Q18045762 ) ( wd:Q18046301 )
( wd:Q18046472 ) ( wd:Q18046487 ) ( wd:Q18047149 ) ( wd:Q18047491 ) ( wd:Q18047719 )
( wd:Q18048527 ) ( wd:Q18049774 ) ( wd:Q18051886 ) ( wd:Q18053875 ) ( wd:Q18056212 )
( wd:Q18056538 ) ( wd:Q18065866 ) ( wd:Q20766978 ) ( wd:Q20781543 )
} 
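On an engine that sends one request per binding, a similar batching effect might be approximated by hand by supplying the candidate genes in a single VALUES block inside the SERVICE clause. The sketch below assumes the gene IRIs were collected beforehand (for example, by running the local part of the query on its own); only three IRIs from the list above are shown for brevity. Whether a given engine forwards the VALUES block to the endpoint unchanged is an assumption worth verifying.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT * WHERE {
    SERVICE <https://query.wikidata.org/sparql> {
        # Candidate genes travel in one batch instead of one request per gene.
        VALUES ?gene { wd:Q14908148 wd:Q15320063 wd:Q17861651 }

        ?gene p:P644 ?statement ;
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .
        ?gene p:P645 ?statement2 .
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start) > 20000000 && xsd:integer(?start) < 30000000)
    }
}
```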

Conclusion

With federated queries, it is hard to create an effective execution plan, since the selectivity of remote patterns is unknown.

In your particular case, it should not matter much whether results are joined locally or remotely, because both the local and the remote result sets are small. However, in GraphDB, joining results remotely is less efficient, because GraphDB does not reduce communication costs.

¹ To capture the requests for screenshots, <http://query.wikidata.org/sparql> was used instead of <https://query.wikidata.org/sparql>.

² In Blazegraph, one might write hint:Query hint:optimizer "None" to ensure sequential evaluation.
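As a sketch of how the hint from footnote ² could be placed, Query 1 might be written as follows for Blazegraph. The hint: prefix is declared explicitly here as an assumption about the setup (Blazegraph normally predeclares it), and the hint is specific to Blazegraph, so other engines are not expected to honour it.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX hint: <http://www.bigdata.com/queryHints#>

SELECT * WHERE {
    # Blazegraph-specific: turn off join reordering so patterns run as written.
    hint:Query hint:optimizer "None" .

    ?gene rdf:type owl:Thing .
    SERVICE <https://query.wikidata.org/sparql> {
        ?gene p:P644 ?statement ;
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .
        ?gene p:P645 ?statement2 .
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start) > 20000000 && xsd:integer(?start) < 30000000)
    }
}
```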
