Why does neo4j warn: "This query builds a cartesian product between disconnected patterns"?

Question

I'm defining the relationship between two entities, Gene and Chromosome, in what I think is the simple and normal way, after importing the data from CSV:

MATCH (g:Gene),(c:Chromosome)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);

Yet, when I do so, neo4j (browser UI) complains:

This query builds a cartesian product between disconnected patterns. If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (c)).

I don't see what the issue is. chromosomeID is a very straightforward foreign key.

Recommended answer

The browser is telling you that:

  1. It is handling your query by comparing every Gene instance with every Chromosome instance. If your DB has G genes and C chromosomes, then the complexity of the query is O(GC). For instance, if we are working with the human genome, there are 46 chromosomes and maybe 25,000 genes, so the DB would have to do 46 x 25,000 = 1,150,000 comparisons.
  2. You might be able to reduce the complexity (and improve performance) by altering your query. For example, if we created an index on :Gene(chromosomeID), and altered the query so that we initially matched just on the nodes with the smallest cardinality (the 46 chromosomes), we would only do O(G) (or 25,000) "comparisons", and those comparisons would actually be quick index lookups! This approach should be much faster.
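
For reference, here is a minimal sketch of creating that index. The index name gene_chromosome_id is made up for illustration, and the exact syntax depends on your neo4j version, so check the manual for the release you are running:

```cypher
// Neo4j 4.x and later: a named index on the chromosomeID property of :Gene nodes.
// (gene_chromosome_id is a hypothetical name; pick your own.)
CREATE INDEX gene_chromosome_id FOR (g:Gene) ON (g.chromosomeID);

// Older releases (3.x and earlier) used this form instead:
// CREATE INDEX ON :Gene(chromosomeID);
```

The index is populated in the background, so after a large import it may take a moment before the planner can use it.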

Once we have created the index, we can use this query:

MATCH (c:Chromosome)
WITH c
MATCH (g:Gene) 
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);

It uses a WITH clause to force the first MATCH clause to execute first, avoiding the cartesian product. The second MATCH (and WHERE) clause uses the results of the first MATCH clause and the index to quickly get the exact genes that belong to each chromosome.

[UPDATE]

The WITH clause was helpful when this answer was originally written. The Cypher planner in newer versions of neo4j (like 4.0.3) now generates the same plan even if the WITH is omitted, without creating a cartesian product. You can always PROFILE both versions of your query to see the effect with and without the WITH.
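
As a rough sketch of that check, this is the variant without the WITH, prefixed with EXPLAIN rather than PROFILE. Keep in mind that PROFILE actually runs the query (and so would create the relationships), while EXPLAIN only shows the plan:

```cypher
// Show the plan for the query with the WITH omitted, without executing it.
// Swap EXPLAIN for PROFILE to run it and see actual row counts and db hits.
EXPLAIN
MATCH (c:Chromosome)
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
```

If the plan contains a CartesianProduct operator, the cross product is still being built. With the index in place you would hope to see an index seek on :Gene(chromosomeID) instead, though the exact operators vary by version.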
