How to optimize Neo4j Cypher queries with multiple node matches (Cartesian Product)


Problem description

I am currently trying to merge three datasets for analysis purposes. I am using certain common fields to establish the connections between the datasets. In order to create the connections I have tried using the following type of query:

MATCH (p1:Person),(p2:Person)
WHERE p1.email = p2.email AND p1.name = p2.name AND p1 <> p2 
CREATE UNIQUE (p1)-[:IS]-(p2);

Which can be similarly written as:

MATCH (p1:Person),(p2:Person {name:p1.name, email:p1.email})
WHERE p1 <> p2 
CREATE UNIQUE (p1)-[:IS]-(p2);

Needless to say, this is a very slow query on a database with about 100,000 Person nodes, especially given that Neo4j does not process a single query in parallel.

Now, my question is whether there is a better way to run such queries in Neo4j. I have at least eight CPU cores to dedicate to Neo4j, as long as separate threads don't block one another by locking resources that the others need.

The issue is that I don't know how Neo4j builds its Cypher execution plans. For instance, let's say I run the following test query:

MATCH (p1:Person),(p2:Person {name:p1.name, email:p1.email})
WHERE p1 <> p2 
RETURN p1, p2
LIMIT 100;

Despite the LIMIT clause, Neo4j still takes a considerable amount of time to return the results, which makes me wonder whether, even for such a limited query, Neo4j produces the whole Cartesian product before applying the LIMIT clause.

I appreciate any help, whether it addresses this specific issue or just gives me an understanding of how Neo4j generally builds Cypher execution plans (and thus how to optimize queries in general). Can legacy Lucene indexes be of any help here?

Solution

You can combine a label scan for p1 with an index lookup plus a comparison for p2, as in this shell session:

cypher 2.1 
foreach (i in range(1,100000) | 
  create (:Person {name:"John Doe"+str(i % 10000),
                   email:"john"+str(i % 10000)+"@doe.com"}));
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 100000
Properties set: 200000
Labels added: 100000
6543 ms
neo4j-sh (?)$ CREATE INDEX ON :Person(name);
+-------------------+
| No data returned. |
+-------------------+
Indexes added: 1
28 ms

neo4j-sh (?)$ schema
Indexes
  ON :Person(name)  ONLINE

neo4j-sh (?)$ 
match (p1:Person) with p1 
match (p2:Person {name:p1.name}) using index p2:Person(name) 
where p1<>p2 AND p2.email = p1.email 
return count(*);
+----------+
| count(*) |
+----------+
| 900000   |
+----------+
1 row
8206 ms
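
To see why this shape of query avoids the Cartesian product, and where the count of 900000 comes from, here is a minimal Python sketch of the same idea. It is only a model under stated assumptions, not Neo4j's actual implementation: a plain dict stands in for the :Person(name) index, and the helper names (make_people, build_name_index, matching_pairs) are hypothetical. The data mirrors the foreach statement above, so each of the 10,000 distinct names is shared by exactly 10 nodes.

```python
from collections import defaultdict


def make_people(n=100_000, distinct=10_000):
    """Mirror the foreach in the transcript: i in 1..n, suffix i % distinct."""
    return [
        {
            "id": i,
            "name": f"John Doe{i % distinct}",
            "email": f"john{i % distinct}@doe.com",
        }
        for i in range(1, n + 1)
    ]


def build_name_index(people):
    """Stand-in for the :Person(name) index (assumption: a plain dict)."""
    index = defaultdict(list)
    for person in people:
        index[person["name"]].append(person)
    return index


def matching_pairs(people, index):
    """Scan every p1 once, then look p2 up by name instead of scanning all nodes."""
    for p1 in people:                      # label scan over p1
        for p2 in index[p1["name"]]:       # index lookup for p2
            # Remaining predicates: p1 <> p2 and equal email.
            if p2 is not p1 and p2["email"] == p1["email"]:
                yield p1["id"], p2["id"]


people = make_people()
index = build_name_index(people)

pair_count = sum(1 for _ in matching_pairs(people, index))
candidates_checked = sum(len(index[p["name"]]) for p in people)
cartesian_size = len(people) ** 2

print(pair_count)          # 900000: each node matches the 9 others sharing its name
print(candidates_checked)  # 1000000 candidate pairs via the index
print(cartesian_size)      # 10000000000 candidate pairs in the full Cartesian product
```

In this model the index cuts the candidate pairs from 10^10 to 10^6, which is the kind of saving the using index hint is meant to achieve.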

neo4j-sh (?)$ 
match (p1:Person) with p1 
match (p2:Person {name:p1.name}) using index p2:Person(name) 
where p1<>p2 AND p2.email = p1.email
merge (p1)-[:IS]-(p2) 
return count(*);

+----------+
| count(*) |
+----------+
| 900000   |
+----------+
1 row
Relationships created: 450000
40256 ms
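
The second run returns 900000 rows but creates only 450000 relationships. That is consistent with merge on an undirected pattern: the row (a, b) creates the relationship and the later row (b, a) finds it already there. The following Python sketch models that behaviour for the same test data. It is a simplified model, not Neo4j code, and the set of sorted id pairs standing in for the stored relationships is an assumption made for illustration.

```python
from itertools import permutations

N, DISTINCT = 100_000, 10_000

# Group node ids by i % DISTINCT, which determines both name and email
# in the test data created by the foreach statement.
groups = {}
for i in range(1, N + 1):
    groups.setdefault(i % DISTINCT, []).append(i)

relationships = set()  # stand-in for stored :IS relationships (direction ignored)
rows = 0

for ids in groups.values():
    # Every ordered pair (p1, p2) with p1 <> p2 is one result row.
    for a, b in permutations(ids, 2):
        rows += 1
        key = (min(a, b), max(a, b))
        # Modelled merge on an undirected pattern: create only if no
        # relationship exists between the two nodes in either direction.
        if key not in relationships:
            relationships.add(key)

print(rows)                # 900000 result rows
print(len(relationships))  # 450000 relationships created
```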
