Neo4j Data Import Slowness
Problem Description
I have to load around 5M records into the Neo4j DB, so I broke the Excel data (which is in tabular format) into chunks of 100K rows. I am using Cypher Shell for the import, but it has been running for more than 8 hours and is still stuck on the first chunk.
I am using:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS from 'file://aa.xlsx' as row
MERGE (p1:L1 {Name: row.sl1})
MERGE (p2:L2 {Name: row.sl2})
MERGE (p3:L3 {Name: row.sl3, Path:row.sl3a})
MERGE (p4:L4 {Name: row.sl4})
MERGE (p5:L4 {Name: row.tl1})
MERGE (p6:L3 {Name: row.tl2})
MERGE (p7:L2 {Name: row.tl3, Path:row.tl3a})
MERGE (p8:L1 {Name: row.tl4})
MERGE (p1)-[:s]->(p2)-[:s]->(p3)-[:s]->(p4)-[:it]->(p5)-[:t]->(p6)-[:t]->(p7)-[:t]->(p8)
Can anyone suggest changes, or an alternate method, to load the data in a faster way?
Data in Excel format:
Recommended Answer
For importing a large amount of data, you should consider using the import tool instead of Cypher's LOAD CSV clause. Note that the import tool can only import into a previously unused database.
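As a rough sketch (the exact flags differ between Neo4j versions, and the file names and CSV layout below are placeholders, not taken from the question), neo4j-admin import is run from the command line against pre-prepared CSV files:

```shell
# Hypothetical invocation of the offline import tool (Neo4j 4.x style).
# nodes.csv needs an ID/label header and rels.csv a start/end/type header;
# both file names here are made up for illustration.
bin/neo4j-admin import \
    --database=neo4j \
    --nodes=nodes.csv \
    --relationships=rels.csv
```

Because the tool writes the store files directly instead of running transactions, it is typically orders of magnitude faster than LOAD CSV, but it must target a database that has never been started.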
If you still want to use LOAD CSV, you need to make some changes.
You are using MERGE improperly, and are probably generating many duplicate nodes and relationships as a result. You may find this answer instructive.
A MERGE clause's entire pattern will be created if anything in the pattern does not already exist.
So, your last MERGE pattern, with its seven relationships, is especially dangerous. It should be split into seven MERGE clauses with individual relationships.
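For example, keeping the node MERGE clauses from the query above, the final relationship pattern could be split along these lines:

```cypher
MERGE (p1)-[:s]->(p2)
MERGE (p2)-[:s]->(p3)
MERGE (p3)-[:s]->(p4)
MERGE (p4)-[:it]->(p5)
MERGE (p5)-[:t]->(p6)
MERGE (p6)-[:t]->(p7)
MERGE (p7)-[:t]->(p8)
```

Since p1 through p8 are already bound by the earlier MERGE clauses, each of these only has to match or create a single relationship, so no duplicate nodes can be produced.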
Also, a MERGE pattern that specifies multiple properties is likely bad as well. For example, if all L3 nodes have a unique Name value, then it would be safer to replace this:
MERGE (p3:L3 {Name: row.sl3, Path:row.sl3a})
with something like this:
MERGE (p3:L3 {Name: row.sl3})
ON CREATE SET p3.Path = row.sl3a
In the above snippet, if the node already exists but row.sl3a is different than the existing Path value, then no additional node is created. In addition, since the node already existed, the ON CREATE option does not execute its SET clause, leaving the original Path value unchanged. You could also choose to use ON MATCH instead, or even just call SET directly if you want to set the value no matter what.
To avoid having to scan through all the nodes with a given label every time MERGE needs to find an existing node, you should create an index or uniqueness constraint for every label/property pair of every node that you are MERGEing:
:L1(Name)
:L2(Name)
:L3(Name)
:L4(Name)
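For example, using the Neo4j 3.x syntax that matches the notation above (newer versions use CREATE INDEX ... FOR (n:L1) ON (n.Name) instead):

```cypher
CREATE INDEX ON :L1(Name);
CREATE INDEX ON :L2(Name);
CREATE INDEX ON :L3(Name);
CREATE INDEX ON :L4(Name);
```

If the Name values are known to be unique per label, a uniqueness constraint (e.g. CREATE CONSTRAINT ON (n:L3) ASSERT n.Name IS UNIQUE in 3.x) gives you the index plus a duplicate-prevention guarantee.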