What is the most efficient way to insert nodes into a neo4j database using cypher


Problem description


I'm trying to insert a large number of nodes (~500,000) into a (non-embedded) neo4j database by executing cypher commands using the py2neo python module (py2neo.cypher.execute). Eventually I need to remove the dependence on py2neo, but I'm using it at the moment until I learn more about cypher and neo4j.

I have two node types A and B, and the vast majority of nodes are of type A. There are two possible relationships r1 and r2, such that A-[r1]-A and A-[r2]-B. Each node of type A will have 0 - 100 r1 relationships, and each node of type B will have 1 - 5000 r2 relationships.

At the moment I am inserting nodes by building up large CREATE statements. For example I might have a statement

CREATE (:A {uid:1, attr:5})-[:r1]-(:A {uid:2, attr:5})-[:r1]-...

where ... might be another 5000 or so nodes and relationships forming a linear chain in the graph. This works okay, but it's pretty slow. I'm also indexing these nodes using

CREATE INDEX ON :A(uid)

After I've added all the type A nodes, I add the type B nodes using CREATE statements again. Finally, I am trying to add the r2 relationships using a statement like

MATCH (c:B), (m:A) WHERE c.uid=1 AND (m.uid=2 OR m.uid=5 OR ...)
CREATE (m)-[:r2]->(c)

where ... could represent a few thousand OR clauses. This seems really slow, adding only a few relationships per second.

So, is there a better way to do this? Am I completely off track here? I looked at this question, but it doesn't explain how to use cypher to efficiently load the nodes. Everything else I've looked at seems to use Java, without showing the actual cypher queries that could be used.

Solution

Don't create the index until the end (in 2.0). It will slow down node creation.

Are you using parameters in your Cypher?

I imagine you're losing a lot of cypher parsing time unless your cypher is exactly the same each time with parameters. If you can model it to be that, you'll see a marked performance increase.
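As a minimal sketch of what "exactly the same each time with parameters" means, here is one way to build the request body for the 2.0-era legacy `/db/data/cypher` REST endpoint (the endpoint path and the `{"query": ..., "params": ...}` body shape are taken from that era's HTTP API; the `cypher_request_body` helper name is hypothetical):

```python
import json

# A single, fixed query string: the server can parse it once and reuse the
# plan, because only the parameter map varies per node ({param} is the
# 2.0-era Cypher parameter syntax).
CREATE_A = "CREATE (:A {uid: {uid}, attr: {attr}})"

def cypher_request_body(uid, attr):
    # JSON body for a POST to the legacy /db/data/cypher endpoint.
    # Note the query text is identical for every node; only params change.
    return json.dumps({"query": CREATE_A, "params": {"uid": uid, "attr": attr}})
```

The point is that every request shares one query string, so cypher parsing cost is paid once rather than per node.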

You're already sending fairly hefty chunks in your cypher request, but the batch request API will let you send more than one in one REST request, which might be faster (try it!).
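A sketch of what a batch payload could look like, assuming the 2.0-era `/db/data/batch` REST endpoint and its job format (`method`/`to`/`id`/`body`); the `batch_jobs` helper name is hypothetical:

```python
def batch_jobs(rows):
    # One job per Cypher call; the whole list is sent in a single POST to
    # /db/data/batch, cutting per-request HTTP overhead.
    query = "CREATE (:A {uid: {uid}, attr: {attr}})"
    return [
        {"method": "POST", "to": "/cypher", "id": i,
         "body": {"query": query, "params": row}}
        for i, row in enumerate(rows)
    ]
```

Each job still uses the same parameterized query, so this combines the parameter reuse above with fewer round trips.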

Finally, if this is a one time import, you might consider using the batch-import tool--it can burn through 500K nodes in a few minutes even on bad hardware... then you can upgrade the database files (I don't think it can create 2.0 files yet, but that may be coming shortly if not), and create your labels/index via Cypher.

Update: I just noticed your MATCH statement at the end. You shouldn't do it this way--do one relationship at a time instead of using the OR for the ids. This will probably help a lot--and make sure you use parameters for the uids. Cypher 2.0 doesn't seem to be able to do index lookups with OR, even when you use an index hint. Maybe this will come later.

Update Dec 2013: 2.0 has the Cypher transactional endpoint, which I've seen great throughput improvements on. I've been able to send 20-30k Cypher statements/second, using "exec" sizes of 100-200 statements, and transaction sizes of 1000-10000 statements total. Very effective for speeding up loading over Cypher.
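For reference, a sketch of the request body the 2.0 transactional endpoint expects (POST to `/db/data/transaction/commit`, with a `statements` list of `statement`/`parameters` pairs; the `tx_commit_payload` helper name is hypothetical):

```python
def tx_commit_payload(query, param_maps):
    # JSON body for POST /db/data/transaction/commit: many parameterized
    # executions of one statement, grouped into a single transaction.
    return {"statements": [{"statement": query, "parameters": p}
                           for p in param_maps]}
```

Chunking `param_maps` into groups of a few hundred statements per request, with a few thousand per transaction, matches the sizes described above.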
