Optimizing high volume batch inserts into Neo4j using REST

Problem description

I need to insert a huge amount of nodes with relationships between them into Neo4j via REST API's Batch endpoint, approx 5k records/s (still increasing).

This will be continuous insertion, 24x7. Each record may require creating one node only, but others may require two nodes and one relationship to be created.

Can I improve the performance of the inserts by changing my procedure or modifying the settings of Neo4j?

What I have done so far:

1. I have been testing with Neo4j for a while, but I cannot get the performance I need.

Test server box: 24 cores + 32GB RAM

Neo4j 2.0.0-M06 installed as a standalone service.

Running my Java application on the same server. (Neo4j and the Java app will need to run on their own servers in the future, so embedded mode cannot be used.)

REST API endpoint: /db/data/batch (target: /cypher)

Using schema indexes, constraints, MERGE, CREATE UNIQUE.

2. My schema:

neo4j-sh (0)$ schema
==> Indexes
==>   ON :REPLY(created_at)   ONLINE                             
==>   ON :REPLY(ids)          ONLINE (for uniqueness constraint) 
==>   ON :REPOST(created_at) ONLINE                             
==>   ON :REPOST(ids)        ONLINE (for uniqueness constraint) 
==>   ON :Post(userId)      ONLINE                             
==>   ON :Post(postId)    ONLINE (for uniqueness constraint) 
==> 
==> Constraints
==>   ON (post:Post) ASSERT post.postId IS UNIQUE
==>   ON (repost:REPOST) ASSERT repost.ids IS UNIQUE
==>   ON (reply:REPLY) ASSERT reply.ids IS UNIQUE

3. My Cypher queries and JSON requests

3.1. When a record requires only a single node to be created, the job description looks like this:

{"method" : "POST","to" : "/cypher","body" : {"query" : "MERGE (child:Post {postId:1001, userId:901})"}}

3.2. When a record requires two nodes and one relationship to be created, the job description looks like this:

{"method" : "POST","to" : "/cypher","body" : {"query" : "MERGE (parent:Post {postId:1002, userId:902}) MERGE (child:Post {postId:1003, userId:903}) CREATE UNIQUE parent-[relationship:REPOST {ids:'1002_1003', created_at:'Wed Nov 06 14:06:56 AST 2013' }]->child"}}

3.3. I normally send 100 job descriptions (a mix of 3.1 and 3.2) per batch, which takes about 150-250 ms to complete.
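For illustration, a minimal sketch of how one such mixed batch payload can be assembled (the class name and hard-coded values are just this example's; the /db/data/batch endpoint takes a single JSON array of job descriptions per request, and the HTTP client code is left out):

public class BatchPayload {
    public static void main(String[] args) {
        // One job description per record; ~100 of these go into the single JSON
        // array that is POSTed to http://localhost:7474/db/data/batch
        // with the header Content-Type: application/json.
        String job1 = "{\"method\":\"POST\",\"to\":\"/cypher\",\"id\":0,"
                + "\"body\":{\"query\":\"MERGE (child:Post {postId:1001, userId:901})\"}}";
        String job2 = "{\"method\":\"POST\",\"to\":\"/cypher\",\"id\":1,"
                + "\"body\":{\"query\":\"MERGE (parent:Post {postId:1002, userId:902}) "
                + "MERGE (child:Post {postId:1003, userId:903}) "
                + "CREATE UNIQUE parent-[:REPOST {ids:'1002_1003'}]->child\"}}";
        // The optional "id" fields only correlate each job with its result in the response.
        String payload = "[" + job1 + "," + job2 + "]";
        System.out.println(payload);
    }
}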

4. Performance issues

4.1. Concurrency:

/db/data/batch (target: /cypher) does not seem to be thread safe; tested with two or more concurrent threads, it brought the Neo4j server down within seconds to minutes.

4.2. MERGE with constraints does not always work.

When creating two nodes and one relationship with a single query (as in 3.2 above), it sometimes works like a charm, but it sometimes fails with a CypherExecutionException saying that one of the nodes xxxx already exists with label aaaa and property "bbbbb"=[ccccc]. As I understand it, MERGE is not supposed to throw an exception, but to return the node if it already exists.

As a result of the exception, the whole batch fails and rolls back, which hurts my insert rate.

I have opened a GitHub issue for this: https://github.com/neo4j/neo4j/issues/1428

4.3. CREATE UNIQUE with constraints doesn't always work for relationship creation.

This is mentioned in the same GitHub issue too.

4.4. Performance:

Actually, before using batch with Cypher, I tried the legacy indexing with get_or_create (/db/data/index/node/Post?uniqueness=get_or_create and /db/data/index/relationship/XXXXX?uniqueness=get_or_create).

Because of the nature of those legacy index endpoints (they return the location of the data in the index instead of the location of the data in the actual data store), I could not use them within a batch (I needed to refer to nodes created earlier in the same batch).

I know I could enable auto_indexing and deal with the data store directly instead of the legacy index, but from 2.0.0 onwards schema indexes are recommended over legacy indexes, so I decided to switch to the batch + Cypher + schema index approach.

HOWEVER, with batch + Cypher I can only get an insert rate of about 200 job descriptions per second. It would have been much higher if MERGE with constraints always worked, say about 600-800/s, but that is still much lower than 5k/s. I also tried schema indexes without any constraints, but that ended up with even lower insert performance.

Recommended answer

With 2.0 I would use the transactional endpoint to create your statements in batches, e.g. 100 or 1000 per HTTP request and about 30k-50k per transaction (until you commit).

See this for the format of the new streaming, transactional endpoint:

http://docs.neo4j.org/chunked/milestone/rest-api-transactional.html
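Roughly, driving that endpoint from Java looks like the sketch below (the transaction id, batch contents and URIs are placeholders; response parsing and error handling are omitted):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class TransactionalInsert {

    // POST one JSON payload of Cypher statements to the given URI and return the status code.
    static int post(String uri, String statementsJson) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(uri).openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "application/json");
        con.setRequestProperty("Accept", "application/json");
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write(statementsJson.getBytes("UTF-8"));
        }
        return con.getResponseCode();
    }

    public static void main(String[] args) throws Exception {
        // 1. Open a transaction with the first batch (100-1000 statements per request).
        String batch = "{\"statements\":[{"
                + "\"statement\":\"MERGE (p:Post {postId:{postId}, userId:{userId}})\","
                + "\"parameters\":{\"postId\":1001,\"userId\":901}}]}";
        post("http://localhost:7474/db/data/transaction", batch);

        // 2. Keep appending further batches to the open transaction; its URI
        //    (e.g. /db/data/transaction/42) comes back in the first response's
        //    Location header and "commit" field (parsing omitted here).
        // post("http://localhost:7474/db/data/transaction/42", nextBatch);

        // 3. After roughly 30k-50k statements, commit and open a new transaction.
        // post("http://localhost:7474/db/data/transaction/42/commit", "{\"statements\":[]}");
    }
}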

Also, for such a high-performance, continuous insertion endpoint I heartily recommend writing a server extension that runs against the embedded API and can easily insert 10k or more nodes and relationships per second; see here for the documentation:

http://docs.neo4j.org/chunked/milestone/server-unmanaged-extensions.html

For pure inserts you don't need Cypher. And for concurrency, just take a lock on a well-known node (per subgraph that you are inserting) so that concurrent inserts are not an issue; you can do that with tx.acquireWriteLock() or by removing a non-existent property from a node (REMOVE n.__lock__).
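A rough sketch of such an unmanaged extension (the package, resource path and the get-or-create helper are made up for this example; it locks the parent node, whereas a real setup might lock a dedicated, pre-existing node per subgraph as described above, and it assumes the 2.0 findNodesByLabelAndProperty lookup backed by the schema index on :Post(postId)):

package org.example.unmanaged;

import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.Response;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.ResourceIterator;
import org.neo4j.graphdb.Transaction;

@Path("/repost")
public class RepostResource {

    private static final Label POST_LABEL = DynamicLabel.label("Post");

    private final GraphDatabaseService db;

    public RepostResource(@Context GraphDatabaseService db) {
        this.db = db;
    }

    // POST /example/repost?parentId=1002&childId=1003 gets or creates both Post
    // nodes and links them with a REPOST relationship, all via the embedded API.
    @POST
    public Response createRepost(@QueryParam("parentId") long parentId,
                                 @QueryParam("childId") long childId) {
        try (Transaction tx = db.beginTx()) {
            Node parent = getOrCreatePost(parentId);
            // Serialize concurrent writers on this subgraph before touching it further.
            tx.acquireWriteLock(parent);
            Node child = getOrCreatePost(childId);
            parent.createRelationshipTo(child, DynamicRelationshipType.withName("REPOST"))
                  .setProperty("ids", parentId + "_" + childId);
            tx.success();
        }
        return Response.ok().build();
    }

    // Get-or-create via the schema index on :Post(postId); must run inside a transaction.
    private Node getOrCreatePost(long postId) {
        ResourceIterator<Node> found =
                db.findNodesByLabelAndProperty(POST_LABEL, "postId", postId).iterator();
        try {
            if (found.hasNext()) {
                return found.next();
            }
        } finally {
            found.close();
        }
        Node post = db.createNode(POST_LABEL);
        post.setProperty("postId", postId);
        return post;
    }
}

The extension jar then goes on the server's classpath and is mounted in conf/neo4j-server.properties, e.g. org.neo4j.server.thirdparty_jaxrs_classes=org.example.unmanaged=/example (package and mount point again being this example's).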

For another example of writing an unmanaged extension (but one that uses cypher), check out this project. It even has a mode that might help you (POSTing CSV files to the server endpoint to be executed using a cypher statement per row).

https://github.com/jexp/cypher-rs
