Optimizing high volume batch inserts into Neo4j using REST

Problem description

I need to insert a huge amount of nodes with relationships between them into Neo4j via REST API's Batch endpoint, approx 5k records/s (still increasing).

This will be continuous insertion, 24x7. Each record may require creating one node only, but others may require two nodes and one relationship to be created.

Can I improve the performance of the inserts by changing my procedure or modifying the settings of Neo4j?

What I have done so far:

1. I have been testing with Neo4j for a while, but I cannot get the performance I need.

Test server box: 24 cores + 32GB RAM

Neo4j 2.0.0-M06 installed as a standalone service.

Running my Java application on the same server. (Neo4j and the Java app will need to run on their own servers in the future, so embedded mode cannot be used.)

REST API endpoint: /db/data/batch (target: /cypher)

Using schema indexes, constraints, MERGE, CREATE UNIQUE.

2. My schema:

neo4j-sh (0)$ schema
==> Indexes
==>   ON :REPLY(created_at)   ONLINE                             
==>   ON :REPLY(ids)          ONLINE (for uniqueness constraint) 
==>   ON :REPOST(created_at) ONLINE                             
==>   ON :REPOST(ids)        ONLINE (for uniqueness constraint) 
==>   ON :Post(userId)      ONLINE                             
==>   ON :Post(postId)    ONLINE (for uniqueness constraint) 
==> 
==> Constraints
==>   ON (post:Post) ASSERT post.postId IS UNIQUE
==>   ON (repost:REPOST) ASSERT repost.ids IS UNIQUE
==>   ON (reply:REPLY) ASSERT reply.ids IS UNIQUE

3. My Cypher queries and JSON requests

3.1. When a record requires only a single node to be created, the job description looks like this:

{"method" : "POST","to" : "/cypher","body" : {"query" : "MERGE (child:Post {postId:1001, userId:901})"}}

3.2. When a record requires two nodes and one relationship to be created, the job description looks like this:

{"method" : "POST","to" : "/cypher","body" : {"query" : "MERGE (parent:Post {postId:1002, userId:902}) MERGE (child:Post {postId:1003, userId:903}) CREATE UNIQUE parent-[relationship:REPOST {ids:'1002_1003', created_at:'Wed Nov 06 14:06:56 AST 2013' }]->child"}}

3.3. I normally send 100 job descriptions (a mix of 3.1 and 3.2) per batch, which takes about 150-250 ms to complete.
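For illustration, a minimal sketch of how one such mixed batch payload can be assembled (the class name and hard-coded values are just this example's; the /db/data/batch endpoint takes a single JSON array of job descriptions per request, and the HTTP client code is left out):

public class BatchPayload {
    public static void main(String[] args) {
        // One job description per record; ~100 of these go into the single JSON
        // array that is POSTed to http://localhost:7474/db/data/batch
        // with the header Content-Type: application/json.
        String job1 = "{\"method\":\"POST\",\"to\":\"/cypher\",\"id\":0,"
                + "\"body\":{\"query\":\"MERGE (child:Post {postId:1001, userId:901})\"}}";
        String job2 = "{\"method\":\"POST\",\"to\":\"/cypher\",\"id\":1,"
                + "\"body\":{\"query\":\"MERGE (parent:Post {postId:1002, userId:902}) "
                + "MERGE (child:Post {postId:1003, userId:903}) "
                + "CREATE UNIQUE parent-[:REPOST {ids:'1002_1003'}]->child\"}}";
        // The optional "id" fields only correlate each job with its result in the response.
        String payload = "[" + job1 + "," + job2 + "]";
        System.out.println(payload);
    }
}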

4. Performance issues

4.1. Concurrency:

/db/data/batch (target: /cypher) does not seem to be thread safe; tested with two or more concurrent threads, it brought the Neo4j server down within seconds to minutes.

4.2. MERGE with constraints does not always work.

When creating two nodes and one relationship with a single query (as in 3.2 above), it sometimes works like a charm, but it sometimes fails with a CypherExecutionException saying that one of the nodes xxxx already exists with label aaaa and property "bbbbb"=[ccccc]. As I understand it, MERGE is not supposed to throw an exception, but to return the node if it already exists.

As a result of the exception, the whole batch fails and rolls back, which hurts my insert rate.

I have opened a GitHub issue for this: https://github.com/neo4j/neo4j/issues/1428

4.3. CREATE UNIQUE with constraints doesn't always work for relationship creation.

This is mentioned in the same GitHub issue too.

4.4. Performance:

Actually, before using batch with Cypher, I tried the legacy indexing with get_or_create (/db/data/index/node/Post?uniqueness=get_or_create and /db/data/index/relationship/XXXXX?uniqueness=get_or_create).

Because of the nature of those legacy index endpoints (they return the location of the data in the index instead of the location of the data in the actual data store), I could not use them within a batch (I needed to refer to nodes created earlier in the same batch).

I know I could enable auto_indexing and deal with the data store directly instead of the legacy index, but from 2.0.0 onwards schema indexes are recommended over legacy indexes, so I decided to switch to the batch + Cypher + schema index approach.

HOWEVER, with batch + Cypher I can only get an insert rate of about 200 job descriptions per second. It would have been much higher if MERGE with constraints always worked, say about 600-800/s, but that is still much lower than 5k/s. I also tried schema indexes without any constraints, but that ended up with even lower insert performance.

Recommended answer

With 2.0 I would use the transactional endpoint to create your statements in batches, e.g. 100 or 1000 per HTTP request and about 30k-50k per transaction (until you commit).

See this for the format of the new streaming, transactional endpoint:

http://docs.neo4j.org/chunked/milestone/rest-api-transactional.html
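Roughly, driving that endpoint from Java looks like the sketch below (the transaction id, batch contents and URIs are placeholders; response parsing and error handling are omitted):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class TransactionalInsert {

    // POST one JSON payload of Cypher statements to the given URI and return the status code.
    static int post(String uri, String statementsJson) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(uri).openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "application/json");
        con.setRequestProperty("Accept", "application/json");
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write(statementsJson.getBytes("UTF-8"));
        }
        return con.getResponseCode();
    }

    public static void main(String[] args) throws Exception {
        // 1. Open a transaction with the first batch (100-1000 statements per request).
        String batch = "{\"statements\":[{"
                + "\"statement\":\"MERGE (p:Post {postId:{postId}, userId:{userId}})\","
                + "\"parameters\":{\"postId\":1001,\"userId\":901}}]}";
        post("http://localhost:7474/db/data/transaction", batch);

        // 2. Keep appending further batches to the open transaction; its URI
        //    (e.g. /db/data/transaction/42) comes back in the first response's
        //    Location header and "commit" field (parsing omitted here).
        // post("http://localhost:7474/db/data/transaction/42", nextBatch);

        // 3. After roughly 30k-50k statements, commit and open a new transaction.
        // post("http://localhost:7474/db/data/transaction/42/commit", "{\"statements\":[]}");
    }
}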

Also, for such a high-performance, continuous insertion endpoint I heartily recommend writing a server extension that runs against the embedded API and can easily insert 10k or more nodes and relationships per second; see here for the documentation:

http://docs.neo4j.org/chunked/milestone/server-unmanaged-extensions.html

For pure inserts you don't need Cypher. And for concurrency, just take a lock on a well-known node (per subgraph that you are inserting) so that concurrent inserts are not an issue; you can do that with tx.acquireWriteLock() or by removing a non-existent property from a node (REMOVE n.__lock__).
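A rough sketch of such an unmanaged extension (the package, resource path and the get-or-create helper are made up for this example; it locks the parent node, whereas a real setup might lock a dedicated, pre-existing node per subgraph as described above, and it assumes the 2.0 findNodesByLabelAndProperty lookup backed by the schema index on :Post(postId)):

package org.example.unmanaged;

import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.Response;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.ResourceIterator;
import org.neo4j.graphdb.Transaction;

@Path("/repost")
public class RepostResource {

    private static final Label POST_LABEL = DynamicLabel.label("Post");

    private final GraphDatabaseService db;

    public RepostResource(@Context GraphDatabaseService db) {
        this.db = db;
    }

    // POST /example/repost?parentId=1002&childId=1003 gets or creates both Post
    // nodes and links them with a REPOST relationship, all via the embedded API.
    @POST
    public Response createRepost(@QueryParam("parentId") long parentId,
                                 @QueryParam("childId") long childId) {
        try (Transaction tx = db.beginTx()) {
            Node parent = getOrCreatePost(parentId);
            // Serialize concurrent writers on this subgraph before touching it further.
            tx.acquireWriteLock(parent);
            Node child = getOrCreatePost(childId);
            parent.createRelationshipTo(child, DynamicRelationshipType.withName("REPOST"))
                  .setProperty("ids", parentId + "_" + childId);
            tx.success();
        }
        return Response.ok().build();
    }

    // Get-or-create via the schema index on :Post(postId); must run inside a transaction.
    private Node getOrCreatePost(long postId) {
        ResourceIterator<Node> found =
                db.findNodesByLabelAndProperty(POST_LABEL, "postId", postId).iterator();
        try {
            if (found.hasNext()) {
                return found.next();
            }
        } finally {
            found.close();
        }
        Node post = db.createNode(POST_LABEL);
        post.setProperty("postId", postId);
        return post;
    }
}

The extension jar then goes on the server's classpath and is mounted in conf/neo4j-server.properties, e.g. org.neo4j.server.thirdparty_jaxrs_classes=org.example.unmanaged=/example (package and mount point again being this example's).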

For another example of writing an unmanaged extension (but one that uses cypher), check out this project. It even has a mode that might help you (POSTing CSV files to the server endpoint to be executed using a cypher statement per row).

https://github.com/jexp/cypher-rs
