Optimizing high volume batch inserts into Neo4j using REST

Question

I need to insert a huge number of nodes, with relationships between them, into Neo4j via the REST API's batch endpoint, at approximately 5k records/s (still increasing).

This will be continuous insertion, 24x7. Each record may require creating only one node, but others may require two nodes and one relationship.

Can I improve the performance of the inserts by changing my procedure or modifying Neo4j's settings?

My progress so far:

1. I have been testing with Neo4j for a while, but I cannot get the performance I need.

Test server box: 24 cores + 32GB RAM

Neo4j 2.0.0-M06 installed as a standalone service.

My Java application runs on the same server. (Neo4j and the Java app will need to run on their own servers in the future, so embedded mode cannot be used.)

REST API endpoint: /db/data/batch (target: /cypher)

Using schema indexes, constraints, MERGE, and CREATE UNIQUE.

2. My schema:

neo4j-sh (0)$ schema
==> Indexes
==>   ON :REPLY(created_at)    ONLINE
==>   ON :REPLY(ids)           ONLINE (for uniqueness constraint)
==>   ON :REPOST(created_at)   ONLINE
==>   ON :REPOST(ids)          ONLINE (for uniqueness constraint)
==>   ON :Post(userId)         ONLINE
==>   ON :Post(postId)         ONLINE (for uniqueness constraint)
==>
==> Constraints
==>   ON (post:Post) ASSERT post.postId IS UNIQUE
==>   ON (repost:REPOST) ASSERT repost.ids IS UNIQUE
==>   ON (reply:REPLY) ASSERT reply.ids IS UNIQUE
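
For reference, a schema like this would be built in 2.0 with statements along the following lines (a sketch reconstructed from the listing above; note that each uniqueness constraint creates its backing index automatically):

CREATE CONSTRAINT ON (post:Post) ASSERT post.postId IS UNIQUE;
CREATE INDEX ON :Post(userId);
CREATE CONSTRAINT ON (reply:REPLY) ASSERT reply.ids IS UNIQUE;
CREATE INDEX ON :REPLY(created_at);
CREATE CONSTRAINT ON (repost:REPOST) ASSERT repost.ids IS UNIQUE;
CREATE INDEX ON :REPOST(created_at);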

3. My Cypher queries and JSON requests

3.1. When one record requires a single node to be created, the job description looks like this:

{"method" : "POST","to" : "/cypher","body" : {"query" : "MERGE (child:Post {postId:1001, userId:901})"}}

3.2. When one record requires two nodes and one relationship to be created, the job description looks like this:

{"method" : "POST","to" : "/cypher","body" : {"query" : "MERGE (parent:Post {postId:1002, userId:902}) MERGE (child:Post {postId:1003, userId:903}) CREATE UNIQUE parent-[relationship:REPOST {ids:'1002_1003', created_at:'Wed Nov 06 14:06:56 AST 2013' }]->child"}}

3.3. I normally send 100 job descriptions (a mix of 3.1 and 3.2) per batch, which takes about 150~250 ms to complete.
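
For illustration, the full batch payload is simply a JSON array of such job descriptions POSTed to /db/data/batch; the optional "id" field lets later jobs reference earlier results (a sketch reusing the queries from 3.1 and 3.2):

[
  {"method" : "POST", "to" : "/cypher", "id" : 0, "body" : {"query" : "MERGE (child:Post {postId:1001, userId:901})"}},
  {"method" : "POST", "to" : "/cypher", "id" : 1, "body" : {"query" : "MERGE (parent:Post {postId:1002, userId:902}) MERGE (child:Post {postId:1003, userId:903}) CREATE UNIQUE parent-[relationship:REPOST {ids:'1002_1003'}]->child"}}
]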

4. Performance problems

4.1. Concurrency:

/db/data/batch (target: /cypher) does not appear to be thread safe; testing with two or more concurrent threads brought the Neo4j server down within seconds to minutes.

4.2. MERGE with constraints does not always work.

When creating two nodes and one relationship with a single query (as in 3.2 above), it sometimes works like a charm, but sometimes fails with a CypherExecutionException saying that Node xxxx already exists with label aaaa and property "bbbbb"=[ccccc]. From my understanding, MERGE is not supposed to throw an exception; it should return the node if it already exists.

As a result of the exception, the whole batch fails and rolls back, which hurts my insert rate.

I have opened a GitHub issue for this: https://github.com/neo4j/neo4j/issues/1428

4.3. CREATE UNIQUE with constraints does not always work for relationship creation.

This is mentioned in the same GitHub issue.

4.4. Performance:

Actually, before using batch with Cypher, I tried legacy indexing with get_or_create (/db/data/index/node/Post?uniqueness=get_or_create and /db/data/index/relationship/XXXXX?uniqueness=get_or_create).

Because of the nature of those legacy index endpoints (they return the location of the data in the index rather than in the actual data store), I could not use them within a batch, where I needed to refer to nodes created earlier in the same batch.
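
For context, a legacy get_or_create call looked roughly like this (a sketch; the key, value, and properties are illustrative). The Location returned points into the index, which is why a later job in the same batch cannot use it as a node reference:

POST /db/data/index/node/Post?uniqueness=get_or_create
{"key" : "postId", "value" : "1001", "properties" : {"postId" : 1001, "userId" : 901}}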

I know I could enable auto_indexing and deal with the data store directly instead of using legacy indexes, but from 2.0.0 onward schema indexes are recommended over legacy indexes, so I decided to switch to the batch + Cypher + schema index approach.

HOWEVER, with batch + Cypher I can only get an insert rate of about 200 job descriptions per second. It would be much higher if MERGE with constraints always worked, say about 600~800/s, but that is still far below 5k/s. I also tried schema indexes without any constraints, which resulted in even lower insert performance.

Answer

With 2.0 I would use the transactional endpoint to create your statements in batches, e.g. 100 or 1000 per HTTP request and about 30k-50k per transaction (until you commit).

See this for the format of the new streaming, transactional endpoint:

http://docs.neo4j.org/chunked/milestone/rest-api-transactional.html
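
As a sketch of that wire format (the statement and parameter values are illustrative): open a transaction with an initial POST, keep sending statements to the transaction URL the server returns, then commit.

POST /db/data/transaction
{"statements" : [
  {"statement" : "MERGE (child:Post {postId: {postId}, userId: {userId}})",
   "parameters" : {"postId" : 1001, "userId" : 901}}
]}

The response names the open transaction (e.g. /db/data/transaction/7); further {"statements" : [...]} bodies go to that URL, and a final POST to /db/data/transaction/7/commit commits everything at once.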

Also, for such a high-performance, continuous insertion workload, I heartily recommend writing a server extension that runs against the embedded API and can easily insert 10k or more nodes and relationships per second; see the documentation here:

http://docs.neo4j.org/chunked/milestone/server-unmanaged-extensions.html

For pure inserts you don't need Cypher. And for concurrency, just take a lock on a well-known node (one per subgraph you are inserting into) so that concurrent inserts are no issue; you can do that with tx.acquireWriteLock() or by removing a non-existent property from a node (REMOVE n.__lock__). A sketch of such an extension follows.
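
Below is a minimal sketch of such an unmanaged extension against the 2.0 embedded API. The resource path, the elided request parsing, and the use of a dedicated lock node looked up by id are assumptions for illustration, not part of the answer:

import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.Response;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

@Path("/posts")
public class PostInsertResource {
    private static final Label POST = DynamicLabel.label("Post");
    private final GraphDatabaseService db;

    // The server injects the embedded database into the resource.
    public PostInsertResource(@Context GraphDatabaseService db) {
        this.db = db;
    }

    // Accepts one repost record; a real endpoint would parse a JSON batch.
    @POST
    public Response insertRepost() {
        long parentPostId = 1002, childPostId = 1003; // illustrative values
        try (Transaction tx = db.beginTx()) {
            // Serialize writers on this subgraph: lock a well-known node.
            Node lockNode = db.getNodeById(0); // assumption: node 0 is a dedicated lock node
            tx.acquireWriteLock(lockNode);

            // Pure inserts via the core API, no Cypher involved.
            Node parent = db.createNode(POST);
            parent.setProperty("postId", parentPostId);
            Node child = db.createNode(POST);
            child.setProperty("postId", childPostId);
            parent.createRelationshipTo(child, DynamicRelationshipType.withName("REPOST"))
                  .setProperty("ids", parentPostId + "_" + childPostId);

            tx.success();
        }
        return Response.ok().build();
    }
}

The class is packaged as a JAR on the server's classpath and registered under a mount point via org.neo4j.server.thirdparty_jaxrs_classes in conf/neo4j-server.properties.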

For another example of writing an unmanaged extension (one that uses Cypher, though), check out this project. It even has a mode that might help you: POSTing CSV files to a server endpoint to be executed with one Cypher statement per row.

https://github.com/jexp/cypher-rs
