Optimizing high volume batch inserts into Neo4j using REST

Question

I need to insert a huge number of nodes, with relationships between them, into Neo4j via the REST API's batch endpoint, at approximately 5k records/s (still increasing).

This will be continuous insertion, 24x7. Each record may require creating only one node, but others may require two nodes and one relationship.

Can I improve the performance of the inserts by changing my procedure or modifying Neo4j's settings?

My progress so far:

1. I have been testing with Neo4j for a while, but I cannot get the performance I need.

Test server box: 24 cores + 32GB RAM

Neo4j 2.0.0-M06 installed as a standalone service.

My Java application runs on the same server. (Neo4j and the Java app will need to run on their own servers in the future, so embedded mode cannot be used.)

REST API endpoint: /db/data/batch (target: /cypher)

Using schema indexes, constraints, MERGE, and CREATE UNIQUE.

2. My schema:

neo4j-sh (0)$ schema
==> Indexes
==>   ON :REPLY(created_at)    ONLINE
==>   ON :REPLY(ids)           ONLINE (for uniqueness constraint)
==>   ON :REPOST(created_at)   ONLINE
==>   ON :REPOST(ids)          ONLINE (for uniqueness constraint)
==>   ON :Post(userId)         ONLINE
==>   ON :Post(postId)         ONLINE (for uniqueness constraint)
==>
==> Constraints
==>   ON (post:Post) ASSERT post.postId IS UNIQUE
==>   ON (repost:REPOST) ASSERT repost.ids IS UNIQUE
==>   ON (reply:REPLY) ASSERT reply.ids IS UNIQUE
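
For reference, a schema like this would be built in 2.0 with statements along the following lines (a sketch reconstructed from the listing above; note that each uniqueness constraint creates its backing index automatically):

CREATE CONSTRAINT ON (post:Post) ASSERT post.postId IS UNIQUE;
CREATE INDEX ON :Post(userId);
CREATE CONSTRAINT ON (reply:REPLY) ASSERT reply.ids IS UNIQUE;
CREATE INDEX ON :REPLY(created_at);
CREATE CONSTRAINT ON (repost:REPOST) ASSERT repost.ids IS UNIQUE;
CREATE INDEX ON :REPOST(created_at);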

3. My Cypher queries and JSON requests

3.1. When one record requires a single node to be created, the job description looks like this:

{"method" : "POST","to" : "/cypher","body" : {"query" : "MERGE (child:Post {postId:1001, userId:901})"}}

3.2. When one record requires two nodes and one relationship to be created, the job description looks like this:

{"method" : "POST","to" : "/cypher","body" : {"query" : "MERGE (parent:Post {postId:1002, userId:902}) MERGE (child:Post {postId:1003, userId:903}) CREATE UNIQUE parent-[relationship:REPOST {ids:'1002_1003', created_at:'Wed Nov 06 14:06:56 AST 2013' }]->child"}}

3.3. I normally send 100 job descriptions (a mix of 3.1 and 3.2) per batch, which takes about 150~250 ms to complete.
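
For illustration, the full batch payload is simply a JSON array of such job descriptions POSTed to /db/data/batch; the optional "id" field lets later jobs reference earlier results (a sketch reusing the queries from 3.1 and 3.2):

[
  {"method" : "POST", "to" : "/cypher", "id" : 0, "body" : {"query" : "MERGE (child:Post {postId:1001, userId:901})"}},
  {"method" : "POST", "to" : "/cypher", "id" : 1, "body" : {"query" : "MERGE (parent:Post {postId:1002, userId:902}) MERGE (child:Post {postId:1003, userId:903}) CREATE UNIQUE parent-[relationship:REPOST {ids:'1002_1003'}]->child"}}
]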

4. Performance problems

4.1. Concurrency:

/db/data/batch (target: /cypher) does not appear to be thread safe; testing with two or more concurrent threads brought the Neo4j server down within seconds to minutes.

4.2. MERGE with constraints does not always work.

When creating two nodes and one relationship with a single query (as in 3.2 above), it sometimes works like a charm, but sometimes fails with a CypherExecutionException saying that Node xxxx already exists with label aaaa and property "bbbbb"=[ccccc]. From my understanding, MERGE is not supposed to throw an exception; it should return the node if it already exists.

As a result of the exception, the whole batch fails and rolls back, which hurts my insert rate.

I have opened a GitHub issue for this: https://github.com/neo4j/neo4j/issues/1428

4.3. CREATE UNIQUE with constraints does not always work for relationship creation.

This is mentioned in the same GitHub issue.

4.4. Performance:

Actually, before using batch with Cypher, I tried legacy indexing with get_or_create (/db/data/index/node/Post?uniqueness=get_or_create and /db/data/index/relationship/XXXXX?uniqueness=get_or_create).

Because of the nature of those legacy index endpoints (they return the location of the data in the index rather than in the actual data store), I could not use them within a batch, where I needed to refer to nodes created earlier in the same batch.
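
For context, a legacy get_or_create call looked roughly like this (a sketch; the key, value, and properties are illustrative). The Location returned points into the index, which is why a later job in the same batch cannot use it as a node reference:

POST /db/data/index/node/Post?uniqueness=get_or_create
{"key" : "postId", "value" : "1001", "properties" : {"postId" : 1001, "userId" : 901}}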

I know I could enable auto_indexing and deal with the data store directly instead of using legacy indexes, but from 2.0.0 onward schema indexes are recommended over legacy indexes, so I decided to switch to the batch + Cypher + schema index approach.

HOWEVER, with batch + Cypher I can only get an insert rate of about 200 job descriptions per second. It would be much higher if MERGE with constraints always worked, say about 600~800/s, but that is still far below 5k/s. I also tried schema indexes without any constraints, which resulted in even lower insert performance.

Answer

With 2.0 I would use the transactional endpoint to create your statements in batches, e.g. 100 or 1000 per HTTP request and about 30k-50k per transaction (until you commit).

See this for the format of the new streaming, transactional endpoint:

http://docs.neo4j.org/chunked/milestone/rest-api-transactional.html
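
As a sketch of that wire format (the statement and parameter values are illustrative): open a transaction with an initial POST, keep sending statements to the transaction URL the server returns, then commit.

POST /db/data/transaction
{"statements" : [
  {"statement" : "MERGE (child:Post {postId: {postId}, userId: {userId}})",
   "parameters" : {"postId" : 1001, "userId" : 901}}
]}

The response names the open transaction (e.g. /db/data/transaction/7); further {"statements" : [...]} bodies go to that URL, and a final POST to /db/data/transaction/7/commit commits everything at once.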

Also, for such a high-performance, continuous insertion workload, I heartily recommend writing a server extension that runs against the embedded API and can easily insert 10k or more nodes and relationships per second; see the documentation here:

http://docs.neo4j.org/chunked/milestone/server-unmanaged-extensions.html

For pure inserts you don't need Cypher. And for concurrency, just take a lock on a well-known node (one per subgraph you are inserting into) so that concurrent inserts are no issue; you can do that with tx.acquireWriteLock() or by removing a non-existent property from a node (REMOVE n.__lock__). A sketch of such an extension follows.
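
Below is a minimal sketch of such an unmanaged extension against the 2.0 embedded API. The resource path, the elided request parsing, and the use of a dedicated lock node looked up by id are assumptions for illustration, not part of the answer:

import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.Response;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

@Path("/posts")
public class PostInsertResource {
    private static final Label POST = DynamicLabel.label("Post");
    private final GraphDatabaseService db;

    // The server injects the embedded database into the resource.
    public PostInsertResource(@Context GraphDatabaseService db) {
        this.db = db;
    }

    // Accepts one repost record; a real endpoint would parse a JSON batch.
    @POST
    public Response insertRepost() {
        long parentPostId = 1002, childPostId = 1003; // illustrative values
        try (Transaction tx = db.beginTx()) {
            // Serialize writers on this subgraph: lock a well-known node.
            Node lockNode = db.getNodeById(0); // assumption: node 0 is a dedicated lock node
            tx.acquireWriteLock(lockNode);

            // Pure inserts via the core API, no Cypher involved.
            Node parent = db.createNode(POST);
            parent.setProperty("postId", parentPostId);
            Node child = db.createNode(POST);
            child.setProperty("postId", childPostId);
            parent.createRelationshipTo(child, DynamicRelationshipType.withName("REPOST"))
                  .setProperty("ids", parentPostId + "_" + childPostId);

            tx.success();
        }
        return Response.ok().build();
    }
}

The class is packaged as a JAR on the server's classpath and registered under a mount point via org.neo4j.server.thirdparty_jaxrs_classes in conf/neo4j-server.properties.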

For another example of writing an unmanaged extension (one that uses Cypher, though), check out this project. It even has a mode that might help you: POSTing CSV files to a server endpoint to be executed with one Cypher statement per row.

https://github.com/jexp/cypher-rs
