Import of large dataset in Neo4j taking really long (>12 hours) with Neo4j import tool


Question

I have a large dataset (about 1B nodes and a few billion relationships) that I am trying to import into Neo4j. I am using the Neo4j import tool. The nodes finished importing in an hour, but since then the importer has been stuck in a node index preparation phase (unless I am reading the output below incorrectly) for over 12 hours.

...
Available memory:
  Free machine memory: 184.49 GB
  Max heap memory: 26.52 GB

Nodes
[>:23.39 MB/s---|PROPERTIE|NODE:|LAB|*v:37.18 MB/s---------------------------------------------] 1B
Done in 1h 7m 18s 54ms
Prepare node index
[*SORT:11.52 GB--------------------------------------------------------------------------------] 881M
...

My question is: how can I speed this up? I am thinking the following:

1. Split up the import command for nodes and relationships and do the nodes import first.
2. Create indexes on the nodes.
3. Do a merge/match to get rid of dupes (an out-of-database alternative is sketched below).
4. Do the rels import.

Will this help? Is there something else I should try? Is the heap size too large (I think not, but would like an opinion)?
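
On step 3, an alternative worth considering is to drop duplicates from the node CSV before the import rather than with merge/match inside the database. A minimal sketch of the idea, assuming a pipe-delimited, gzipped node file with the ID in the first column (the file names are illustrative, not the actual paths):

import gzip

SRC = "data/author/author_half_label.csv.gz"  # illustrative input path
DST = "data/author/author_dedup.csv.gz"       # illustrative output path

seen = set()  # every ID seen so far; has to fit in RAM
with gzip.open(SRC, "rt") as src, gzip.open(DST, "wt") as dst:
    for line in src:
        node_id = line.split("|", 1)[0]  # ID is the first pipe-delimited field
        if node_id not in seen:
            seen.add(node_id)
            dst.write(line)

At 1B nodes the set itself becomes a bottleneck, so an external sort on the ID column (e.g. GNU sort) would be the more realistic route; the sketch only shows the idea.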

Thanks.

UPDATE
I also tried importing exactly half that data on the same machine, and it gets stuck again in that phase after roughly the same amount of time (proportionally). So I have mostly eliminated disk space and memory as the issue.
I have also checked my headers (since I noticed that other people ran into this problem when they had incorrect headers) and they seem correct to me. Any suggestions on what else I should be looking at?
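
One cheap sanity check for the headers, beyond eyeballing them, is to confirm that the header file and the data rows agree on the delimiter and field count. A minimal sketch, assuming a pipe delimiter, a one-line header file, and a gzipped data file (paths are illustrative):

import gzip

HEADER_FILE = "data/author/author_header_label.csv"  # illustrative paths
DATA_FILE = "data/author/author_half_label.csv.gz"

with open(HEADER_FILE) as f:
    header_fields = f.readline().rstrip("\n").split("|")
print("header fields:", header_fields)

# Spot-check a prefix of the data rather than scanning the whole file.
with gzip.open(DATA_FILE, "rt") as f:
    for lineno, line in enumerate(f, start=1):
        n = len(line.rstrip("\n").split("|"))
        if n != len(header_fields):
            print("line", lineno, ":", n, "fields, expected", len(header_fields))
        if lineno >= 100000:
            break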

FURTHER UPDATE
Ok so now it is getting kind of ridiculous. I reduced my data size down to just one large file (about 3G). It only contains nodes of a single kind and only has IDs. So the data looks something like this:

1|Author
2|Author
3|Author

and the header (in a separate file) looks like this:

authorID:ID(Author)|:LABEL

And my import still gets stuck in the sort phase. I am pretty sure I am doing something wrong here, but I really have no clue what. Here is the command line I use to invoke it:
/var/lib/neo4j/bin/neo4j-import --into data/db/graph.db --id-type string --delimiter "|" --bad-tolerance 1000000000 --skip-duplicate-nodes true --stacktrace true --ignore-empty-strings true --nodes:Author "data/author/author_header_label.csv,data/author/author_half_label.csv.gz"


Most of the options, like bad-tolerance and skip-duplicate-nodes, are there to see if I can make it get through the import at least once.
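
Given that skip-duplicate-nodes is already needed, a cheap diagnostic before another long run is to count duplicate IDs in the input directly, since the sort phase is exactly where duplicates hurt. A minimal sketch under the same assumptions as above (pipe delimiter, ID in the first column; the path is the one from the command line):

import gzip
from collections import Counter

DATA_FILE = "data/author/author_half_label.csv.gz"  # file from the command above

counts = Counter()  # keeps every distinct ID in memory
with gzip.open(DATA_FILE, "rt") as f:
    for line in f:
        counts[line.split("|", 1)[0]] += 1  # first field is the node ID

dupes = sum(c - 1 for c in counts.values() if c > 1)
print(len(counts), "distinct IDs,", dupes, "duplicate rows")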

Answer

I think I found the issue. I was using some of the tips at
http://neo4j.com/developer/guide-import-csv/#_super_fast_batch_importer_for_huge_datasets, where it says you can re-use the same CSV file with different headers -- once for nodes and once for relationships. I underestimated how 1-to-n the data was, which produced a lot of duplicate IDs, so that stage was spent almost entirely sorting and then deduplicating them. Re-working my queries to extract the data into separate nodes and rels files fixed the problem. Thanks for looking into this!
So basically, keeping separate files for each type of node and rel gives the fastest results (at least in my tests).
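
To make that concrete, here is a minimal sketch of such a split. It assumes, purely for illustration (the actual schema isn't shown here), a combined pipe-delimited extract of the form start_id|end_id where both endpoints are Author nodes; each node ID is written exactly once, and the relationship rows are passed through untouched:

import gzip

COMBINED = "data/combined.csv.gz"       # hypothetical combined extract
NODES_OUT = "data/author/nodes.csv.gz"  # deduplicated node IDs with label
RELS_OUT = "data/author/rels.csv.gz"    # start/end pairs, unchanged

seen = set()
with gzip.open(COMBINED, "rt") as src, \
     gzip.open(NODES_OUT, "wt") as nodes, \
     gzip.open(RELS_OUT, "wt") as rels:
    for line in src:
        start, end = line.rstrip("\n").split("|")[:2]
        for node_id in (start, end):
            if node_id not in seen:  # write each node exactly once
                seen.add(node_id)
                nodes.write(node_id + "|Author\n")
        rels.write(line)

Each output file then gets its own small header file for neo4j-import, in the same style as the header shown in the question.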
