Import of large dataset in Neo4j taking really long (>12 hours) with Neo4j import tool


Question

I have a large dataset (about 1B nodes and a few billion relationships) that I am trying to import into Neo4j using the Neo4j import tool. The nodes finished importing in an hour, but since then the importer has been stuck in a node index preparation phase (unless I am reading the output below incorrectly) for over 12 hours now.

... Available memory: Free machine memory: 184.49 GB Max heap memory: 26.52 GB

Nodes [>:23.39 MB/s---|PROPERTIE|NODE:|LAB|*v:37.18 MB/s---------------------------------------------] 1B Done in 1h 7m 18s 54ms
Prepare node index [*SORT:11.52 GB--------------------------------------------------------------------------------] 881M ...

My question is how can I speed this up? I am thinking of the following:
1. Split up the import command for nodes and relationships and do the nodes import first.
2. Create indexes on the nodes.
3. Do a merge/match to get rid of the duplicates (see the sketch after this list).
4. Do the relationships import.
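For illustration, a minimal sketch of what steps 2 and 3 could look like as Cypher run through cypher-shell (the Author label, authorID property, and credentials are assumptions taken from the example further down, not a worked-out plan):

# step 2: index the property used as the node ID so duplicate lookups are not full label scans
echo 'CREATE INDEX ON :Author(authorID);' | cypher-shell -u neo4j -p <password>

# step 3: collapse duplicate authorID values, keeping one node per value
cypher-shell -u neo4j -p <password> <<'CYPHER'
MATCH (a:Author)
WITH a.authorID AS id, collect(a) AS dupes
WHERE size(dupes) > 1
UNWIND tail(dupes) AS extra
DETACH DELETE extra;
CYPHER

On a graph of ~1B nodes a global dedupe pass like this would itself be very slow, which is part of why I am not sure this plan is the right one.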

Will this help? Is there something else I should try? Is the heap size too large (I think not, but would like an opinion)?

Thanks.

UPDATE
I also tried importing exactly half that data on the same machine, and it gets stuck again in that phase after roughly the same (proportional) amount of time. So I have mostly eliminated disk space and memory as the issue.
I have also checked my headers (since I noticed that other people ran into this problem when they had incorrect headers) and they seem correct to me. Any suggestions on what else I should be looking at?

FURTHER UPDATE
Ok, so now it is getting kind of ridiculous. I reduced my data size down to just one large file (about 3 GB). It only contains nodes of a single kind and only has IDs, so the data looks something like this:

1|Author
2|Author
3|Author

and the header (in a separate file) looks like this:

authorID:ID(Author)|:LABEL

And my import still gets stuck in the sort phase. I am pretty sure I am doing something wrong here, but I really have no clue what. Here is the command line I use to invoke it:

/var/lib/neo4j/bin/neo4j-import --into data/db/graph.db --id-type string --delimiter "|" \
  --bad-tolerance 1000000000 --skip-duplicate-nodes true --stacktrace true --ignore-empty-strings true \
  --nodes:Author "data/author/author_header_label.csv,data/author/author_half_label.csv.gz"


Most of the options such as bad-tolerance and skip-duplicate-nodes are there to see if I can make it get through the import at least once.
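One quick way to check whether a data file like this actually contains duplicate ID values (using the same '|' delimiter and the gzipped data file from the command above) is something like:

# count how many ID values (first column) occur more than once
zcat data/author/author_half_label.csv.gz | cut -d'|' -f1 | sort | uniq -d | wc -l

If this prints 0, the sort/dedupe phase has no duplicates to deal with and the slowness lies elsewhere.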

Answer

I think I found the issue. I was using some of the tips here:
http://neo4j.com/developer/guide-import-csv/#_super_fast_batch_importer_for_huge_datasets
where it says I can re-use the same CSV file with different headers -- once for nodes and once for relationships. I underestimated how one-to-many the data I was using is, which caused a lot of duplicates on the ID, so that stage was spending basically all of its time trying to sort and then dedupe. Re-working my queries to extract the data split into separate node and relationship files fixed the problem. Thanks for looking into this!
So basically, ideally, always having separate files for each type of node and relationship will give the fastest results (at least in my tests).
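For illustration, this is roughly what the split layout and invocation end up looking like (the Paper label, the WROTE relationship type, and all file names other than graph.db and the author header file are made up for the example):

# relationship header (e.g. data/rel/wrote_header.csv), reusing the same ID groups as the node headers:
#   :START_ID(Author)|:END_ID(Paper)
/var/lib/neo4j/bin/neo4j-import --into data/db/graph.db --id-type string --delimiter "|" \
  --nodes:Author "data/author/author_header_label.csv,data/author/authors.csv.gz" \
  --nodes:Paper "data/paper/paper_header_label.csv,data/paper/papers.csv.gz" \
  --relationships:WROTE "data/rel/wrote_header.csv,data/rel/wrote.csv.gz"

The important part is that each node file contains every ID exactly once, and all of the repetition lives in the relationship files.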

