Import billions of nodes and relationships to Neo4j using Batch Import on Windows

Problem description

I want to insert a few billion nodes and relationships into Neo4j. Although I have 16 GB of RAM, "LOAD CSV" is cancelled by the browser (Chrome) after 30 minutes because the working memory is overloaded.

Apparently, large datasets can be imported into Neo4j using the Batch Importer (Documentation & Download, Explanation for Linux).

To simply use it (no source/git/maven required):

1. download 2.2 zip
2. unzip
3. run import.sh test.db nodes.csv rels.csv (on Windows: import.bat)
4. after the import point your /path/to/neo4j/conf/neo4j-server.properties 
to this test.db directory, or copy the data over to your server cp -r 
test.db/* /path/to/neo4j/data/graph.db/

You provide one tab separated csv file for nodes and one for 
relationships (optionally more for indexes)

I am struggling to use the plugin on Windows. In the Linux video by Rik Van Bruggen (link above), he mentions "installation of the batch importer".

I unzipped the file "download 2.2 zip". I have my CSVs in another folder. How do I use the "import.bat" command mentioned in the documentation on Windows? In cmd the command can't be found...

Solution

Before you reach for the tool for gigantic datasets, I can suggest a few things I just learned while importing millions of nodes in a few minutes (Neo4j Community Edition for Windows).

Regarding Neo4j import tips:

  • Don't use the web interface to import datasets this big; memory overload is inevitable.

  • Instead, use a programming language to interact with Neo4j (I recently used the official Python module and it's simple to learn, but you can do the same with good old Java).

  • Before using LOAD CSV, remember to add the USING PERIODIC COMMIT instruction so that a large dataset is committed in batches rather than all at once.

  • Before importing relationships from CSV, remember to use CREATE CONSTRAINT ON <...> ASSERT <...> IS UNIQUE for the key properties of your labels. It has a huge impact on relationship creation.

  • When creating relationships, find the endpoint nodes with MATCH(...) rather than CREATE(...). This avoids duplicates.
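The tips above can be sketched as a small Python helper that only builds the Cypher statements. This is a minimal illustration, not the answerer's actual script: the label `Person`, the key `id`, the relationship type `KNOWS`, the CSV column names `start_id`/`end_id`, the file URLs and the batch size of 10000 are all hypothetical, and the constraint uses the Neo4j 3.x syntax quoted in the list.

```python
# Sketch only: labels, property names, CSV columns and file URLs are hypothetical.

def constraint_query(label: str, key: str) -> str:
    """Uniqueness constraint on the key property (Neo4j 3.x syntax)."""
    return f"CREATE CONSTRAINT ON (n:{label}) ASSERT n.{key} IS UNIQUE"


def load_nodes_query(csv_url: str, label: str, key: str, batch: int = 10000) -> str:
    """LOAD CSV for nodes, committing every `batch` rows."""
    return (
        f"USING PERIODIC COMMIT {batch}\n"
        f"LOAD CSV WITH HEADERS FROM '{csv_url}' AS row\n"
        f"CREATE (:{label} {{{key}: row.{key}}})"
    )


def load_rels_query(csv_url: str, label: str, key: str, rel_type: str,
                    batch: int = 10000) -> str:
    """LOAD CSV for relationships: MATCH both endpoints, then link them."""
    return (
        f"USING PERIODIC COMMIT {batch}\n"
        f"LOAD CSV WITH HEADERS FROM '{csv_url}' AS row\n"
        f"MATCH (a:{label} {{{key}: row.start_id}}), (b:{label} {{{key}: row.end_id}})\n"
        f"CREATE (a)-[:{rel_type}]->(b)"
    )


def import_plan() -> list:
    """Statements in the order suggested above: constraint, nodes, relationships."""
    return [
        constraint_query("Person", "id"),
        load_nodes_query("file:///nodes.csv", "Person", "id"),
        load_rels_query("file:///rels.csv", "Person", "id", "KNOWS"),
    ]


if __name__ == "__main__":
    for statement in import_plan():
        print(statement, end="\n\n")
```

Each statement could then be sent to the server from the official Python driver, one at a time. Note that a LOAD CSV statement with USING PERIODIC COMMIT has to run as an auto-commit statement, not inside an explicit transaction.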

Regarding Neo4j performance:

  • First of all, read the official Neo4j page on performance tuning: https://neo4j.com/docs/operations-manual/current/performance/

  • Set a proper memory configuration for your Windows machine: if necessary, set the dbms.memory.pagecache.size parameter manually (in the neo4j.conf file).

  • Remember: the Java Virtual Machine is not a black box; you can tune its performance specifically for your application (by editing the neo4j-community.vmoptions file). For example, you can set the maximum memory usage of the JVM (the -Xmx parameter), and you can set the -XX:+UseG1GC parameter to use the G1 Garbage Collector (high performance, suggested by Oracle for production environments) (https://docs.oracle.com/cd/E40972_01/doc.70/e40973/cnf_jvmgc.htm#autoId0)

Here are the custom lines from my neo4j.conf (just for reference; this may be the wrong setup for your application, so beware):

dbms.memory.pagecache.size=3g
dbms.jvm.additional=-XX:+UseG1GC
dbms.jvm.additional=-XX:-OmitStackTraceInFastThrow
dbms.jvm.additional=-XX:+AlwaysPreTouch
dbms.jvm.additional=-XX:+UnlockExperimentalVMOptions
dbms.jvm.additional=-XX:+TrustFinalNonStaticFields
dbms.jvm.additional=-XX:+DisableExplicitGC

And the custom lines from my neo4j-community.vmoptions (again, just for reference):

-Xmx1024m
-XX:+UseG1GC
-XX:-OmitStackTraceInFastThrow
-XX:+AlwaysPreTouch
-XX:+UnlockExperimentalVMOptions
-XX:+TrustFinalNonStaticFields
-XX:+DisableExplicitGC

My test machine is a weak notebook with a Core i3 (dual core), 8 GB of RAM, Windows 10 and Neo4j 3.2.1 Community Edition.
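As a rough sanity check, the memory settings quoted above can be added up against the machine's RAM. The sketch below takes the 3g page cache, the 1024m heap limit and the 8 GB of RAM from this answer; the 1 GiB reserved for the operating system is an assumption made only for illustration, and the helper `to_mib` is hypothetical.

```python
def to_mib(size: str) -> int:
    """Convert a size string such as '3g' or '1024m' to MiB."""
    units = {"m": 1, "g": 1024}
    return int(size[:-1]) * units[size[-1].lower()]


PAGE_CACHE = to_mib("3g")     # dbms.memory.pagecache.size in neo4j.conf
HEAP_MAX = to_mib("1024m")    # -Xmx in neo4j-community.vmoptions
TOTAL_RAM = to_mib("8g")      # RAM of the test machine
OS_RESERVE = to_mib("1g")     # assumption: memory left to Windows itself

# Whatever remains after page cache, heap and OS reserve is headroom.
headroom = TOTAL_RAM - (PAGE_CACHE + HEAP_MAX + OS_RESERVE)
print(f"Headroom: {headroom} MiB")
```

A negative headroom would mean the configured sizes cannot all fit in physical memory at once.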

I was able to import 7 million nodes in less than 3 minutes and 3.5 million relationships in less than 5 minutes (no recursive relationships).
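To put those figures next to the "billions" in the question, here is a back-of-the-envelope extrapolation. It assumes the rate stays constant as the dataset grows, which is not guaranteed, and it treats "less than 3 minutes" and "less than 5 minutes" as exactly 3 and 5 minutes, so the rates are lower bounds. The one-billion target is just an example size.

```python
def rate_per_second(count: int, minutes: float) -> float:
    """Items imported per second, given a count and a duration in minutes."""
    return count / (minutes * 60)


def hours_for(count: int, rate: float) -> float:
    """Hours needed to import `count` items at `rate` items per second."""
    return count / rate / 3600


node_rate = rate_per_second(7_000_000, 3)   # nodes per second on the test notebook
rel_rate = rate_per_second(3_500_000, 5)    # relationships per second

ONE_BILLION = 1_000_000_000  # example target, not a figure from the answer
print(f"nodes: {node_rate:,.0f}/s, 1e9 nodes in about {hours_for(ONE_BILLION, node_rate):.1f} h")
print(f"rels:  {rel_rate:,.0f}/s, 1e9 rels in about {hours_for(ONE_BILLION, rel_rate):.1f} h")
```

At these rates, a billion nodes would take roughly 7 hours and a billion relationships roughly a day on the same notebook.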

On a more capable machine, with a specifically crafted setup, Neo4j can do WAY better than this. Hope it helps.
