Import billions of nodes and relationships to Neo4j using Batch Import on Windows
Problem description
I want to insert a few billion nodes and relationships into Neo4j. Although I have 16 GB of RAM, "LOAD CSV" is cancelled by the browser (Chrome) after 30 minutes because the working memory is overloaded.
Large datasets can apparently be imported into Neo4j using the Batch Importer (Documentation & Download, Explanation for Linux).
To simply use it (no source/git/maven required):
1. download 2.2 zip
2. unzip
3. run import.sh test.db nodes.csv rels.csv (on Windows: import.bat)
4. after the import point your /path/to/neo4j/conf/neo4j-server.properties to this test.db directory, or copy the data over to your server: cp -r test.db/* /path/to/neo4j/data/graph.db/
You provide one tab separated csv file for nodes and one for relationships (optionally more for indexes)
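As a rough illustration of the "one tab separated csv file for nodes and one for relationships" requirement, here is a minimal Python sketch that writes two such files. The column names (name, age, start, end, type) and the idea that relationships refer to nodes by row position are assumptions for illustration only; check the batch importer's own documentation for the header format it actually expects.

```python
import csv
import tempfile
from pathlib import Path


def write_tsv(path, header, rows):
    """Write a tab-separated file: one header line followed by the data rows."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def write_example_files(out_dir):
    """Create tiny nodes.csv and rels.csv files (hypothetical columns)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    nodes_path = out_dir / "nodes.csv"
    rels_path = out_dir / "rels.csv"
    write_tsv(nodes_path, ["name", "age"], [["Alice", 34], ["Bob", 29]])
    # Assumption: start/end identify nodes by their row position in nodes.csv.
    write_tsv(rels_path, ["start", "end", "type"], [[0, 1, "KNOWS"]])
    return nodes_path, rels_path


if __name__ == "__main__":
    # Writes into a fresh temporary directory and prints the two file paths.
    print(write_example_files(tempfile.mkdtemp()))
```

For billions of rows the same writer can be fed from a generator, so the data never has to sit in memory all at once.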
I am struggling to use the plugin on Windows. In the Linux video by Rik Van Bruggen (link above) he mentions "installation of the batch importer".
I unzipped the file "download 2.2 zip". I have my CSVs in another folder. How do I use the "import.bat" command mentioned in the documentation on Windows? In cmd the command can't be found...
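One possible reading of the "command can't be found" symptom is that cmd was not started in the unzipped folder, so import.bat is not on the search path. The sketch below assumes that import.bat sits at the top level of the unzipped folder and that it takes the same three arguments as import.sh; it addresses the script by its full path. All directory names are hypothetical placeholders.

```python
import subprocess
from pathlib import PureWindowsPath


def build_import_command(importer_dir, db_dir, nodes_csv, rels_csv):
    """Return a cmd.exe invocation that calls import.bat by its full path."""
    bat = PureWindowsPath(importer_dir) / "import.bat"
    # "cmd /c" runs the batch file and then exits.
    return ["cmd", "/c", str(bat), str(db_dir), str(nodes_csv), str(rels_csv)]


def run_import(importer_dir, db_dir, nodes_csv, rels_csv):
    """Run the importer from inside its own folder (Windows only)."""
    command = build_import_command(importer_dir, db_dir, nodes_csv, rels_csv)
    # Assumption: the script locates its jar files relative to its own folder.
    subprocess.run(command, cwd=str(importer_dir), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_import(...) on Windows to execute it.
    print(" ".join(build_import_command(
        r"C:\tools\batch_importer_22",  # hypothetical unzip location
        "test.db",
        r"C:\data\nodes.csv",           # hypothetical CSV locations
        r"C:\data\rels.csv",
    )))
```

The same effect can be had by hand: cd into the unzipped folder in cmd and pass the full paths of the CSV files as arguments.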
Before you use the tool for gigantic datasets, I can suggest a few things I have just learned while importing millions of nodes in a few minutes (Neo4j Community Edition for Windows).
Regarding Neo4j import tips:
- Don't use the web interface to import datasets this large; memory overload is inevitable.
- Instead, use a programming language to interact with Neo4j (I recently used the official Python module, which is simple to learn, but you can do the same with good old Java).
- Before LOAD CSV, remember to write the USING PERIODIC COMMIT instruction, so that a big dataset is committed in batches instead of in one huge transaction.
- Before importing relationships from CSV, remember to use CREATE CONSTRAINT ON <...> ASSERT <...> IS UNIQUE for the key properties of your labels. It has a huge impact on relationship creation.
- When creating relationships, look up the nodes with MATCH(...), not CREATE(...). This avoids duplicates.
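The tips above might be combined as in the following sketch. It uses the Cypher syntax of Neo4j 3.x, the version this answer was written for; newer releases changed both the constraint syntax and periodic commit. The Person label, the KNOWS relationship type, the column names and the file names are all hypothetical, and the batch size of 10000 is an arbitrary choice. The import path `from neo4j import GraphDatabase` assumes a reasonably recent official Python driver; older 1.x releases exposed it as `neo4j.v1`.

```python
# Cypher statements in Neo4j 3.x syntax; every label, property and file name
# here is a hypothetical placeholder.
CREATE_CONSTRAINT = "CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE"

LOAD_NODES = """
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row
CREATE (:Person {id: row.id, name: row.name})
""".strip()

# MATCH both endpoints (fast thanks to the constraint's index), then CREATE
# only the relationship, so no duplicate nodes appear.
LOAD_RELS = """
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///rels.csv' AS row
MATCH (a:Person {id: row.start_id})
MATCH (b:Person {id: row.end_id})
CREATE (a)-[:KNOWS]->(b)
""".strip()

IMPORT_STEPS = [CREATE_CONSTRAINT, LOAD_NODES, LOAD_RELS]


def run_import(uri, user, password):
    """Run the steps in order, each as its own auto-commit statement."""
    # Imported here so this module loads even without the driver installed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            for statement in IMPORT_STEPS:
                # session.run() is auto-commit, which USING PERIODIC COMMIT
                # requires; consume() waits for the statement to finish.
                session.run(statement).consume()
    finally:
        driver.close()
```

A call such as `run_import("bolt://localhost:7687", "neo4j", "<password>")` would execute it against a local server, with the CSV files placed in Neo4j's import directory.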
Regarding Neo4j performance:
- First of all, read the official Neo4j page on performance tuning: https://neo4j.com/docs/operations-manual/current/performance/
- Set a proper memory configuration for your Windows machine: if necessary, configure the dbms.memory.pagecache.size parameter manually (in the neo4j.conf file).
- Remember: the Java Virtual Machine is not a black box; you can tune its performance specifically for your application (by editing the neo4j-community.vmoptions file). For example, you can set the maximum memory usage of the JVM (the -Xmx parameter), and you can set the -XX:+UseG1GC parameter to use the G1 garbage collector (high performance, suggested by Oracle for production environments) (https://docs.oracle.com/cd/E40972_01/doc.70/e40973/cnf_jvmgc.htm#autoId0)
Here are the custom lines from my neo4j.conf (just for reference; they may be the wrong setup for your application, so beware):
dbms.memory.pagecache.size=3g
dbms.jvm.additional=-XX:+UseG1GC
dbms.jvm.additional=-XX:-OmitStackTraceInFastThrow
dbms.jvm.additional=-XX:+AlwaysPreTouch
dbms.jvm.additional=-XX:+UnlockExperimentalVMOptions
dbms.jvm.additional=-XX:+TrustFinalNonStaticFields
dbms.jvm.additional=-XX:+DisableExplicitGC
And my neo4j-community.vmoptions custom lines (again, just for reference):
-Xmx1024m
-XX:+UseG1GC
-XX:-OmitStackTraceInFastThrow
-XX:+AlwaysPreTouch
-XX:+UnlockExperimentalVMOptions
-XX:+TrustFinalNonStaticFields
-XX:+DisableExplicitGC
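The JVM heap (-Xmx) and the page cache (dbms.memory.pagecache.size) are separate memory regions, so both have to fit in physical RAM with room left for the operating system. A minimal sketch of that sanity check follows; it deliberately ignores the JVM's other native memory use, and the helper names are invented for illustration.

```python
def parse_size_mb(text):
    """Convert a size such as '3g' or '1024m' to megabytes."""
    units = {"k": 1 / 1024, "m": 1, "g": 1024}
    text = text.strip().lower()
    return float(text[:-1]) * units[text[-1]]


def os_headroom_mb(total_ram, heap_max, page_cache):
    """Memory left for the operating system after heap and page cache."""
    return (parse_size_mb(total_ram)
            - parse_size_mb(heap_max)
            - parse_size_mb(page_cache))


if __name__ == "__main__":
    # Values from the configuration above on a machine with 8 GB of RAM.
    print(os_headroom_mb("8g", "1024m", "3g"), "MB left for the OS")
```

A negative or very small result would suggest that the two settings are too generous for the machine.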
My test machine is a weak notebook with a dual-core Core i3, 8 GB of RAM, Windows 10 and Neo4j 3.2.1 Community Edition.
I can import 7 million nodes in less than 3 minutes and 3.5 million relationships in less than 5 minutes (no recursive relationships).
On a more capable machine, with a specifically crafted setup, Neo4j can do way better than this. Hope it helps.
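To put those figures next to the "billions" in the question, here is a back-of-the-envelope extrapolation. It assumes the throughput stays constant as the graph grows, which is optimistic, and it treats the "less than" times as exact, so the results are rough orders of magnitude rather than predictions.

```python
def hours_at_measured_rate(target_items, measured_items, measured_seconds):
    """Hours needed for target_items if the measured throughput held steady."""
    items_per_second = measured_items / measured_seconds
    return target_items / items_per_second / 3600


if __name__ == "__main__":
    one_billion = 1_000_000_000
    # 7 million nodes in 3 minutes, 3.5 million relationships in 5 minutes.
    node_hours = hours_at_measured_rate(one_billion, 7_000_000, 3 * 60)
    rel_hours = hours_at_measured_rate(one_billion, 3_500_000, 5 * 60)
    print(f"1 billion nodes: about {node_hours:.1f} hours")
    print(f"1 billion relationships: about {rel_hours:.1f} hours")
```

At these rates one billion nodes would take roughly 7 hours and one billion relationships roughly 24 hours on the notebook described.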