Import billions of nodes and relationships into Neo4j using Batch Import on Windows
Question
I want to insert a few billion nodes and relationships into Neo4j. When I use "LOAD CSV", the browser (Chrome) cancels the import after 30 minutes because working memory is overloaded, even though I have 16GB of RAM.
Large datasets can apparently be imported into Neo4j using the Batch Importer (Documentation & Download, Explanation for Linux).
To simply use it (no source/git/maven required):
1. download 2.2 zip
2. unzip
3. run import.sh test.db nodes.csv rels.csv (on Windows: import.bat)
4. after the import, point your /path/to/neo4j/conf/neo4j-server.properties to this test.db directory, or copy the data over to your server: cp -r test.db/* /path/to/neo4j/data/graph.db/
You provide one tab-separated CSV file for nodes and one for relationships (optionally more for indexes).
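As a rough illustration of the input described above, the following Python sketch writes one tab-separated file for nodes and one for relationships. The column names (name, age, start, end, type) and the sample rows are hypothetical placeholders, not taken from the importer's documentation; its README defines the exact header conventions.

```python
import csv
from pathlib import Path


def write_tsv(path, header, rows):
    """Write rows to a tab-separated file with a single header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)


def write_sample_files(folder):
    """Create nodes.csv and rels.csv in folder; all names and values are made up."""
    folder = Path(folder)
    nodes_path = folder / "nodes.csv"
    rels_path = folder / "rels.csv"
    write_tsv(nodes_path, ["name", "age"], [["Alice", 34], ["Bob", 41]])
    write_tsv(rels_path, ["start", "end", "type"], [[0, 1, "KNOWS"]])
    return nodes_path, rels_path
```

Calling write_sample_files on an existing folder produces two small files that can be opened in a text editor to check that the delimiter really is a tab and not a comma.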
I am struggling to use the plugin on Windows. In the Linux video by Rik Van Bruggen (link above), he mentions "installation of the batch importer".
I unzipped the file from "download 2.2 zip". My CSV files are in another folder. How do I use the "import.bat" command mentioned in the documentation on Windows? In cmd, the command can't be found...
Recommended answer
Before using the tool for gigantic datasets, I can suggest a few things I just learned while importing millions of nodes in a few minutes (Neo4j Community Edition for Windows).
Neo4j import tips:
Don't use the web interface to import datasets this big; memory overload is inevitable.
Instead, use a programming language to interact with Neo4j (I recently used the official Python module, which is simple to learn, but you can do the same with good old Java).
Before using LOAD CSV, remember to write the USING PERIODIC COMMIT instruction so that large data sets are committed in batches rather than in a single transaction.
Before importing relationships from CSV, remember to use CREATE CONSTRAINT ON <...> ASSERT <...> IS UNIQUE for the key properties of your labels. It has a huge impact on relationship creation, because the constraint also creates an index that speeds up node lookups.
Use MATCH(...), not CREATE(...), in the relationship procedure to look up the nodes being connected. This avoids duplicates.
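The tips above can be sketched with the official Python driver roughly as follows. The Person label, the id and name properties, the KNOWS relationship type, the file names, the batch size of 10000, and the connection details are all hypothetical. The Cypher uses Neo4j 3.x syntax, matching the version mentioned in this answer. Using MERGE for the relationship itself is an assumption of this sketch; the tip only mentions MATCH for the node lookup.

```python
# Cypher statements in Neo4j 3.x syntax. Label, properties, relationship type
# and file names are hypothetical placeholders.
CONSTRAINT = "CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE"

LOAD_NODES = (
    "USING PERIODIC COMMIT 10000 "
    "LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row "
    "CREATE (:Person {id: row.id, name: row.name})"
)

LOAD_RELS = (
    "USING PERIODIC COMMIT 10000 "
    "LOAD CSV WITH HEADERS FROM 'file:///rels.csv' AS row "
    "MATCH (a:Person {id: row.start}) "
    "MATCH (b:Person {id: row.end}) "
    "MERGE (a)-[:KNOWS]->(b)"
)

IMPORT_STEPS = [CONSTRAINT, LOAD_NODES, LOAD_RELS]


def run_import(session):
    """Run each step in order; session.run executes an auto-commit transaction."""
    for statement in IMPORT_STEPS:
        session.run(statement).consume()  # wait until the step has finished


def connect_and_import(uri, user, password):
    """Open a driver session and run the import; arguments are your own settings."""
    # Imported lazily so the statements above can be inspected without the driver.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            run_import(session)
    finally:
        driver.close()
```

A call such as connect_and_import("bolt://localhost:7687", "neo4j", "password") would run the three steps against a local server; the URI and credentials here are assumptions. USING PERIODIC COMMIT only works in auto-commit transactions, which is why the sketch calls session.run directly instead of opening an explicit transaction.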
Neo4j performance:
First of all, read the official Neo4j performance tuning page: https://neo4j.com/docs/operations-manual/current/performance/
Set a proper memory configuration for your Windows machine: if necessary, manually configure the dbms.memory.pagecache.size parameter (in the neo4j.conf file).
Remember: the Java Virtual Machine is not a black box; you can tune its performance specifically for your application (by editing the neo4j-community.vmoptions file). For example, you can set the maximum memory usage for the JVM (the -Xmx parameter), and you can also set the -XX:+UseG1GC parameter to use the G1 Garbage Collector (high performance, suggested by Oracle for production environments) (https://docs.oracle.com/cd/E40972_01/doc.70/e40973/cnf_jvmgc.htm#autoId0)
Here are the custom neo4j.conf lines from my configuration (just for reference; this may be the wrong setup for your application, so beware):
dbms.memory.pagecache.size=3g
dbms.jvm.additional=-XX:+UseG1GC
dbms.jvm.additional=-XX:-OmitStackTraceInFastThrow
dbms.jvm.additional=-XX:+AlwaysPreTouch
dbms.jvm.additional=-XX:+UnlockExperimentalVMOptions
dbms.jvm.additional=-XX:+TrustFinalNonStaticFields
dbms.jvm.additional=-XX:+DisableExplicitGC
And my custom neo4j-community.vmoptions lines (again, just for reference):
-Xmx1024m
-XX:+UseG1GC
-XX:-OmitStackTraceInFastThrow
-XX:+AlwaysPreTouch
-XX:+UnlockExperimentalVMOptions
-XX:+TrustFinalNonStaticFields
-XX:+DisableExplicitGC
My test machine is a weak notebook with a Core i3 (dual core), 8GB of RAM, Windows 10, and Neo4j 3.2.1 Community Edition.
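As a quick sanity check on the settings above, this sketch adds up the page cache (3g in neo4j.conf) and the JVM heap limit (-Xmx1024m) and compares them with the 8GB of RAM on the test machine. The helper names, and the idea of treating the remainder as headroom for Windows and other processes, are illustrative assumptions rather than an official sizing rule.

```python
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '3g' or '1024m' to bytes."""
    text = text.strip().lower()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)  # plain number of bytes


def remaining_memory(total, page_cache, heap):
    """Bytes left after reserving the page cache and the JVM heap."""
    return parse_size(total) - parse_size(page_cache) - parse_size(heap)


left = remaining_memory("8g", "3g", "1024m")
print(left / 1024 ** 3)  # 4.0 (GiB left for the OS and everything else)
```

If the result is close to zero or negative for your own values, the page cache or heap setting is probably too large for the machine.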
I can import 7 million nodes in less than 3 minutes and 3.5 million relationships in less than 5 minutes (no recursive relationships).
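To put these figures next to the billions of records in the question, here is a back-of-the-envelope calculation. It treats the quoted times as upper bounds and assumes throughput stays constant as the data grows, which is an optimistic simplification; the one-billion target is only an illustrative value.

```python
def rate_per_second(items, minutes):
    """Average throughput given a count and a duration in minutes."""
    return items / (minutes * 60)


def hours_needed(items, rate):
    """Hours to process items at a constant rate (items per second)."""
    return items / rate / 3600


node_rate = rate_per_second(7_000_000, 3)  # roughly 38,900 nodes per second
rel_rate = rate_per_second(3_500_000, 5)   # roughly 11,700 relationships per second

print(round(hours_needed(1_000_000_000, node_rate), 1))  # 7.1 hours per billion nodes
print(round(hours_needed(1_000_000_000, rel_rate), 1))   # 23.8 hours per billion relationships
```

Even under this optimistic assumption, a billion relationships would take about a day on this laptop, so hardware and configuration matter a great deal at that scale.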
On a more capable machine with a specifically crafted setup, Neo4j can do WAY better than this. Hope it helps.