Load a large csv file into neo4j


Problem Description

I want to load a csv file, rels.csv, that contains relationships between Wikipedia categories (4 million relations between categories). I tried to modify the settings file by changing the following parameter values:

dbms.memory.heap.initial_size=8G
dbms.memory.heap.max_size=8G
dbms.memory.pagecache.size=9G

My query is as follows:

USING PERIODIC COMMIT 10000
LOAD CSV FROM
"https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/rels.csv?raw=true" AS row
    MATCH (from:Category { catId: row[0]})
    MATCH (to:Category { catId: row[1]})
    CREATE (from)-[:SUBCAT_OF]->(to)

Moreover, I created indexes on catId and catName. Despite all these optimizations, the query is still running (since yesterday).

Can you tell me whether there are more optimizations that should be done to load this CSV file?

Solution

It's taking too much time. 4 million relationships should take a few minutes, if not seconds.

I just loaded all the data from the link you shared in 321 seconds (Cats-90 and Rels-231), with less than half of your memory settings, on my personal laptop:

dbms.memory.heap.initial_size=1G
dbms.memory.heap.max_size=4G
dbms.memory.pagecache.size=1512m

And this is not the limit; it can be improved further.

Slightly modified query: increased LIMIT 10 times

USING PERIODIC COMMIT 100000
LOAD CSV FROM
"https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/rels.csv?raw=true" AS row
    MATCH (from:Category { catId: row[0]})
    MATCH (to:Category { catId: row[1]})
    CREATE (from)-[:SUBCAT_OF]->(to)
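Both MATCH clauses assume the :Category nodes already exist. A minimal sketch of how they might have been loaded from the companion cats.csv in the same dataset is shown below; the exact file name and the column order (catId, catName) are assumptions, not something shown in the original post.

// Hypothetical node load: file name and column order (catId, catName) are assumed
USING PERIODIC COMMIT 100000
LOAD CSV FROM
"https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/cats.csv?raw=true" AS row
    CREATE (:Category { catId: row[0], catName: row[1]})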

Some suggestions:

  1. Create an index on the fields that are used to look up nodes. (There is no need to index other fields while loading the data; that can be done later, and it would consume unnecessary memory.) A sketch of the index statement follows this list.

  2. Don't set the max heap size to the full system RAM. Set it to 50% of the RAM.

  3. Increase LIMIT: increasing the Heap (RAM) will not improve performance unless it is actually used. When you set LIMIT to 10,000, most of the heap will be free. I was able to load the data with a limit of 100,000 and a 4G heap. You can set 200,000 or more. If it causes any issues, try decreasing it.
  4. IMPORTANT: Make sure you restart Neo4j after changing/setting the configuration (if not done already).
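As a concrete illustration of suggestion 1, the index on the lookup property could be created as below. This is a sketch using the legacy syntax that matches the USING PERIODIC COMMIT era of the queries above (Neo4j 3.x); it is an assumption about the version in use.

// Index the property used by the MATCH clauses (legacy 3.x syntax assumed)
// Neo4j 4.x+ equivalent: CREATE INDEX FOR (c:Category) ON (c.catId)
CREATE INDEX ON :Category(catId)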

Don't forget to delete the previous data the next time you run the LOAD CSV query, as it would create duplicates.
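One way to clear the relationships created by an earlier run before reloading is sketched below (a minimal sketch; for very large relationship counts a batched delete would be gentler on the heap).

// Remove SUBCAT_OF relationships from a previous run to avoid duplicates
MATCH (:Category)-[r:SUBCAT_OF]->(:Category)
DELETE r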

NOTE: I downloaded the files to my laptop and used those, so there is no download time.
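Since the files were used locally, the load can also read from Neo4j's import directory instead of the GitHub URL. A sketch assuming rels.csv was copied into the directory configured by dbms.directories.import:

// Assumes rels.csv sits in the Neo4j import directory
USING PERIODIC COMMIT 100000
LOAD CSV FROM "file:///rels.csv" AS row
    MATCH (from:Category { catId: row[0]})
    MATCH (to:Category { catId: row[1]})
    CREATE (from)-[:SUBCAT_OF]->(to)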
