How to import a social network relation CSV (dict of lists) file into the neo4j graph database?


Question

I know how to import a CSV file into a neo4j graph database, but all the examples I have found use a fixed number of columns, like this:

id1,id2,id3,id4,id5
id2,id2,id3,id4,id5
id3,id2,id3,id4,id5

But I have a CSV file with a variable number of columns that describes the relations between people. It looks like this:

id1,id2,id3,id4,id5
id2,id2,id3,id4,id5,id6,id7
id3,id2,id3

This means that person id1 follows id2, id3, id4 and id5, and person id2 follows id2, id3, id4, id5, id6 and id7.

The file is huge (about 6 GB). How should I import it into neo4j?

Recommended answer

Here are some hints on how to import using the Cypher LOAD CSV clause. To handle truly large data import tasks, you may want to look at the neo4j-import tool instead.
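If you go the bulk-import route, the variable-width file probably needs to be flattened into fixed-column files first. The following is only a minimal sketch of that preprocessing step: the function name `flatten` is hypothetical, and the header names (`id:ID`, `:START_ID`, `:END_ID`) follow the convention I understand the bulk importer to use, so check them against the documentation for your neo4j version. It streams the input row by row, so the 6 GB file never has to fit in memory.

```python
import csv
import io

def flatten(src, nodes_out, rels_out):
    """Stream a variable-width follower file into two fixed-column CSV files.

    Assumes one row per person, with that person's ID in the first column.
    """
    nodes = csv.writer(nodes_out, lineterminator="\n")
    rels = csv.writer(rels_out, lineterminator="\n")
    nodes.writerow(["id:ID"])                # assumed node header
    rels.writerow([":START_ID", ":END_ID"])  # assumed relationship header
    for row in csv.reader(src):
        # Drop empty cells and stray spaces around IDs.
        cells = [cell.strip() for cell in row if cell.strip()]
        if not cells:
            continue
        nodes.writerow([cells[0]])
        for followed in cells[1:]:
            rels.writerow([cells[0], followed])

# Small in-memory demonstration; real use would pass open file objects.
src = io.StringIO("id1,id2,id3\nid3\n")
nodes_out, rels_out = io.StringIO(), io.StringIO()
flatten(src, nodes_out, rels_out)
print(nodes_out.getvalue())
print(rels_out.getvalue())
```

With real files you would open the source and the two outputs with `newline=""` and pass them in the same way.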

Handling varying numbers of columns is not a problem, since you can treat each CSV file row as a collection of items.
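To make the "row as a collection" idea concrete, here is a small Python sketch of the same split outside of Cypher: the first item is the follower and everything after it is the list of followed IDs. The helper name `split_row` is made up for illustration, and stripping whitespace is an assumption in case the real file has stray spaces after commas.

```python
import csv
import io

def split_row(row):
    """Return (follower_id, followed_ids) for one CSV row of any width."""
    cells = [cell.strip() for cell in row if cell.strip()]
    if not cells:
        return None, []
    return cells[0], cells[1:]

sample = "id1,id2,id3,id4,id5\nid2,id2,id3,id4,id5, id6, id7\nid3,id2,id3\n"
parsed = [split_row(row) for row in csv.reader(io.StringIO(sample))]
for follower, followed in parsed:
    print(follower, "->", followed)
```

A row with a single column simply yields an empty list of followed IDs, so no special case is needed for people who follow nobody.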

You should import your data in two passes through the CSV file. In the first pass, create all the Person nodes. In the second pass, match the appropriate nodes and then create relationships between them. To greatly speed up the second pass, you should first create either an index or a uniqueness constraint (which will create an index for you) for matching Person nodes by ID.

I am assuming that:

  • There is one row in your CSV file per Person, with the first column of each row having that person's unique ID.
  • The row for a Person will have only one column if that person does not follow anyone.
  • Your neo4j model looks something like this:

(p1:Person {id: 123})-[:FOLLOWS]->(p2:Person {id: 234})

First, create a uniqueness constraint:

CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE;

Then, create the Person nodes using the IDs in the first column of your CSV file. We use MERGE to ensure that LOAD does not abort (due to the uniqueness constraint) if there happen to be any duplicate IDs in column 1. If you are sure that there are no duplicate IDs, you can use CREATE instead, which should be faster. To avoid running out of memory, we process and commit 10000 rows at a time:

USING PERIODIC COMMIT 10000
LOAD CSV FROM "file:///varying.csv" AS row
MERGE (:Person {id: row[0]});
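If you want to know whether CREATE would be safe, one option is to scan the first column for duplicates before loading. This is a rough sketch rather than part of the import itself; `find_duplicate_ids` is a hypothetical helper, and it keeps every distinct first-column ID in memory, which I am assuming is acceptable because the number of people is far smaller than the number of relationships.

```python
import csv
import io

def find_duplicate_ids(lines):
    """Return the set of first-column IDs that appear on more than one row."""
    seen = set()
    duplicates = set()
    for row in csv.reader(lines):
        if not row:
            continue
        person_id = row[0].strip()
        if person_id in seen:
            duplicates.add(person_id)
        else:
            seen.add(person_id)
    return duplicates

# Real use would pass an open file object instead of this in-memory sample.
sample = io.StringIO("id1,id2\nid2,id3,id4\nid1,id5\n")
dupes = find_duplicate_ids(sample)
print(sorted(dupes))
```

An empty result suggests CREATE is safe; anything else means MERGE is the right choice.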

Finally, create the relationships between the appropriate Person nodes. This query uses USING INDEX hints to encourage Cypher to take advantage of the index (automatically created by the uniqueness constraint) to quickly find the appropriate Person nodes. Again, to avoid running out of memory, we process 10000 rows at a time:

USING PERIODIC COMMIT 10000
LOAD CSV FROM "file:///varying.csv" AS row
WITH row[0] AS pid1, row[1..] AS followed
UNWIND followed AS pid2
MATCH (p1:Person {id: pid1}), (p2:Person {id: pid2})
USING INDEX p1:Person(id)
USING INDEX p2:Person(id)
MERGE (p1)-[:FOLLOWS]->(p2);
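To see what the two passes produce together, here is a small in-memory simulation in Python. It is only an illustration of the logic, not a replacement for the Cypher: a set stands in for the Person nodes created by MERGE, and the membership test stands in for MATCH. The function name `two_pass_import` and the sample data are made up.

```python
import csv
import io

def two_pass_import(text):
    """Simulate the two LOAD CSV passes on a small CSV string."""
    # Pass 1: one node per first-column ID; a set mimics MERGE (no duplicates).
    people = set()
    for row in csv.reader(io.StringIO(text)):
        if row:
            people.add(row[0].strip())

    # Pass 2: like MATCH, a pair yields a relationship only if both nodes exist.
    follows = set()
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        follower = row[0].strip()
        for cell in row[1:]:
            followed = cell.strip()
            if follower in people and followed in people:
                follows.add((follower, followed))
    return people, follows

sample = "id1,id2,id3\nid2,id2,id3,id9\nid3\n"
people, follows = two_pass_import(sample)
print(sorted(people))
print(sorted(follows))
```

Note that `id9` has no row of its own in the sample, so no relationship to it is created. The Cypher above behaves the same way, because MATCH returns nothing when a node is missing. That is why the assumption of one row per Person matters.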
