Error creating relationships over huge dataset

Problem description

My question is similar to the one linked here:

Creating unique nodes and relationships over a dataset in NEO4J

I have two tables, Entity (Entities.txt) and Relationships (EntitiesRelationships_Updated.txt), which look like the ones below. Both files are inside the import folder of the Neo4j database. What I am trying to do is load the tables using the LOAD CSV command and then create relationships.

In the Entity table below, a PARENTID of 0 means that the ENT_ID does not have a parent. If PARENTID is populated, the entity has a parent. For example, ENT_ID = 3 is the parent of ENT_ID = 4, and ENT_ID = 1 is the parent of ENT_ID = 2.

**Entity Table**
ENT_ID  Name  PARENTID
1       ABC   0
2       DEF   1
3       GHI   0
4       JKG   3

**Relationship Table**
RID  ENT_IDPARENT  ENT_IDCHILD
1    1             2
2    3             5

The Entity table has 2 million records and the Relationship table has about 400K rows.

Each RID has a particular tag associated with it. For example, RID = 1 means the relation is A FATHER_OF B, and RID = 2 means the relation is A MOTHER_OF B. There are 20 such RIDs in total.

Both of these are in txt format.

My first step is to load the entity table. I used the following script:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Entities.txt" AS Entity FIELDTERMINATOR '|'
CREATE (n:Entity{ENT_ID: toInt(Entity.ENT_ID),NAME: Entity.NAME,PARENTID: toInt(Entity.PARENTID)})

This query works fine. It takes about 10 minutes to load 2.8 million records. The next step is to index the records:

CREATE INDEX ON :Entity(PARENTID);
CREATE INDEX ON :Entity(ENT_ID);

These statements run fine as well. Following this, I tried creating the relationships from the Relationship table using a query similar to the one in the link above:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|' 
MATCH (n:A {ENT_IDPARENT : Rships.ENT_IDPARENT})
with Entity, n
MATCH (m:B {ENT_IDCHILD : Rships.ENT_IDCHILD})
with m,n
MERGE (n)-[r:RELATION_OF]->(m);

When I run this, the query keeps running for about an hour and then stops at a particular size (in my case 2.2 GB). I based this query on the link above. It includes the edit from the solution below and still does not work.

I have one more query, shown below (also based on the link above). I run it because I want to create relationships based on the Entity table:

PROFILE
MATCH(Entity)
MATCH (a:Entity {ENT_ID : Entity.ENT_ID})
WITH Entity, a
MATCH (b:Entity {PARENTID : Entity.PARENTID})
WITH a,b
MERGE (a)-[r:PARENT_OF]->(b)

When I try running this query, I get a Java heap space error. Unfortunately, I have not been able to find a solution to either problem.

Could you please advise if I am doing something wrong?

Recommended answer

This query allows you to take advantage of your :Entity(ENT_ID) index:

MATCH (child:Entity)
WHERE child.PARENTID > 0
WITH child.PARENTID AS pid, child
MATCH (parent:Entity {ENT_ID : pid})
MERGE (parent)-[:PARENT_OF]->(child);

Cypher does not use indices when the property value comes from another node. To get around that, the above query uses a WITH clause to expose child.PARENTID as a variable (pid), so that the lookup of parent can go through the :Entity(ENT_ID) index. With one index lookup per child, the time complexity of this query should be O(N). Your original query effectively compares every node against every other node, so its complexity is O(N * N).
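
One way to check whether the index is actually picked up on a given Neo4j version is to look at the query plan before running anything. The sketch below simply prefixes the query above with EXPLAIN, which shows the plan without executing the query. The operator names mentioned in the comments are what recent planners typically print and may differ between versions, so treat them as a rough guide rather than a guarantee.

```cypher
// EXPLAIN compiles the query and returns its plan; no data is read or written.
// A NodeIndexSeek on :Entity(ENT_ID) in the plan suggests the index is used
// for the parent lookup. A NodeByLabelScan followed by a Filter for parent
// suggests every :Entity node is scanned for each child instead.
EXPLAIN
MATCH (child:Entity)
WHERE child.PARENTID > 0
WITH child.PARENTID AS pid, child
MATCH (parent:Entity {ENT_ID : pid})
MERGE (parent)-[:PARENT_OF]->(child);
```

Replacing EXPLAIN with PROFILE runs the query for real and adds row counts and database hits per operator, so it is better tried on a small sample first.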

If the above query takes too long or encounters errors that might be related to running out of memory, try this variant, which creates 1000 new relationships at a time. You can change 1000 to any number that is workable for you.

MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT 1000
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);

The WHERE clause filters out child nodes that already have a parent relationship. And the MERGE operation has been changed to a simpler CREATE operation, since we have already ascertained that the relationship does not yet exist. The query returns a count of the number of relationships created. If the result is less than 1000, then all parent relationships have been created.

Finally, to automate the repeated queries, you can install the APOC plugin on the Neo4j server and use the apoc.periodic.commit procedure, which repeatedly invokes a query until it returns 0. In this example, I use a limit parameter of 10000:

CALL apoc.periodic.commit(
 "MATCH (child:Entity)
  WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
  WITH child.PARENTID AS pid, child
  LIMIT {limit}
  MATCH (parent:Entity {ENT_ID : pid})
  CREATE (parent)-[:PARENT_OF]->(child)
  RETURN COUNT(*);",
{limit: 10000});
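
The queries above cover the parent relationships derived from the Entity table. The same indexed-lookup idea may also carry over to the relationship file from the question. What follows is only a sketch under several assumptions that the question does not confirm: both ends of every row are :Entity nodes identified by ENT_ID, the file has a header line with the columns RID, ENT_IDPARENT and ENT_IDCHILD, and keeping RID as a relationship property is acceptable. The property name RID on the relationship is a hypothetical choice, not something from the original post.

```cypher
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS row FIELDTERMINATOR '|'
// LOAD CSV yields every field as a string, so convert to integers first;
// otherwise the values never equal the integer ENT_ID stored on the nodes.
// (toInt matches the syntax used in the question; newer Neo4j versions
// call this function toInteger.)
WITH toInt(row.ENT_IDPARENT) AS parentId,
     toInt(row.ENT_IDCHILD) AS childId,
     toInt(row.RID) AS rid
// Both lookups use plain variables, so the :Entity(ENT_ID) index can apply.
MATCH (parent:Entity {ENT_ID : parentId})
MATCH (child:Entity {ENT_ID : childId})
// Assumption: the RID is kept as a property on a single relationship type.
MERGE (parent)-[:RELATION_OF {RID : rid}]->(child);
```

Rows whose parent or child ID has no matching :Entity node are silently skipped by the MATCH clauses, so comparing the number of relationships created with the number of rows in the file is a reasonable sanity check.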
