Neo4j Cypher:快速找到最大的断开子图 [英] Neo4j Cypher: Finding the largest disconnected subgraph fast

查看:301
本文介绍了Neo4j Cypher:快速找到最大的断开子图的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个具有一百万个节点的图.其中有许多断开连接的子图.我想知道最大的断开子图是什么.

I have a graph with one million nodes. There are many disconnected subgraphs within it. I would like to know what is the largest disconnected subgraph.

例如,在此图形示例中,我们获得了三个断开连接的子图,因此在这种情况下,输出将为7.

For instance this in this graph example we got three disconnected subgraph, so for this case the output will be 7.

我尝试过,但是要花很长时间,

I tried this but it is taking a long time,

match p = ()-[*]-() return MAX(length(p)) as l order by l desc limit 1

推荐答案

您的查询将仅返回两个单独节点之间的最长路径,而不返回最大连接子图的大小.

Your query will only ever return the longest path between two separate nodes, not the size of the largest connected subgraph.

不幸的是,Neo4j当前不支持子图操作,并且我也不认为APOC程序在这里也有任何功能.

Unfortunately Neo4j does not currently have any native support for subgraph operations, and I don't think APOC Procedures has anything here either.

Cypher中有多种查找子图的方法,但是我能想到的查询不是快速或高效的,并且对于大图来说可能会超时.这是一遍又一遍,不建议您这样做,它可能会超时,但是,如果可行,真棒:

There are ways in Cypher to find subgraphs, but the queries I can think of are not fast or performant, and are likely to time out with large graphs. Here's one, and again, this is not recommended, it is likely to time out for you, but if it works, awesome:

MATCH (n)-[*0..]-(subgraphNode)
WITH n, COUNT(DISTINCT subgraphNode) as subSize
RETURN MAX(subSize)

如果要经常执行查询,而不是一次执行,那么建议您使用一种跟踪子图的方法.

If this is to be a query run often, or every so often, instead of only once, then I'd recommend a means of tracking your subgraphs.

虽然我可以提供一种创建子图跟踪的方法,但是在整个图操作(用于合并子图,划分为较小的子图或创建新的子图的操作)中保持此更新的方法注定会比较棘手,并且您很可能会需要某种Java扩展来执行事务后处理以保持这一状态.

While I can give an approach to creating subgraph tracking, the approach for keeping this updated across graph operations (those that merge subgraphs, divide into smaller subgraphs, or create new subgraphs) is bound to be trickier, and you'll likely need some kind of Java extension to perform post-transaction processing to maintain this.

此外,这种方法最好在维护窗口中不进行写操作的情况下完成.

Also, this approach is best done during a maintenance window when no write operations are occurring.

此方法的最终目标是将一个:Subgraph节点附加到每个断开连接的子图上,这将使以后对子图进行的操作变得更加容易,包括您找到最大的断开子图的情况.

The end-goal for this is to attach a single :Subgraph node to every disconnected subgraph, which will make future operations on subgraphs much easier, including your case of finding the largest disconnected subgraph.

实现该目标的总体方法是先标记图形中的所有节点(使用:Unprocessed之类的标签),然后在对:Unprocessed节点的批量查询中,找到它们所属的整个断开连接的子图,并附加单个:Subgraph节点,然后从子图中删除:Unprocessed标签.

The overall approach to fulfilling that goal is to first label all nodes in your graph (with a label like :Unprocessed), then, in batched queries for :Unprocessed nodes, find the entire disconnected subgraph they are a part of, attach a single :Subgraph node to it, and then remove the :Unprocessed label from the subgraph.

因此,首先,标记数据库中的所有节点:

So, first, label all nodes in your db:

MATCH (n)
SET n:Unprocessed

接下来,进行批处理操作.您将要使用APOC程序来进行批处理(这也将利用整个子图从:Unprocessed标签中删除的优势,因为我们在处理它们时...我们不想对子图进行多余的操作).

Next, the batch operation. You'll want to use APOC Procedures to allow batch processing (which will also take advantage of entire subgraphs being removed from the :Unprocessed label as we process them...we don't want to redundantly perform operations on subgraphs).

CALL apoc.periodic.commit("
// only process a batch of :Unproccessed nodes at a time
MATCH (n:Unprocessed)
WITH n LIMIT {limit}
// subgraphNode will be all nodes in the subgraph including n
MATCH (n)-[*0..]-(subgraphNode)
WITH DISTINCT n, subgraphNode
REMOVE subgraphNode:Unprocessed
// find attach point node in each subgraph with smallest id
WITH n, min(id(subgraphNode)) as attachId
WITH DISTINCT attachId
MATCH (attachNode)
WHERE id(attachNode) = attachId
CREATE (attachNode)<-[:SUBGRAPH]-(:Subgraph)
RETURN count(*)
",{limit:100})

您可以根据需要调整限制.较低的限制实际上可能会更好,因为这样可以减少相同子图的节点上的冗余操作.

You can adjust your limit as necessary. A lower limit might actually work better, as this may reduce redundant operations on nodes of the same subgraph.

现在所有断开连接的子图都附加了:Subgraph节点,您可以对每个子图进行更快,更轻松的查询.因此,要找到最大的断开子图,可以使用:

Now that all disconnected subgraphs have a :Subgraph node attached, you can make faster and easier queries for each subgraph. So, to find the largest disconnected subgraph, you might use:

MATCH (sub:Subgraph)-[*]-(subgraphNode)
WITH sub, COUNT(DISTINCT subgraphNode) as subSize
RETURN MAX(subSize)

编辑

与使用可变关系匹配相比,我发现了一种收集子图节点的更快方法.使用NODE_GLOBAL唯一性的APOC的Path Expander功能应该执行得更快.这是经过修改以使用此方法的相关查询.

I found a faster means of gathering subgraph nodes compared to using a variable relationship match. APOC's Path Expander functionality, using NODE_GLOBAL uniqueness, should perform faster. Here are the relevant queries modified to use this approach.

CALL apoc.periodic.commit("
// only process a batch of :Unproccessed nodes at a time
MATCH (n:Unprocessed)
WITH n LIMIT {limit}
// subgraphNode will be all nodes in the subgraph including n
CALL apoc.path.expandConfig(n,{bfs:true, uniqueness:"NODE_GLOBAL"}) 
  YIELD path
WITH n, LAST(NODES(path)) as subgraphNode
REMOVE subgraphNode:Unprocessed
// find attach point node in each subgraph with smallest id
WITH n, min(id(subgraphNode)) as attachId
WITH DISTINCT attachId
MATCH (attachNode)
WHERE id(attachNode) = attachId
CREATE (attachNode)<-[:SUBGRAPH]-(:Subgraph)
RETURN count(*)
",{limit:100})

每个子图的处理:

MATCH (sub:Subgraph)
CALL apoc.path.expandConfig(sub,{minLevel:1, bfs:true, uniqueness:"NODE_GLOBAL"}) 
  YIELD path
WITH sub, LAST(NODES(path)) as subgraphNode
WITH sub, COUNT(DISTINCT subgraphNode) as subSize
RETURN MAX(subSize)

这篇关于Neo4j Cypher:快速找到最大的断开子图的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆