给定节点A,在Neo4j中找到A的子图中具有线性时间复杂度的所有节点 [英] Given node A find all nodes in A's subgraph with linear time complexity in Neo4j

查看:62
本文介绍了给定节点A,在Neo4j中找到A的子图中具有线性时间复杂度的所有节点的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在评估Neo4j用于生产环境的过程,在做一些我认为简单的事情时遇到了一些困难.我设法解决了这个问题,但是采用了次优且非常复杂的方式,因此我在考虑是否可以有一种更简单的方法来完成同一件事.

I am in the process of evaluating Neo4j for use in a production environment and I have encountered some difficulty when doing something that I expected to be simple. I have managed to solve it, but in a suboptimal and quite complicated way so I was thinking if there might be a simpler way to accomplish the same thing.

序言

Neo4j版本2.3.2

Neo4j version 2.3.2

TL; DR

这是一个很长的解释,因此摘要如下:

This is a bit of a long explanation so the summary is as follows:

给定节点A,我需要找到A的子图中所有复杂度为O(number_of_vertices + number_of_edges)的节点.

Given node A, I need to find all nodes in the subgraph of A with a complexity of O(number_of_vertices + number_of_edges).

问题

对于我们的用例,我们有一个图,针对一种特定类型的关系,该图被分解为较小的不连续子图,每个子图不超过几十个节点.我们要完成的工作是,给定这些节点之一的索引ID,以发现子图中的所有节点(同时将图视为无向).另一个特点是我们的节点始终具有从每个节点回到自身的反射边缘.

For our use case we have a graph that, with respect to one particular type of relationship, is broken up into smaller disconnected subgraphs of no more than a couple of dozen nodes each. What we are trying to accomplish is, given an indexed id from one of those nodes discover all nodes in the subgraph (while treating the graph as being undirected). The other peculiarity is that our nodes always have the reflexive edge from every node back to itself.

从算法上来说,我们需要的是广度优先搜索,预期复杂度为O(num_of_vertices + num_of_edges).由于我们的图不是稠密的,因此图中的边数与顶点数大致成线性关系,因此总体复杂度应与其中的顶点数成线性关系.

Algorithmically all we need is a breadth first search with an expected complexity of O(num_of_vertices + num_of_edges). Since our graphs are not dense the number of edges in them is roughly linear to the number of number of vertices so the overall complexity should be linear to the number of vertices in it.

测试图

为简单起见,我已将此测试图制作为完全连接的图.既然要比较密码查询就不会影响结果.

For simplicity I have made this test graph a fully connected one. Since the point is to compare cypher queries this does not affect the results.

命令:

  • 创建(:Label {id:1}),(:Label {id:2}),(:Label {id:3}),(:Label {id:4})
  • 匹配(a),(b)合并(a)-[:REL]->(b)

简单查询

我尝试获得所需结果的第一个查询如下:

The first query I tried to get the desired results is the following:

  • MATCH(a:Label {id:1})-[:REL * 1 ..]-(b:Label)RETURN DISTINCT b

该查询从未终止.当我为关系模式添加上限并分析查询时,得到以下信息:

That query never terminated. When I added an upper bound for the relationship pattern and profiled the query I got the following:

  • 深度4:3902个数据库访问
  • 深度8:1714982数据库访问

因此,与其发现顶点和边的数量不是线性的,倒不如寻找所有可能的路径,这些路径当然会随着深度而爆炸.

So instead of being linear to the number of vertices and edges this looks like it's finding all possible paths which of course explodes with depth.

性能更好的查询

为此,我改写了以下查询:

To accomplish this I wrote the following query instead:

https://gist.github.com/Dalamar42/1ec93cd74b01c145e7bd

(这将搜索到深度2.重复第6-16行以搜索到深度4、6、8,依此类推)

(This will search to depth 2. Duplicate lines 6-16 to search to depth 4, 6, 8, and so on)

查询执行以下操作:

  1. 获取入口节点
  2. 将该节点添加到 nodes_found 和另一个 nodes_to_visit
  3. 的集合中
  4. 对于 nodes_to_visit 中的每个节点A,将其边沿移至节点B
  5. 从节点B中仅保留不在 nodes_found 中的节点作为节点C
  6. nodes_to_visit 设置为节点C,并将 nodes_found 设置为其先前值加上节点C
  7. 重复至所需深度
  1. Get the entry node
  2. Add that node to a collection of nodes_found and another of nodes_to_visit
  3. For every node A in nodes_to_visit follow its edges to node B
  4. From nodes B keep only nodes that are not in nodes_found as nodes C
  5. Set nodes_to_visit to nodes C and set nodes_found to its previous value plus nodes C
  6. Repeat to the desired depth

除了一个复杂度外,此查询几乎应具有BFS的复杂度.据我了解,每个中间的MATCH/WHERE都必须匹配至少一个节点,否则cypher会返回空值,而忽略在先前步骤中找到的节点.为此,我将步骤4更改为:

This query should almost have the complexity of BFS except for one complexity. From what I understand every intermediate MATCH/WHERE needs to match at least one node otherwise cypher returns empty disregarding nodes found in the previous steps. I worked around this by changing step 4 to:

从节点B开始,仅将节点A和不在 nodes_found 中的节点保留为节点C"

"From nodes B keep only node A and nodes that are not in nodes_found as nodes C"

由于所有节点都具有反射性边缘,因此节点A始终位于节点B的集合中,并且通过始终保持它,我确保查询的那部分将始终与至少一个节点匹配.

Because all nodes have the reflexive edge, node A will always be in the set of nodes B and by always keeping it I make sure that that part of the query will always match at least one node.

这意味着该查询存在以下问题:

This means this query has the following problems:

  • 我需要定义最大搜索深度
  • 增加深度并不是免费的,因为我每次都在探索A的边缘
  • 对于某些人来说,此查询很麻烦,特别是考虑到这实际上只是必须执行几次的较大查询的一部分

此查询的好处是我的性能会好得多

The upside of this query is that I am getting much better performance

  • 对于深度4的搜索:65个db访问(相比于哑"查询中的3902)
  • 对于深度6的搜索:97 db访问(与哑"查询中的1714982相比)

更好的解决方案? 有谁知道更好/更简单的解决方案?我是否缺少Cypher的某些明显功能?在文档中找不到任何内容.

Better solution? Does anyone know of a better/simpler solution? Am I missing some obvious feature of Cypher? I wasn't able to find anything going through the documentation.

谢谢

推荐答案

[更新]

这是一种与您相似的更简单方法,但是它使用

Here is a simpler approach that is similar to yours, but it uses the COALESCE function to avoid having to artificially add the "entry node" into every "rim node" collection. (By "rim node", I mean a previously unencountered node that was found in the most recent match.)

该查询假定您已在:Label(id)上创建了索引以加快第一个MATCH的速度.顶部仅用于获取入口节点"并初始化"res"(或结果)和"rim"集合.后续各节只是彼此的精确复制,可以重复进行以符合所需的搜索深度.

The query assumes you have created an index on :Label(id) to speed up the first MATCH. The top section is just for getting the "entry node" and initializing the "res" (or result) and "rim" collections. The subsequent sections are just exact copies of each other, and can be repeated to match the desired depth of your search.

深度为2,如此处所示,仅消耗40 db命中.

With a depth of 2, as shown here, only 40 db hits are expended.

注1:给定测试数据,只需要1的深度即可.在这种情况下,只会消耗16个数据库匹配.

注2:每节中的第3个WITH子句用于强制NULL进入空的rim集合.这是因为如果要求UNWIND取消展开空集合,它将中止查询.由于某些原因,将[NULL]而不是[]传递给COALLESCE()不能按预期工作.

Note2: The 3rd WITH clause in each section is there to force a NULL into an empty rim collection. This is because UNWIND will abort the query if it is asked to unwind an empty collection. For some reason, passing [NULL] instead of [] to COALLESCE() does not work as expected.

MATCH (a:Label { id:1 })
USING INDEX a:Label(id)
WITH COLLECT(a) AS res
WITH res, res AS rim

UNWIND rim AS a
OPTIONAL MATCH (a)-[:REL]-(b:Label)
WHERE NOT b IN res
WITH res, COALESCE(COLLECT(DISTINCT b),[]) AS rim
WITH rim, res + rim AS res
WITH res, CASE rim WHEN [] THEN [NULL] ELSE rim END AS rim

UNWIND rim AS a
OPTIONAL MATCH (a)-[:REL]-(b:Label)
WHERE NOT b IN res
WITH res, COALESCE(COLLECT(DISTINCT b),[]) AS rim
WITH rim, res + rim AS res
WITH res, CASE rim WHEN [] THEN [NULL] ELSE rim END AS rim

RETURN res;

这篇关于给定节点A,在Neo4j中找到A的子图中具有线性时间复杂度的所有节点的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆