Neo4j Cypher query to find nodes that are not connected is too slow


Problem description


We have the following Neo4j schema (simplified, but it shows the important point). There are two types of nodes, NODE and VERSION. VERSIONs are connected to NODEs via a VERSION_OF relationship. VERSION nodes have two properties, from and until, that denote the validity timespan; either or both can be NULL (nonexistent in Neo4j terms) to denote unlimited. NODEs can be connected via a HAS_CHILD relationship. These relationships also have two properties, from and until, that denote the validity timespan; again, either or both can be NULL (nonexistent in Neo4j terms) to denote unlimited.

EDIT: The validity dates on VERSION nodes and HAS_CHILD relations are independent (even though the example coincidentally shows them being aligned).

The example shows two NODEs, A and B. A has two VERSIONs, AV1 (valid until 6/30/17) and AV2 (valid from 7/1/17), while B has only one version, BV1, which is unlimited. B is connected to A via a HAS_CHILD relationship that is valid until 6/30/17.
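To make the example concrete, it could be built with something like the following sketch. The `name` property and the ISO-8601 date strings are assumptions for illustration only: the post writes dates as 6/30/17 and does not say how they are stored. ISO strings are used here because they compare correctly as plain text. The `VERSION` label follows the description above (the queries use `ITEM_VERSION`).

```cypher
// Sketch of the example graph; `name` and the date encoding are assumptions.
CREATE (a:NODE {name: 'A'}),
       (b:NODE {name: 'B'}),
       // A has two versions: AV1 valid until 6/30/17, AV2 valid from 7/1/17
       (av1:VERSION {name: 'AV1', until: '2017-06-30'})-[:VERSION_OF]->(a),
       (av2:VERSION {name: 'AV2', from: '2017-07-01'})-[:VERSION_OF]->(a),
       // B has a single unlimited version: neither from nor until is set
       (bv1:VERSION {name: 'BV1'})-[:VERSION_OF]->(b),
       // A is a child of B until 6/30/17
       (b)-[:HAS_CHILD {until: '2017-06-30'}]->(a)
```

Leaving a property out of the map is what "NULL (nonexistent)" means in practice: reading `bv1.from` simply yields null.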

The challenge now is to query the graph for all nodes that aren't a child (i.e. that are root nodes) at one specific moment in time. Given the example above, the query should return just B if the query date is e.g. 6/1/17, but it should return B and A if the query date is e.g. 8/1/17 (because A is no longer a child of B as of 7/1/17).

The current query is roughly as follows:

MATCH (n1:NODE)
OPTIONAL MATCH (n1)<-[c]-(n2:NODE), (n2)<-[:VERSION_OF]-(nv2:ITEM_VERSION)
WHERE (c.from <= {date} <= c.until)
AND (nv2.from <= {date} <= nv2.until)
WITH n1 WHERE c IS NULL 
MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
WHERE nv1.from <= {date} <= nv1.until
RETURN n1, nv1 
ORDER BY toLower(nv1.title) ASC 
SKIP 0 LIMIT 15

This query works reasonably well in general, but it gets slow as hell on large datasets (comparable to real production datasets). With 20-30k NODEs (and about twice the number of VERSIONs) the (real) query takes roughly 500-700 ms on a small Docker container running on Mac OS X, which is acceptable. But with 1.5M NODEs (and about twice the number of VERSIONs) the (real) query takes a little more than 1 minute on a bare-metal server (running nothing other than Neo4j). This is not really acceptable.

Do we have any option to tune this query? Are there better ways to handle the versioning of NODEs (which I doubt is the performance problem here) or the validity of relationships? I know that relationship properties cannot be indexed, so there might be a better schema for handling the validity of these relationships.

Any help or even the slightest hint is greatly appreciated.

EDIT after answer from Michael Hunger:

  1. Percentage of root nodes:

    With the current example data set (1.5M nodes) the result set contains about 2k rows. That's less than 1%.

  2. ITEM_VERSION node in first MATCH:

    We're using the ITEM_VERSION nv2 to filter the result set to ITEM nodes that have no connection to other ITEM nodes at the given date. That means that either no relationship valid for the given date exists, or the connected item has no ITEM_VERSION that is valid for the given date. I'm trying to illustrate this:

    // date 6/1/17
    
    // n1 returned because relationship not valid
    (nv1 ...)->(n1)-[X_HAS_CHILD ...6/30/17]->(n2)<-(nv2 ...)
    
    // n1 not returned because relationship and connected item n2 valid
    (nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...)
    
    // n1 returned because connected item n2 not valid even though relationship is valid
    (nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...6/30/17)
    

  3. No use of relationship-types:

    The problem here is that the software features a user-defined schema and ITEM nodes are connected by custom relationship-types. As we can't have multiple types/labels on a relationship, the only common characteristic of these kinds of relationships is that they all start with X_. That's been left out of the simplified example here. Would searching with the predicate type(r) STARTS WITH 'X_' help here?
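As a rough sketch of what that predicate could look like in the root-node check (with no claim that it is faster): the relationship is matched without a type and then filtered on `type(c)`. The `ITEM` label follows the wording of this edit, and the `$date` parameter form is an assumption for newer Neo4j versions; the queries in this post use the older `{date}` form.

```cypher
// Sketch: treat every incoming relationship whose type starts with X_
// as a potential parent link.
MATCH (n1:ITEM)
OPTIONAL MATCH (n1)<-[c]-(:ITEM)
WHERE type(c) STARTS WITH 'X_'
  AND c.from <= $date <= c.until
// count(c) ignores nulls, so 0 means no valid incoming X_ relationship
WITH n1, count(c) AS validParents
WHERE validParents = 0
RETURN n1
```

Because the type is only inspected after a relationship has been expanded, this variant has to touch every incoming relationship of every node. Listing the known types in the pattern instead (for example `[c:X_HAS_CHILD|X_OTHER]`, where `X_OTHER` is a made-up name) may let the database skip unrelated relationships, but that would have to be measured on the real data.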

Solution

What Neo4j version are you using?

What percentage of your 1.5M nodes will be found as roots at your example date, and if you don't have the limit how much data comes back? Perhaps the issue is not in the match so much as in the sorting at the end?
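One way to answer that question is to run the matching part on its own and only count the result, with no sorting and no paging. The following is a sketch under the simplified schema from the question; `$date` is the newer parameter form (the queries here use `{date}`), and `validParents` and `rootCount` are just illustrative names.

```cypher
// Sketch: count root nodes at $date without ORDER BY / SKIP / LIMIT.
MATCH (n1:NODE)
OPTIONAL MATCH (n1)<-[c:HAS_CHILD]-()
WHERE c.from <= $date <= c.until
// count(c) ignores nulls, so 0 means no valid parent at $date
WITH n1, count(c) AS validParents
WHERE validParents = 0
RETURN count(n1) AS rootCount
```

If this alone already takes most of the minute, the time goes into the match rather than the sort. Prefixing a query with `PROFILE` shows how the work is split between the individual operators.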

I'm not sure why you had the VERSION nodes in your first part; at least you don't describe them as relevant for determining a root node.

You didn't use relationship-types.

MATCH (n1:NODE) // matches 1.5M nodes
// has to do 1.5M * degree optional matches
OPTIONAL MATCH (n1)<-[c:HAS_CHILD]-(n2) WHERE (c.from <= {date} <= c.until)
WITH n1 WHERE c IS NULL
// how many root nodes are left?
// # root nodes * version degree (1..2)
MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
WHERE nv1.from <= {date} <= nv1.until
// has to sort all those
WITH n1, nv1, toLower(nv1.title) as title
RETURN n1, nv1
ORDER BY title ASC 
SKIP 0 LIMIT 15
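A side note on the from/until comparisons: the question says that a missing from or until means "unlimited", but in Cypher a comparison against a missing property evaluates to null, and WHERE discards such rows. If the real data relies on missing properties, the bounds need a fallback. The following sketch uses `coalesce` for that; it assumes the `$date` parameter form of newer Neo4j versions and is a variation on the query above, not a tested replacement.

```cypher
MATCH (n1:NODE)
OPTIONAL MATCH (n1)<-[c:HAS_CHILD]-()
// a missing bound falls back to $date itself, so that side always passes
WHERE coalesce(c.from, $date) <= $date <= coalesce(c.until, $date)
// count(c) ignores nulls, so 0 means no valid parent at $date
WITH n1, count(c) AS validParents
WHERE validParents = 0
MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
WHERE coalesce(nv1.from, $date) <= $date <= coalesce(nv1.until, $date)
RETURN n1, nv1
ORDER BY toLower(nv1.title) ASC
SKIP 0 LIMIT 15
```

The aggregation on `count(c)` plays the same role as `WHERE c IS NULL` above: a node survives only if no incoming HAS_CHILD relationship is valid at the given date.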
