How to return max counts per another node's properties


Question

I need to calculate how many times a composer's pieces of music were performed per decade, then return only the one piece with the most performances per decade.

This cypher does everything except filter all but the highest counts per decade.

MATCH (c:Composer)-[:CREATED_BY]-(w:Work)<-[*..2]-(prog:Program)
WHERE c.lastname =~ '(?i).*stravinsky.*' 
WITH w.title AS Title, prog.title AS Program, LEFT(prog.date, 3)+"0" AS Decade
RETURN Decade, Title, COUNT(Program) AS Total
ORDER BY Decade, Total DESC, Title

I've been banging my head for hours with variations on this but can't find the solution.

Recommended answer

This seems to return what you're looking for but it can probably be improved.

MATCH (c:Composer)-[r:CREATED_BY]-(w:Work)<-[*..2]-(prog:Program)
WHERE c.lastname =~ '(?i).*stravinsky.*'
WITH LEFT(prog.date, 3)+"0" AS Decade, w.title AS Title, COUNT(prog.title) AS Total
ORDER BY Decade, Total DESC, Title
RETURN Decade, HEAD(COLLECT(Total)) AS Total, HEAD(COLLECT(Title)) AS Title
ORDER BY Decade

It only returns one result from each decade but doesn't take ties into account, so it feels a little incomplete to me. I'll think about how to do that and edit if I come up with something good.
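To make the tie problem concrete, here is a minimal Python sketch of what the query above computes. It is a plain in-memory model, not Neo4j code: the `performances` rows and the function names `decade_of` and `top_work_per_decade` are made up for illustration, and dates are assumed to be strings that start with a four-digit year.

```python
from collections import Counter

def decade_of(date):
    # Mirrors LEFT(prog.date, 3) + "0": "1934-05-01" -> "1930"
    return date[:3] + "0"

def top_work_per_decade(performances):
    """performances: iterable of (work_title, program_date) pairs (hypothetical data)."""
    # COUNT(...) grouped by Decade and Title
    totals = Counter((decade_of(date), title) for title, date in performances)
    # ORDER BY Decade, Total DESC, Title
    rows = sorted(totals.items(), key=lambda kv: (kv[0][0], -kv[1], kv[0][1]))
    best = {}
    for (decade, title), total in rows:
        # HEAD(COLLECT(...)) keeps only the first row seen for each decade
        best.setdefault(decade, (total, title))
    return best

# Hypothetical sample data: two works tie in the 1930s
performances = [
    ("Petrushka", "1934-05-01"),
    ("Petrushka", "1936-11-20"),
    ("The Firebird", "1931-02-14"),
    ("The Firebird", "1938-09-03"),
    ("Pulcinella", "1947-01-10"),
]
result = top_work_per_decade(performances)
print(result)
```

In this sample, Petrushka and The Firebird both have two performances in the 1930s, but only Petrushka is kept because it sorts first by title. That is the incompleteness described above.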

I used this string with http://graphgen.neoxygen.io to generate sample data locally.

(c:Composer {firstname: firstName, lastname: lastName} *10)<-[:CREATED_BY *n..1]-(w:Work {title: progLanguage} *75)<-[:PERFORMED *n..1]-(prog:Program {title: catchPhrase, date: date} *400)

EDIT FOR THE WIN

This is the raw version of the above query that will show multiple Works when there are ties.

MATCH (c:Composer)-[r:CREATED_BY]-(w:Work)<-[*..2]-(prog:Program)
WHERE c.lastname =~ '(?i).*stravinsky.*'
WITH LEFT(prog.date, 3)+"0" AS Decade, w.title AS Title, COUNT(prog.title) AS Total
ORDER BY Decade, Total DESC, Title
WITH Decade, Title, Total, HEAD(COLLECT(Total)) AS PerformedTotal
WITH Decade, [title in COLLECT(Title) WHERE Total = PerformedTotal] as Title, Total, PerformedTotal
ORDER BY PerformedTotal DESC
RETURN Decade, HEAD(COLLECT(PerformedTotal)) AS Totals, HEAD(COLLECT(Title)) AS Titles
ORDER BY Decade

I feel like it should be possible to refactor it but I can't seem to simplify it.
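For comparison, the result the tie-aware query is aiming for can be modelled in a few lines of Python. This is only a sketch of the intended output, not a translation of the Cypher and not Neo4j code; the function name `top_works_with_ties` and the sample rows are hypothetical, and dates are again assumed to start with a four-digit year.

```python
from collections import Counter, defaultdict

def top_works_with_ties(performances):
    """performances: iterable of (work_title, program_date) pairs (hypothetical data)."""
    # Count performances per (decade, title); the decade mirrors LEFT(date, 3) + "0"
    totals = Counter((date[:3] + "0", title) for title, date in performances)
    # Group the per-title totals by decade
    by_decade = defaultdict(dict)
    for (decade, title), total in totals.items():
        by_decade[decade][title] = total
    result = {}
    for decade, title_totals in by_decade.items():
        top = max(title_totals.values())
        # Keep every title whose count equals the decade's maximum
        tied = sorted(t for t, n in title_totals.items() if n == top)
        result[decade] = (top, tied)
    return result

# Hypothetical sample data: two works tie in the 1930s
performances = [
    ("Petrushka", "1934-05-01"),
    ("Petrushka", "1936-11-20"),
    ("The Firebird", "1931-02-14"),
    ("The Firebird", "1938-09-03"),
    ("Pulcinella", "1947-01-10"),
]
ties = top_works_with_ties(performances)
print(ties)
```

The two steps that matter are finding the maximum count per decade and then keeping every title that reaches it, which is what the `[title in COLLECT(Title) WHERE Total = PerformedTotal]` expression is used for in the query.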

I have a ton of notes about the process of writing this answer. Even if it's not exactly what you're looking for, here's the TL;DR, because it was still interesting.

  • Get rid of that fuzzy search if you can, find a way to index that property or use an external index like Elasticsearch. You take a massive performance hit when you use that regex.
  • There's a bug in Neo4j 2.2.M02 that makes the query crash if <-[*..2]- is changed to practically anything else. If you set the Cypher Query Planner to Cypher 2.1, performance is best if that very first line is MATCH (c:Composer)-[r:CREATED_BY]-(w)<-[r2:REL_TYPE]-(prog). Only use a label on that first node to help the WHERE do its job. Always always always use node and rel identifiers.
  • Cypher has some surprising behavior. That whole [title in COLLECT(Title) WHERE Total = PerformedTotal] is using variables from later in that same line. If I pull them out, it crashes.
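The first point, about the cost of the regex, can be illustrated outside the database. The sketch below is plain Python, not Neo4j: it contrasts testing every record against the pattern with a single lookup in a prebuilt index on a normalised key. The `composers` records are invented, and it assumes that an exact, case-insensitive match on the last name is an acceptable replacement for the "contains" match; how to store or index such a property in Neo4j is not shown here.

```python
import re

# Hypothetical composer records standing in for (:Composer) nodes
composers = [
    {"firstname": "Igor", "lastname": "Stravinsky"},
    {"firstname": "Soulima", "lastname": "Stravinsky"},
    {"firstname": "Maurice", "lastname": "Ravel"},
]

# Regex approach: every record is tested, like c.lastname =~ '(?i).*stravinsky.*'
pattern = re.compile(r"(?i).*stravinsky.*")
by_regex = [c for c in composers if pattern.fullmatch(c["lastname"])]

# Index approach: build a lookup on a lowercased key once, then match exactly
index = {}
for c in composers:
    index.setdefault(c["lastname"].lower(), []).append(c)
by_index = index.get("stravinsky", [])

print(len(by_regex), len(by_index))
```

The regex version has to visit every record on each query, while the index version pays the scan once when the index is built. The trade-off is that the index only answers exact matches on the normalised value.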

More surprising behavior was that it hasn't been possible to refactor the way I'd expect. I'd expect to do this but can't:

MATCH (c:Composer)-[r:CREATED_BY]-(w:Work)<-[*..2]-(prog:Program)
WHERE c.lastname =~ '(?i).*stravinsky.*'
WITH LEFT(prog.date, 3)+"0" AS Decade, w.title AS Title, COUNT(prog.title) AS Total
ORDER BY Decade, Total DESC, Title
WITH Decade, [title in COLLECT(Title) WHERE Total = HEAD(COLLECT(Total))] as Title, Total, HEAD(COLLECT(Total)) AS PerformedTotal
ORDER BY PerformedTotal DESC
RETURN Decade, HEAD(COLLECT(PerformedTotal)) AS Totals, HEAD(COLLECT(Title)) AS Titles
ORDER BY Decade

ANOTHER EDIT: HOW TO SPEED THINGS UP

If you have a few potential paths your query may take but you want to avoid [*..2], you may be able to speed things up a bit by giving it specifics about the paths it should take when trying to find a match. Whether or not this is faster really depends on how many branches it can take that will be dead ends. If you can give it just two or three paths so it can completely ignore half a dozen other relationships, it will probably offset the filtering and things that happen later on. Of course, if the paths are complicated enough, this might be more trouble than it's worth.

You should pop this into neo4j-shell, prepend PROFILE, add a semicolon to the end, and compare the number of database accesses to determine which version is best for your dataset.

MATCH (c:Composer)-[r:CREATED_BY]-(w)
WHERE c.lastname =~ '(?i).*Denesik.*'
OPTIONAL MATCH (w)-[r2:CONNECTED_TO]-(this_node)<-[r3:ONE_MORE]-(prog1)
OPTIONAL MATCH (w)<-[r4:PERFORMED]-(prog2)
OPTIONAL MATCH (w)-[r5:THIS_REL]->(this_node)-[r6:AGAIN_WITH_THE_RELS]->(prog3)
WITH FILTER(program in [prog1, prog2, prog3] WHERE program IS NOT NULL) AS progarray, w.title AS Title
UNWIND(progarray) as prog
WITH LEFT(prog.date, 3)+"0" AS Decade, COUNT(prog.title) AS Total, Title
ORDER BY Decade, Total DESC, Title
WITH Decade, Title, Total, HEAD(COLLECT(Total)) AS PerformedTotal
WITH Decade, [title in COLLECT(Title) WHERE Total = PerformedTotal] as Title, Total, PerformedTotal
ORDER BY PerformedTotal DESC
RETURN Decade, HEAD(COLLECT(PerformedTotal)) AS Totals, HEAD(COLLECT(Title)) AS Titles
ORDER BY Decade;

The trickiest part of this is that if we reuse the prog variable, it's going to drag the results from each OPTIONAL MATCH into the next one, essentially trying to filter, and we won't get completely separate paths. (Why we're able to reuse w is sort of beyond me right now...) That's OK, though. We take the results, put them into an array, filter the empty results, then unwind it back to a single variable containing all the valid results. After that, we continue as normal.
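The collect, filter, and unwind step can be sketched in plain Python as follows. This is an in-memory model of the idea, not Neo4j code; the function name `merge_optional_matches` and the sample rows are hypothetical, with `None` standing in for an OPTIONAL MATCH that found nothing.

```python
def merge_optional_matches(rows):
    """rows: list of (title, prog1, prog2, prog3); None marks a path that did not match."""
    merged = []
    for title, *programs in rows:
        # FILTER(program IN [prog1, prog2, prog3] WHERE program IS NOT NULL)
        found = [p for p in programs if p is not None]
        # UNWIND: emit one row per surviving program
        for prog in found:
            merged.append((title, prog))
    return merged

# Hypothetical rows, one per work, with up to three optional program matches
rows = [
    ("Petrushka", None, {"date": "1934-05-01"}, None),
    ("Pulcinella", {"date": "1947-01-10"}, None, {"date": "1952-03-08"}),
    ("Agon", None, None, None),
]
merged = merge_optional_matches(rows)
print(merged)
```

A work with no matching path at all, like the last row in this sample, produces an empty list and so contributes no rows after the unwind. Everything after that point can treat `prog` as a single variable again.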

In my tests, it seems like this can be significantly faster with the right dataset. YMMV.
