Neo4j super node issue - fanning out pattern

Question

I'm new to the graph database scene; I'm looking into Neo4j and learning Cypher. We're trying to model a graph database, and it's a fairly simple one: we have users and movies. Users can VIEW movies, RATE movies, and create playlists, and playlists can HAVE movies.

My question is about the supernode performance issue. Here is a quote from a very good book I am currently reading, Learning Neo4j by Rik Van Bruggen:

A very interesting problem then occurs in datasets where some parts of the graph are all connected to the same node. This node, also referred to as a dense node or a supernode, becomes a real problem for graph traversals because the graph database management system will have to evaluate all of the connected relationships to that node in order to determine what the next step will be in the graph traversal.

The solution proposed in the book is to have a Meta node with 100 connections to it, and to link the 101st connection to a new Meta node that is itself linked to the previous Meta node.
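To make the fan-out idea concrete, here is a minimal in-memory sketch in Python. It is not the book's code and it does not touch Neo4j; the `MetaNode` and `FanOut` classes are hypothetical stand-ins, and only the cap of 100 comes from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_LINKS = 100  # cap per meta node, taken from the description above


@dataclass
class MetaNode:
    """Hypothetical stand-in for one meta node in the fan-out chain."""
    members: List[str] = field(default_factory=list)
    previous: Optional["MetaNode"] = None  # link back to the older meta node


class FanOut:
    """Chain of meta nodes hanging off one dense node (e.g. a popular movie)."""

    def __init__(self, max_links: int = MAX_LINKS) -> None:
        self.max_links = max_links
        self.head = MetaNode()  # newest meta node; receives new connections

    def connect(self, user_id: str) -> None:
        # When the newest meta node is full, start a new one linked to it.
        if len(self.head.members) >= self.max_links:
            self.head = MetaNode(previous=self.head)
        self.head.members.append(user_id)

    def meta_nodes(self) -> List[MetaNode]:
        # Walk the chain from the newest meta node back to the oldest.
        nodes, node = [], self.head
        while node is not None:
            nodes.append(node)
            node = node.previous
        return nodes


fan_out = FanOut()
for i in range(101):
    fan_out.connect(f"user-{i}")

chain = fan_out.meta_nodes()
print(len(chain))             # number of meta nodes in the chain
print(len(chain[0].members))  # connections on the newest meta node
```

With 101 connections, the first 100 stay on the original meta node and the 101st opens a new one. A traversal that only needs recent connections can stop at the head of the chain; one that needs all of them still has to walk every meta node.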

I have seen a post on the official Neo4j blog saying that they would fix this problem in the near future (the post is from January 2013): http://neo4j.com/blog/2013-whats-coming-next-in-neo4j/

More precisely:

Another project we have planned around "bigger data" is to add some specific optimizations to handle traversals across densely-connected nodes, having very large numbers (millions) of relationships. (This problem is sometimes referred to as the "supernodes" problem.)

What are your opinions on this issue? Should we go with the Meta node fanning-out pattern, or with the basic relationship that every tutorial seems to be using? Any other suggestions?

Recommended answer

UPDATE - October 2020. This article is the best source on this topic, covering all aspects of super nodes

(my original answer below)

It's a good question. This isn't really an answer, but why shouldn't we be able to discuss this here? Technically I think I'm supposed to flag your question as "primarily opinion based" since you're explicitly soliciting opinions, but I think it's worth the discussion.

The boring but honest answer is that it always depends on your query patterns. Without knowing what kinds of queries you're going to issue against this data structure, there's really no way to know the "best" approach.

Supernodes are a problem in other areas as well. Graph databases can be very difficult to scale in some ways, because the data in them is hard to partition. If this were a relational database, we could partition vertically or horizontally. In a graph DB with supernodes, everything is "close" to everything else. (An Alaskan farmer likes Lady Gaga, and so does a New York banker.) Beyond graph traversal speed, supernodes are a big problem for all sorts of scalability.

Rik's suggestion boils down to encouraging you to create "sub-clusters" or "partitions" of the super-node. For certain query patterns, this might be a good idea, and I'm not knocking the idea, but I think hidden in here is the notion of a clustering strategy. How many meta nodes do you assign? How many max links per meta-node? How did you go about assigning this user to this meta node (and not some other)? Depending on your queries, those questions are going to be very hard to answer, hard to implement correctly, or both.
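One way to make the assignment question concrete is hash bucketing. This is only one possible clustering strategy, not something the book proposes or that Neo4j provides; the helper `meta_node_for`, the user IDs, and the bucket count below are all made up for illustration.

```python
import hashlib
from collections import Counter


def meta_node_for(user_id: str, num_meta_nodes: int) -> int:
    """Pick a meta node index for a user with a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer, then bucket by modulo.
    return int.from_bytes(digest[:8], "big") % num_meta_nodes


NUM_META_NODES = 1000  # assumed bucket count, chosen arbitrarily

# The same user always lands on the same meta node.
bucket = meta_node_for("alaskan-farmer", NUM_META_NODES)

# Spread 100,000 hypothetical fans over the buckets and inspect the load.
load = Counter(meta_node_for(f"user-{i}", NUM_META_NODES) for i in range(100_000))
print(max(load.values()), min(load.values()))
```

The tradeoff depends on the query, as argued above. Checking whether one particular user is linked to the supernode only touches one meta node, but changing the bucket count reassigns most users, and a query such as "the newest relationships first" gets no help from this layout.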

A different (but conceptually very similar) approach is to clone Lady Gaga about a thousand times, duplicate her data and keep it in sync between the nodes, then assert a bunch of "same as" relationships between the clones. This isn't that different from the "meta" approach, but it has the advantage that Lady Gaga's data is copied to each clone, so the "Meta" node isn't just a dumb placeholder for navigation. Most of the same problems apply, though.

Here's a different suggestion, though: you have a large-scale many-to-many mapping problem here. If this is a really huge problem for you, you might be better off breaking it out into a single relational table with two columns (from_id, to_id), each referencing a Neo4j node ID. You would then have a hybrid system that's mostly graph (but with some exceptions). There are lots of tradeoffs here; of course you couldn't traverse that relationship in Cypher at all, but it would scale and partition much better, and querying for a particular relationship would probably be much faster.
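A rough sketch of that two-column table, using Python's built-in `sqlite3` module. The table name `likes`, the index name, the helper functions, and all the IDs are assumptions made up for illustration; in a real hybrid system the IDs would come from the graph, and the relational store would likely not be SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE likes (
        from_id INTEGER NOT NULL,  -- graph node ID of the user (assumed)
        to_id   INTEGER NOT NULL,  -- graph node ID of the liked node (assumed)
        PRIMARY KEY (from_id, to_id)
    )
    """
)
# Secondary index so lookups by the dense side do not scan the whole table.
conn.execute("CREATE INDEX likes_by_target ON likes (to_id)")

LADY_GAGA = 1  # made-up node ID for the supernode
rows = [(user_id, LADY_GAGA) for user_id in range(1000, 1005)]
conn.executemany("INSERT INTO likes (from_id, to_id) VALUES (?, ?)", rows)


def has_rel(conn: sqlite3.Connection, from_id: int, to_id: int) -> bool:
    """Check for one particular relationship via the primary key."""
    row = conn.execute(
        "SELECT 1 FROM likes WHERE from_id = ? AND to_id = ? LIMIT 1",
        (from_id, to_id),
    ).fetchone()
    return row is not None


def fan_in(conn: sqlite3.Connection, to_id: int) -> int:
    """Count how many nodes point at the given node."""
    return conn.execute(
        "SELECT COUNT(*) FROM likes WHERE to_id = ?", (to_id,)
    ).fetchone()[0]


print(has_rel(conn, 1002, LADY_GAGA))
print(has_rel(conn, 42, LADY_GAGA))
print(fan_in(conn, LADY_GAGA))
```

The point lookup in `has_rel` is the "querying for a particular relationship" case; anything that needs to continue traversing from the result has to go back to the graph with the returned IDs.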

One general observation here: whether we're talking about relational, graph, documents, K/V databases, or whatever -- when the databases get really big, and the performance requirements get really intense, it's almost inevitable that people end up with some kind of a hybrid solution with more than one kind of DBMS. This is because of the inescapable reality that all databases are good at some things, and not good at others. So if you need a system that's good at most everything, you're going to have to use more than one kind of database. :)

There is probably quite a bit Neo4j can do to optimize in these cases, but it seems to me that the system would need some kind of hints about access patterns in order to do a really good job of it. Of the 2,000,000 relationships present, how do the endpoints best cluster? Are older relationships more important than newer ones, or vice versa?
