Neo4j sharding aspect


Question

I was looking into the scalability of Neo4j, and read a document written by David Montag in January 2013.

Concerning the sharding aspect, he said the first release of 2014 would come with a first solution.

Does anyone know whether it was done, or what its status is if not?

Thanks!

Recommended answer

Disclosure: I work as VP Product for Neo Technology, the sponsor of the Neo4j open source graph database.

Now that we've just released Neo4j 2.0 (actually 2.0.1 today!), we are embarking on a 2.1 release that is mostly oriented around (even more) performance and scalability. This will raise the upper limits of the graph to an effectively unlimited number of entities, and improve various other things.

Let me set some context first, and then answer your question.

As you probably saw from the paper, Neo4j's current horizontal-scaling architecture allows read scaling, with all writes going to the master and fanning out from there. This gets you effectively unlimited read scaling, and write throughput into the tens of thousands of writes per second.
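As a rough illustration of the layout described above, the sketch below sends every write to a single master and spreads reads across all instances. The class name `ClusterRouter`, the instance names, and the round-robin policy are assumptions made for this example, not part of Neo4j's API; in a real deployment this job is typically done by a load balancer in front of the cluster.

```python
from itertools import cycle


class ClusterRouter:
    """Hypothetical router: writes go to the master, reads rotate over all instances."""

    def __init__(self, master, replicas):
        self.master = master
        # Assumption: reads may be served by any instance, including the master.
        self._read_pool = cycle([master, *replicas])

    def route(self, is_write):
        """Return the name of the instance that should handle the request."""
        if is_write:
            return self.master
        return next(self._read_pool)


router = ClusterRouter("neo4j-1", ["neo4j-2", "neo4j-3"])
print(router.route(is_write=True))                        # always the master
print([router.route(is_write=False) for _ in range(4)])   # reads rotate
```

Adding instances to the read pool raises read capacity, while write capacity stays bounded by what the single master can absorb, which is the trade-off the answer describes.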

Practically speaking, there are production Neo4j customers (including Snap Interactive and Glassdoor) with around a billion people in their social graph... in all cases behind an active and heavily-hit web site, handled by comparatively modest Neo4j clusters (no more than 5 instances). So that's one key feature: the Neo4j of today has an incredible computational density, and so we regularly see fairly small clusters handling substantial production workloads... with very fast response times.

More on the current architecture can be found here: www.neotechnology.com/neo4j-scales-for-the-enterprise/. A list of customers (which includes companies like Wal-Mart and eBay) can be found here: neotechnology.com/customers/. One of the world's largest parcel delivery carriers uses Neo4j to route all of their packages, in real time, with peaks of 3000 routing operations per second and zero downtime. (This is arguably the world's largest and most mission-critical use of a graph database and of a NOSQL database, though unfortunately I can't say who it is.)

So in one sense the tl;dr is that if you're not yet as big as Wal-Mart or eBay, then you're probably ok. That oversimplifies it only a bit. There is the 1% of cases where you have sustained transactional write workloads into the 100s of thousands per second. However, even in those cases it's often not the right thing to load all of that data into the real-time graph. We usually advise people to do some aggregation or filtering, and bring only the more important things into the graph. Intuit gave a good talk about this. They filter a billion B2B transactions into a much smaller number of aggregate monthly transaction relationships, with aggregated counts and currency amounts by direction.
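The kind of pre-aggregation described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout `(payer, payee, day, amount_cents)`, the function name `aggregate_monthly`, and the sample data are all hypothetical and do not reflect Intuit's actual pipeline. Each resulting entry would become one directed relationship in the graph instead of many raw transactions.

```python
from collections import defaultdict
from datetime import date


def aggregate_monthly(transactions):
    """Collapse raw records into one summary per (payer, payee, month).

    Direction is preserved because (payer, payee) and (payee, payer)
    are different keys.
    """
    totals = defaultdict(lambda: {"count": 0, "amount_cents": 0})
    for payer, payee, day, amount_cents in transactions:
        key = (payer, payee, day.strftime("%Y-%m"))
        totals[key]["count"] += 1
        totals[key]["amount_cents"] += amount_cents
    return dict(totals)


# Hypothetical sample data: amounts are integer cents to avoid float rounding.
raw = [
    ("acme", "globex", date(2014, 1, 3), 12000),
    ("acme", "globex", date(2014, 1, 17), 8000),
    ("globex", "acme", date(2014, 1, 20), 500),
    ("acme", "globex", date(2014, 2, 2), 7000),
]

edges = aggregate_monthly(raw)
for (payer, payee, month), summary in sorted(edges.items()):
    print(payer, "->", payee, month, summary)
```

Four raw transactions shrink to three aggregate relationships here; at the scale mentioned in the answer the reduction is what keeps the real-time graph small.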

Enter sharding... Sharding has gained a lot of popularity these days. This is largely thanks to the other three categories of NOSQL, where joins are an anti-pattern. Most queries involve reading or writing just a single piece of discrete data. Just as joining is an anti-pattern for key-value stores and document databases, sharding is an anti-pattern for graph databases. What I mean by that is... the very best performance will occur when all of your data is available in memory on a single instance, because hopping back and forth all over the network whenever you're reading and writing will slow things down significantly, unless you've been really, really smart about how you distribute your data... and even then. Our approach has been twofold:

  1. Do as many smart things as possible in order to support extremely high read & write volumes without having to resort to sharding. This gets you the best and most predictable latency and efficiency. In other words: if we can be good enough to support your requirement without sharding, that will always be the best approach. The link above describes some of these tricks, including the deployment pattern that lets you shard your data in memory without having to shard it on disk (a trick we call cache-sharding). There are other tricks along similar lines, and more coming down the pike...

  2. Add a secondary architecture pattern into Neo4j that does support sharding. Why do this if sharding is best avoided? As more people find more uses for graphs, and data volumes continue to increase, we think it will eventually be an important and inevitable thing. This would allow you to run all of Facebook, for example, in one Neo4j cluster (a pretty huge one)... not just the social part of the graph, which we can handle today. We've already done a lot of work on this, and have developed an architecture that we believe balances the many considerations. This is a multi-year effort, and while we could very easily release a version of Neo4j that shards naively (which would no doubt be really popular), we probably won't do that. We want to do it right, which amounts to rocket science.
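The cache-sharding trick mentioned in item 1 can be pictured as consistent request routing: every instance still holds the full graph on disk, but requests that touch the same part of the graph always reach the same instance, so each instance's in-memory cache warms up on a different subset. Below is a minimal sketch, assuming the router hashes some routing key such as a user id. The function name `pick_instance`, the choice of key, and the instance names are illustrative assumptions, not Neo4j API.

```python
import hashlib


def pick_instance(routing_key, instances):
    """Map a routing key (e.g. a user id) to one instance, deterministically."""
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer and wrap it onto the list.
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]


instances = ["neo4j-1", "neo4j-2", "neo4j-3"]  # hypothetical instance names
for user in ["alice", "bob", "carol"]:
    print(user, "->", pick_instance(user, instances))
```

Because any instance can still answer any query from disk, a misrouted request is only slower, not wrong. Note that plain modulo hashing remaps most keys when the number of instances changes; a consistent-hashing ring would be one way to soften that, at the cost of a little more code.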
