Neo4j sharding aspect


Problem description

I was looking on the scalability of Neo4j, and read a document written by David Montag in January 2013.

Concerning the sharding aspect, he said the 1st release of 2014 would come with a first solution.

Does anyone know if it was done, or what its status is if not?

Thanks!

Solution

Disclosure: I'm working as VP Product for Neo Technology, the sponsor of the Neo4j open source graph database.

Now that we've just released Neo4j 2.0 (actually 2.0.1 today!) we are embarking on a 2.1 release that is mostly oriented around (even more) performance & scalability. This will increase the upper limits of the graph to an effectively unlimited number of entities, and improve various other things.

Let me set some context first, and then answer your question.

As you probably saw from the paper, Neo4j's current horizontal-scaling architecture allows read scaling, with writes all going to master and fanning out. This gets you effectively unlimited read scaling, and into the tens of thousands of writes per second.
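To picture that architecture, here is a minimal routing sketch in Python (the endpoints and the `route` helper are hypothetical, not Neo4j's actual client API): every write goes to the master, while reads fan out round-robin across the slaves.

```python
from itertools import cycle

# Hypothetical endpoints for an HA cluster: one master plus read slaves.
MASTER = "http://neo4j-master:7474"
READ_SLAVES = cycle([
    "http://neo4j-slave-1:7474",
    "http://neo4j-slave-2:7474",
])

def route(is_write: bool) -> str:
    """Pick an instance: writes go to the master, reads round-robin the slaves."""
    return MASTER if is_write else next(READ_SLAVES)

# A real client would send the query to the chosen endpoint's REST API;
# here we only show the routing decision.
print(route(is_write=True))   # -> master
print(route(is_write=False))  # -> a slave
print(route(is_write=False))  # -> the next slave
```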

Practically speaking, there are production Neo4j customers (including Snap Interactive and Glassdoor) with around a billion people in their social graph... in all cases behind an active and heavily-hit web site, being handled by comparatively quite modest Neo4j clusters (no more than 5 instances). So that's one key feature: the Neo4j of today has incredible computational density, and so we regularly see fairly small clusters handling substantially large production workloads... with very fast response times.

More on the current architecture can be found here: www.neotechnology.com/neo4j-scales-for-the-enterprise/ And a list of customers (which includes companies like Wal-Mart and eBay) can be found here: neotechnology.com/customers/

One of the world's largest parcel delivery carriers uses Neo4j to route all of their packages, in real time, with peaks of 3000 routing operations per second, and zero downtime. (This arguably is the world's largest and most mission-critical use of a graph database and of a NOSQL database; though unfortunately I can't say who it is.)

So in one sense the tl;dr is that if you're not yet as big as Wal-Mart or eBay, then you're probably ok. That oversimplifies it only a bit. There is the 1% of cases where you have sustained transactional write workloads into the 100s of thousands per second. However even in those cases it's often not the right thing to load all of that data into the real-time graph. We usually advise people to do some aggregation or filtering, and bring only the more important things into the graph. Intuit gave a good talk about this. They filter a billion B2B transactions into a much smaller number of aggregate monthly transaction relationships with aggregated counts and currency amounts by direction.
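To illustrate that kind of pre-aggregation (the record layout and numbers below are invented for the example, not Intuit's actual pipeline), a rollup like this collapses raw transactions into one aggregate relationship per counterparty pair, month, and direction:

```python
from collections import defaultdict

# Invented sample records standing in for raw B2B transactions.
transactions = [
    {"src": "acme", "dst": "globex", "amount": 120.0, "month": "2014-01"},
    {"src": "acme", "dst": "globex", "amount": 80.0,  "month": "2014-01"},
    {"src": "globex", "dst": "acme", "amount": 40.0,  "month": "2014-01"},
]

# One aggregate per (source, target, month); direction is preserved
# because (a, b) and (b, a) are distinct keys.
rollup = defaultdict(lambda: {"count": 0, "total": 0.0})
for tx in transactions:
    key = (tx["src"], tx["dst"], tx["month"])
    rollup[key]["count"] += 1
    rollup[key]["total"] += tx["amount"]

# Each entry becomes a single relationship in the graph, instead of
# loading every raw transaction.
for (src, dst, month), agg in rollup.items():
    print(f"({src})-[:TRANSACTED {{month: '{month}', "
          f"count: {agg['count']}, total: {agg['total']}}}]->({dst})")
```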

Enter sharding... Sharding has gained a lot of popularity these days. This is largely thanks to the other three categories of NOSQL, where joins are an anti-pattern. Most queries involve reading or writing just a single piece of discrete data. Just as joining is an anti-pattern for key-value stores and document databases, sharding is an anti-pattern for graph databases. What I mean by that is... the very best performance will occur when all of your data is available in memory on a single instance, because hopping back and forth all over the network whenever you're reading and writing will slow things down significantly (the first sketch after the list below illustrates this with a toy graph), unless you've been really really smart about how you distribute your data... and even then. Our approach has been twofold:

  1. Do as many smart things as possible in order to support extremely high read & write volumes without having to resort to sharding. This gets you the best and most predictable latency and efficiency. In other words: if we can be good enough to support your requirement without sharding, that will always be the best approach. The link above describes some of these tricks, including the deployment pattern that lets you shard your data in memory without having to shard it on disk (a trick we call cache-sharding; the second sketch after this list shows the idea). There are other tricks along similar lines, and more coming down the pike...

  2. Add a secondary architecture pattern into Neo4j that does support sharding. Why do this if sharding is best avoided? As more people find more uses for graphs, and data volumes continue to increase, we think eventually it will be an important and inevitable thing. This would allow you to run all of Facebook for example, in one Neo4j cluster (a pretty huge one)... not just the social part of the graph, which we can handle today. We've already done a lot of work on this, and have an architecture developed that we believe balances the many considerations. This is a multi-year effort, and while we could very easily release a version of Neo4j that shards naively (that would no doubt be really popular), we probably won't do that. We want to do it right, which amounts to rocket science.
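To make the network-hop argument above concrete, here is a toy sketch (the graph, the partitioner, and the `count_remote_hops` helper are all invented for illustration): every edge a traversal follows to a node on another shard stands in for a network round trip.

```python
# Toy illustration of why traversals punish naive sharding: count how
# often a graph walk would cross shard boundaries under a simple hash
# partitioning. The graph and the partitioner are invented.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
NUM_SHARDS = 3

def shard_of(node: str) -> int:
    # Deterministic stand-in for a hash partitioner.
    return sum(map(ord, node)) % NUM_SHARDS

def count_remote_hops(start: str) -> int:
    """Walk the graph from `start`, counting edges that leave the current shard."""
    remote = 0
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for neighbor in edges[node]:
            if shard_of(neighbor) != shard_of(node):
                remote += 1  # in a sharded cluster, a network round trip
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return remote

print(count_remote_hops("a"), "edges cross shard boundaries")
```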
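And here is a minimal sketch of the cache-sharding trick from point 1 (the instance names and routing helper are hypothetical, not a Neo4j API): every instance still holds the whole graph on disk, but routing each request by a stable key keeps each instance's cache warm for its own slice of the data.

```python
# Cache-sharding sketch: the full graph lives on every instance's disk,
# but consistently routing requests by a key (here, a user id) means
# each instance's in-memory cache warms on a predictable subset of the
# graph. Instance names are invented for the example.
INSTANCES = ["neo4j-1", "neo4j-2", "neo4j-3"]

def instance_for(user_id: int) -> str:
    # Stable routing: the same user always hits the same instance, so
    # that instance's cache stays warm for that user's neighborhood.
    return INSTANCES[user_id % len(INSTANCES)]

for uid in (101, 102, 103, 101):
    print(f"user {uid} -> {instance_for(uid)}")
```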
