Pros of databases like BigTable, SimpleDB


Problem Description

New-school datastore paradigms like Google BigTable and Amazon SimpleDB are designed specifically for scalability, among other things. Basically, disallowing joins and relying on denormalization are the ways this is being accomplished.

In this topic, however, the consensus seems to be that joins on large tables don't necessarily have to be too expensive and that denormalization is "overrated" to some extent. Why, then, do these aforementioned systems disallow joins and force everything into a single table in order to achieve scalability? Is it the sheer volume of data that needs to be stored in these systems (many terabytes)?
Do the general rules for databases simply not apply at these scales? Is it because these database types are tailored specifically towards storing many similar objects?
Or am I missing some bigger picture?
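
To make the pattern in the question concrete, here is a minimal, hypothetical sketch (Python's standard-library sqlite3 stands in for a relational database; the users/orders tables and all names are invented for illustration) of the same lookup done once with a normalized join and once against a denormalized, single-record layout of the kind these stores encourage.

```python
import sqlite3

# Normalized layout: two tables, the read requires a join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT);
    INSERT INTO users  VALUES (1, 'alice');
    INSERT INTO orders VALUES (10, 1, 'book');
""")
row = conn.execute("""
    SELECT u.name, o.item
    FROM orders o JOIN users u ON u.user_id = o.user_id
    WHERE o.order_id = 10
""").fetchone()
print(row)  # ('alice', 'book')

# Denormalized layout: one wide record per key, no join at read time.
# The user's name was copied into the order row when it was written.
orders_by_id = {
    10: {"user_id": 1, "user_name": "alice", "item": "book"},
}
print(orders_by_id[10]["user_name"], orders_by_id[10]["item"])
```

The denormalized version duplicates the user's name into every order, so a read never touches more than one record, which is what makes the data easy to partition across machines; the cost shows up whenever the duplicated value has to change.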

Solution

Distributed databases aren't quite as naive as Orion implies; there has been quite a bit of work done on optimizing fully relational queries over distributed datasets. You may want to look at what companies like Teradata, Netezza, Greenplum, Vertica, AsterData, etc. are doing. (Oracle finally got in the game as well with their recent announcement; Microsoft bought their solution in the form of the company that used to be called DataAllegro.)

That being said, when the data scales up into terabytes, these issues become very non-trivial. If you don't need the strict transactionality and consistency guarantees you can get from RDBMSs, it is often far easier to denormalize and not do joins. Especially if you don't need to cross-reference much. Especially if you are not doing ad-hoc analysis, but require programmatic access with arbitrary transformations.
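
A sketch of what "denormalize and not do joins" tends to look like in practice, assuming a generic key-value store (modeled here as a plain dict with hypothetical put/get helpers; SimpleDB or BigTable would replace them with remote calls): the cross-referenced value is copied into each dependent record at write time, so a read is a single key lookup and the maintenance cost moves to writes.

```python
# Hypothetical key-value store; a real system would replace this dict
# with remote put/get calls against SimpleDB, BigTable, etc.
store = {}

def put(key, value):
    store[key] = value

def get(key):
    return store[key]

def create_order(order_id, user_id, item):
    # Copy the user's current name into the order record (denormalization),
    # so reading an order never needs a second lookup or a join.
    user = get(f"user:{user_id}")
    put(f"order:{order_id}",
        {"user_id": user_id, "user_name": user["name"], "item": item})

def rename_user(user_id, new_name, order_ids):
    # The price of denormalization: every copy must be updated on write.
    user = get(f"user:{user_id}")
    user["name"] = new_name
    put(f"user:{user_id}", user)
    for oid in order_ids:
        order = get(f"order:{oid}")
        order["user_name"] = new_name
        put(f"order:{oid}", order)

put("user:1", {"name": "alice"})
create_order(10, 1, "book")
print(get("order:10"))   # one read, no join
rename_user(1, "alicia", [10])
print(get("order:10"))   # duplicated value updated by the write path
```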

Denormalization is overrated. Just because that's what happens when you are dealing with 100 terabytes doesn't mean this fact should be used by every developer who never bothered to learn about databases and has trouble querying a million or two rows due to poor schema planning and query optimization.
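
To put the "million or two rows" remark in perspective, here is a hedged illustration using the standard-library sqlite3 (the events table and the numbers are made up for demonstration): the same lookup goes from a full scan of a million rows to an index seek once the schema gets an appropriate index, with no denormalization involved.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)")
conn.executemany("INSERT INTO events (user_id, payload) VALUES (?, ?)",
                 ((i % 10_000, "x") for i in range(1_000_000)))

def timed_lookup(label):
    start = time.perf_counter()
    n = conn.execute("SELECT COUNT(*) FROM events WHERE user_id = ?", (1234,)).fetchone()[0]
    print(f"{label}: {n} rows in {time.perf_counter() - start:.4f}s")

timed_lookup("without index")    # full table scan over 1,000,000 rows
conn.execute("CREATE INDEX idx_events_user ON events(user_id)")
timed_lookup("with index")       # index seek, touches only the matching rows
```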

But if you are in the 100-terabyte range, by all means...

Oh, the other reason these technologies are getting the buzz -- folks are discovering that some things never belonged in the database in the first place, and are realizing that they aren't dealing with relations in their particular fields, but with basic key-value pairs. For things that shouldn't have been in a DB, it's entirely possible that the Map-Reduce framework, or some persistent, eventually-consistent storage system, is just the thing.
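
As a rough sketch of the "basic key-value pairs" plus Map-Reduce style of processing the answer alludes to (pure Python, no framework; the record data and function names are invented to mirror the model):

```python
from collections import defaultdict

def map_phase(records):
    # Map: turn each input record into (key, value) pairs.
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

records = ["big table big data", "simple db simple model"]
print(reduce_phase(shuffle(map_phase(records))))
# {'big': 2, 'table': 1, 'data': 1, 'simple': 2, 'db': 1, 'model': 1}
```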

On a less global scale, I highly recommend BerkeleyDB for those sorts of problems.
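
A minimal sketch of the embedded key-value access pattern being recommended here, using Python's standard-library dbm module as a stand-in (Berkeley DB itself is usually driven through its C/Java API or a binding such as bsddb3, but the get/put model is the same; the file and key names are arbitrary):

```python
import dbm

# An embedded key-value store: no server, no schema, just bytes in, bytes out.
with dbm.open("sessions.db", "c") as db:          # "c" creates the file if needed
    db[b"session:42"] = b'{"user_id": 1, "cart": ["book"]}'
    print(db[b"session:42"])                      # read back by exact key
    print(b"session:99" in db)                    # membership check, no query planner
```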
