Pros of databases like BigTable, SimpleDB
Problem description
New-school datastore paradigms like Google BigTable and Amazon SimpleDB are designed specifically for scalability, among other things. Basically, disallowing joins and denormalizing the data are the ways this is being accomplished.
In this topic, however, the consensus seems to be that joins on large tables don't necessarily have to be too expensive and that denormalization is "overrated" to some extent.
Why, then, do these aforementioned systems disallow joins and force everything together in a single table to achieve scalability? Is it the sheer volumes of data that needs to be stored in these systems (many terabytes)?
Do the general rules for databases simply not apply to these scales?
Is it because these database types are tailored specifically towards storing many similar objects?
Or am I missing some bigger picture?
Distributed databases aren't quite as naive as Orion implies; there has been quite a bit of work done on optimizing fully relational queries over distributed datasets. You may want to look at what companies like Teradata, Netezza, Greenplum, Vertica, AsterData, etc. are doing. (Oracle finally got in the game as well, with their recent announcement; Microsoft bought its solution by acquiring the company that used to be called DataAllegro.)
That being said, when the data scales up into terabytes, these issues become very non-trivial. If you don't need the strict transactionality and consistency guarantees you can get from RDBMs, it is often far easier to denormalize and not do joins. Especially if you don't need to cross-reference much. Especially if you are not doing ad-hoc analysis, but require programmatic access with arbitrary transformations.
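To make the trade-off concrete, here is a minimal sketch using SQLite (in-process, stdlib) contrasting a normalized schema queried with a join against a denormalized table that answers the same question with a single-table read. The table and column names (`users`, `orders`, `orders_denorm`) are illustrative, not from any particular system; on one machine the join is cheap, but on a distributed store the join may require shipping rows between nodes, which is what these systems avoid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized: users and orders live in separate tables, combined at
# query time with a join.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT)")
cur.execute("INSERT INTO users VALUES (1, 'alice')")
cur.execute("INSERT INTO orders VALUES (10, 1, 'book')")
rows = cur.execute(
    "SELECT u.name, o.item FROM users u JOIN orders o ON o.user_id = u.id"
).fetchall()

# Denormalized: the user's name is copied into each order row, so a
# single-table scan (or a single key lookup) answers the same question
# -- at the cost of duplicated data that must be kept consistent.
cur.execute("CREATE TABLE orders_denorm (id INTEGER PRIMARY KEY, user_name TEXT, item TEXT)")
cur.execute("INSERT INTO orders_denorm VALUES (10, 'alice', 'book')")
denorm_rows = cur.execute("SELECT user_name, item FROM orders_denorm").fetchall()

assert rows == denorm_rows == [("alice", "book")]
```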
Denormalization is overrated. Just because that's what happens when you are dealing with 100 terabytes doesn't mean this fact should be leaned on by every developer who never bothered to learn about databases and has trouble querying a million or two rows due to poor schema planning and query optimization.
But if you are in the 100-terabyte range, by all means...
Oh, the other reason these technologies are getting the buzz -- folks are discovering that some things never belonged in the database in the first place, and are realizing that they aren't dealing with relations in their particular fields, but with basic key-value pairs. For things that shouldn't have been in a DB, it's entirely possible that the Map-Reduce framework, or some persistent, eventually-consistent storage system, is just the thing.
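For readers unfamiliar with the model mentioned above, here is a minimal in-process sketch of map-reduce: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step folds each group. The word-count task and the function names are illustrative only; real frameworks distribute these phases across machines, which is the whole point.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Map: emit a (key, value) pair for each word in the record.
    for word in record.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key (done via network partitioning
    # in a real distributed framework).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: fold each key's values into a single result.
    return (key, sum(values))

records = ["to be or not to be"]
pairs = chain.from_iterable(map_phase(r) for r in records)
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
assert result == {"to": 2, "be": 2, "or": 1, "not": 1}
```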
On a less global scale, I highly recommend BerkeleyDB for those sorts of problems.
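BerkeleyDB itself is reached from Python through third-party bindings, but the stdlib `dbm` module exposes the same style of key-value interface and serves as a stand-in sketch of what "basic key-value pairs" means in practice: no schema, no joins, just get/put by key. The `user:1:...` key convention is an illustrative assumption, not an API requirement.

```python
import dbm
import os
import tempfile

# Open (and create) a key-value store file; keys and values are bytes,
# and the application decides how to encode them.
path = os.path.join(tempfile.mkdtemp(), "kv")
with dbm.open(path, "c") as db:
    db[b"user:1:name"] = b"alice"
    db[b"user:1:last_order"] = b"book"

# Reads are direct lookups by key -- there is no query planner involved.
with dbm.open(path, "r") as db:
    name = db[b"user:1:name"]

assert name == b"alice"
```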