Switching from MySQL to Cassandra - Pros/Cons?


Question

For a bit of background: this question deals with a project that runs on a single small EC2 instance and is about to migrate to a medium one. The main components are Django, MySQL and a large number of custom analysis tools written in Python and Java, which do the heavy lifting. The same machine runs Apache as well.

The data model looks like this: a large amount of real-time data is streamed in from various networked sensors. Ideally, I'd like to establish a long-poll approach rather than the current approach of polling every 15 minutes (a limitation of computing stats and writing into the database itself). Once the data comes in, I store the raw version in MySQL, let the analysis tools loose on this data, and store statistics in another few tables. All of this is rendered using Django.

Relational features I would need:



  • Order by [SliceRange in Cassandra's API seems to satisfy this]
  • Group by
  • Many-to-many relations between multiple tables [Cassandra SuperColumns seem to do well for one-to-many]
  • Full-text search: Sphinx gives me a nice full-text engine on this data, so that's a necessity too. [On Cassandra, the Lucandra project seems to satisfy this need]

My major problem is that data reads are extremely slow (and writes aren't that hot either). I don't want to throw a lot of money and hardware at it right now, and I'd prefer something that can scale easily with time. Vertically scaling MySQL is not trivial in that sense (or cheap).

So essentially, after having read a lot about NoSQL and experimented with things like MongoDB, Cassandra and Voldemort, my questions are:


  • On a medium EC2 instance, would I gain any benefits in reads/writes by shifting to something like Cassandra? This article (pdf) definitely seems to suggest that. Currently, I'd say a few hundred writes per minute would be the norm. For reads: since the data changes every 5 minutes or so, cache invalidation has to happen pretty quickly. At some point, it should be able to handle a large number of concurrent users as well. The app performance currently gets killed on MySQL doing some joins on large tables even if indexes are created; something on the order of 32k rows takes more than a minute to render. (This may be an artifact of EC2 virtualized I/O as well.) Each table holds around 4-5 million rows, and there are about 5 such tables.

  • Everyone talks about using Cassandra on multiple nodes, given the CAP theorem and eventual consistency. But for a project that is just beginning to grow, does it make sense to deploy a one-node Cassandra server? Are there any caveats? For instance, can it replace MySQL as a backend for Django? [Is this recommended?]

  • If I do shift, I'm guessing I'll have to rewrite parts of the app to do a lot more "administrivia", since I'd have to do multiple lookups to fetch rows. Would it make sense to just use MySQL as a key/value store rather than a relational engine, and go with that? That way I could make use of the large number of stable APIs available, as well as a stable engine (and go relational as needed). (Bret Taylor's post from Friendfeed on this - http://bret.appspot.com/entry/how-friendfeed-uses-mysql)
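To make the key/value-on-SQL idea in the last question concrete, here is a minimal sketch of how I understand it: entities stored as opaque blobs under a primary key, with a separate table acting as the index. This is an illustration under assumptions, not the Friendfeed implementation. `sqlite3` from the Python standard library stands in for MySQL so the sketch runs anywhere, and the table names, column names and the `put`/`get`/`find_by_sensor` helpers are all hypothetical.

```python
import json
import sqlite3
import uuid

# sqlite3 stands in for MySQL here so the sketch is runnable;
# table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
# A separate index table takes the place of a relational secondary index.
conn.execute("CREATE TABLE index_sensor (sensor_id TEXT, entity_id TEXT)")
conn.execute("CREATE INDEX idx_sensor ON index_sensor (sensor_id)")


def put(entity):
    """Store an entity as an opaque JSON blob and update the index table."""
    entity_id = uuid.uuid4().hex
    conn.execute("INSERT INTO entities VALUES (?, ?)", (entity_id, json.dumps(entity)))
    conn.execute("INSERT INTO index_sensor VALUES (?, ?)", (entity["sensor_id"], entity_id))
    return entity_id


def get(entity_id):
    """Fetch one entity by primary key, with no joins involved."""
    row = conn.execute("SELECT body FROM entities WHERE id = ?", (entity_id,)).fetchone()
    return json.loads(row[0]) if row else None


def find_by_sensor(sensor_id):
    """Two-step lookup: index table first, then primary-key fetches."""
    ids = conn.execute(
        "SELECT entity_id FROM index_sensor WHERE sensor_id = ?", (sensor_id,)
    ).fetchall()
    return [get(entity_id) for (entity_id,) in ids]
```

The multiple lookups in `find_by_sensor` are the kind of extra "administrivia" the application would have to take over from the relational engine.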

Thanks.

Recommended answer

Cassandra and the other distributed databases available today do not provide the kind of ad-hoc query support you are used to from SQL. This is because you can't distribute queries with joins performantly, so the emphasis is on denormalization instead.
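As a rough illustration of what denormalization means here, the sketch below does the work at write time so that a read is a single key lookup instead of a join. It is plain Python, not a Cassandra API: the `stats_by_sensor` dict is a hypothetical stand-in for a column family (row key mapped to named columns), and the field names are made up for the example.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for a column family:
# row key -> {column name -> value}. Not a Cassandra API.
stats_by_sensor = defaultdict(dict)


def record_reading(sensor_id, location, timestamp, value):
    """Write path: copy everything a read will need into one row."""
    row = stats_by_sensor[sensor_id]
    row["location"] = location  # duplicated here instead of joined at read time
    row["count"] = row.get("count", 0) + 1
    row["total"] = row.get("total", 0.0) + value
    row["last_seen"] = timestamp


def sensor_summary(sensor_id):
    """Read path: a single key lookup, no join."""
    row = stats_by_sensor[sensor_id]
    return {
        "location": row["location"],
        "mean": row["total"] / row["count"],
        "last_seen": row["last_seen"],
    }
```

The trade-off is that writes do more work and some data is stored more than once, in exchange for reads that touch exactly one row.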

However, Cassandra 0.6 (beta officially out tomorrow, but you can build from the 0.6 branch yourself if you're impatient) supports Hadoop map/reduce for analytics, which actually sounds like a good fit for you.
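For readers who haven't used map/reduce, this is roughly the shape a "group by sensor, compute the mean" job takes. It is a single-process sketch in plain Python, not Hadoop's or Cassandra's API; the function names and the `(sensor_id, value)` record layout are assumptions made for the example.

```python
from collections import defaultdict


def map_reading(reading):
    """Map step: emit a (key, value) pair for one raw reading."""
    sensor_id, value = reading
    yield sensor_id, value


def reduce_values(sensor_id, values):
    """Reduce step: collapse all values for one key into a statistic."""
    return sensor_id, sum(values) / len(values)


def run_map_reduce(readings):
    """Single-process driver; a real framework distributes these phases."""
    grouped = defaultdict(list)
    for reading in readings:
        for key, value in map_reading(reading):
            grouped[key].append(value)  # the "shuffle": group values by key
    return dict(reduce_values(k, v) for k, v in grouped.items())
```

For example, `run_map_reduce([("a", 1.0), ("a", 3.0), ("b", 5.0)])` groups the readings by sensor and returns the per-sensor means.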

Cassandra provides excellent support for adding new nodes painlessly, even to an initial group of one.

That said, at a few hundred writes/minute you're going to be fine on MySQL for a long, long time. Cassandra is much better at being a key/value store (even better, key/columnfamily), but MySQL is much better at being a relational database. :)

There is no Django support for Cassandra (or any other NoSQL database) yet. They are talking about doing something for the next version after 1.2, but based on talking to Django devs at PyCon, nobody is really sure what that will look like yet.
