What's the best practice in designing a Cassandra data model?


Question

And what are the pitfalls to avoid? Are there any deal breakers for you? E.g., I've heard that exporting/importing Cassandra data is very difficult, which makes me wonder whether that will hinder syncing production data to a development environment.

BTW, it's very hard to find good tutorials on Cassandra; the only one I have, http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-model, is still pretty basic.

Thanks.

Answer

For me, the main thing is deciding whether to use the OrderedPartitioner or the RandomPartitioner.

If you use the RandomPartitioner, range scans are not possible. This means that you must know the exact key for any activity, INCLUDING CLEANING UP OLD DATA.

So if you've got a lot of churn and you use the random partitioner, then unless you have some magic way of knowing exactly which keys you've inserted stuff for, you can easily "lose" stuff. That causes a disk space leak which will eventually consume all storage.

On the other hand, you can ask the ordered partitioner "what keys do I have in Column Family X between A and B?" and it'll tell you. You can then clean them up.
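To make the "between A and B" idea concrete, here is a minimal sketch in plain Python. It does not use any Cassandra client; it only models keys held in sorted order, which is the property an order-preserving partitioner gives you. The function name `range_scan` and the `user:<name>:<n>` key layout are made up for illustration.

```python
import bisect

def range_scan(sorted_keys, start, end):
    """Return every key k with start <= k < end from an already-sorted list."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]

# Hypothetical row keys, kept in sorted order as an ordered partitioner would.
keys = sorted(["user:alice:1", "user:alice:2", "user:bob:1", "user:carol:1"])

# ';' is the character right after ':' in ASCII, so this range covers
# exactly the keys that start with "user:alice:".
alice_keys = range_scan(keys, "user:alice:", "user:alice;")
```

With hashed placement there is no such ordering to exploit, so the only way to find these keys is to already know them.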

However, there is a downside as well. As Cassandra doesn't do automatic load balancing, if you use the ordered partitioner, in all likelihood all your data will end up in just one or two nodes and none in the others, which means you'll waste resources.
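A toy illustration of that skew, assuming four nodes and a deliberately simplified placement rule (real token assignment is more involved than this). `ordered_node` and `random_node` are hypothetical helpers, not Cassandra APIs; the MD5 call just stands in for hash-based placement.

```python
import hashlib
from collections import Counter

NODES = 4

def ordered_node(key):
    # Order-preserving placement: split the byte range 0..255 of the first
    # character evenly across the nodes (assumes an ASCII first character).
    return ord(key[0]) * NODES // 256

def random_node(key):
    # Hash-based placement: the node depends on a digest of the whole key.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NODES

# Hypothetical keys that all share a common prefix, as real keys often do.
keys = [f"user:{i:05d}" for i in range(10_000)]

ordered_load = Counter(ordered_node(k) for k in keys)
random_load = Counter(random_node(k) for k in keys)
```

Because every key starts with "u", `ordered_load` puts all 10,000 keys on a single node, while `random_load` spreads them roughly evenly over the four.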

I don't have any easy answer for this, except you can get "best of both worlds" in some cases by putting a short hash value (of something you can enumerate easily from other data sources) on the beginning of your keys - for example a 16-bit hex hash of the user ID - which will give you 4 hex digits, followed by whatever the key is you really wanted to use.

Then if you have a list of recently-deleted users, you can just hash their IDs and range scan to clean up anything related to them.
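A rough sketch of the hash-prefix scheme, again in plain Python over a sorted list rather than a real cluster. The choice of CRC32 truncated to 16 bits, the `:` separator, and the names `prefix_for`, `make_key` and `keys_for_user` are all my assumptions for the example; it also assumes user IDs never contain `:`.

```python
import bisect
import zlib

def prefix_for(user_id):
    """4 hex digits from a 16-bit hash of the user ID (truncated CRC32)."""
    return format(zlib.crc32(user_id.encode("utf-8")) & 0xFFFF, "04x")

def make_key(user_id, real_key):
    # The hash prefix spreads users across the key space; the rest of the
    # key is whatever you really wanted to use.
    return f"{prefix_for(user_id)}:{user_id}:{real_key}"

def keys_for_user(sorted_keys, user_id):
    """Range scan for every key belonging to one user."""
    start = f"{prefix_for(user_id)}:{user_id}:"
    end = start[:-1] + ";"  # ';' sorts immediately after ':'
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]

store = sorted([
    make_key("alice", "profile"),
    make_key("alice", "settings"),
    make_key("bob", "profile"),
])

# "alice" was recently deleted: recompute her prefix and scan her range.
doomed = keys_for_user(store, "alice")
remaining = [k for k in store if k not in doomed]
```

The point is that the prefix can be recomputed from the user ID alone, so the cleanup job needs only the list of deleted users, not a record of every key ever written.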

The next tricky bit is secondary indexes - Cassandra doesn't have any - so if you need to look up X by Y, you need to insert the data under both keys, or have a pointer. Likewise, these pointers may need to be cleaned up when the thing they point to doesn't exist, but there's no easy way of querying stuff on this basis, so your app needs to Just Remember.
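Here is a minimal sketch of the "pointer" approach, using two dicts to stand in for two column families. The `UserStore` class and its field names are hypothetical; the thing to notice is that every write and delete has to touch both structures, because nothing else will.

```python
class UserStore:
    """Toy model: a primary table plus a hand-maintained lookup by email."""

    def __init__(self):
        self.users = {}     # primary: user_id -> record
        self.by_email = {}  # pointer: email -> user_id

    def put(self, user_id, email, name):
        old = self.users.get(user_id)
        if old is not None and old["email"] != email:
            # The email changed, so the old pointer must be removed by hand.
            self.by_email.pop(old["email"], None)
        self.users[user_id] = {"email": email, "name": name}
        self.by_email[email] = user_id

    def get_by_email(self, email):
        user_id = self.by_email.get(email)
        return None if user_id is None else self.users.get(user_id)

    def delete(self, user_id):
        record = self.users.pop(user_id, None)
        if record is not None:
            # The app has to remember which pointers belong to this record.
            self.by_email.pop(record["email"], None)

store = UserStore()
store.put("u1", "alice@example.com", "Alice")
store.put("u1", "alice@example.org", "Alice")  # email change
```

If any code path skips the `by_email` cleanup, the pointer is orphaned and nothing in the data store will flag it.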

And application bugs may leave orphaned keys that you've forgotten about, and you'll have no way of easily detecting them, unless you write some garbage collector which periodically scans every single key in the db (this is going to take a while - but you can do it in chunks) to check for ones which aren't needed any more.
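Such a garbage collector could look roughly like this. It is a sketch over in-memory dicts, where `index` maps pointer keys to the primary keys they reference; `sweep_orphans` and `chunk_size` are names I've invented, and in a real deployment each chunk would be one page of a range scan rather than a slice of a Python list.

```python
from itertools import islice

def sweep_orphans(index, primary, chunk_size=1000):
    """Delete index entries whose target is missing from primary.

    Works through the keys in chunks and returns the removed keys.
    """
    removed = []
    keys = iter(sorted(index))  # snapshot, so deleting during the sweep is safe
    while True:
        chunk = list(islice(keys, chunk_size))
        if not chunk:
            break
        for key in chunk:
            if index[key] not in primary:
                del index[key]
                removed.append(key)
    return removed

primary = {"u1": "Alice", "u3": "Carol"}
index = {"alice@example.com": "u1", "bob@example.com": "u2", "carol@example.com": "u3"}

removed = sweep_orphans(index, primary, chunk_size=2)
```

Here `bob@example.com` points at `u2`, which no longer exists, so it is the only entry swept away.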

None of this is based on real usage, just what I've figured out during research. We don't use Cassandra in production.

EDIT: Cassandra now does have secondary indexes in trunk.
