What's the Best Practice in Designing a Cassandra Data Model?

Question

And what are the pitfalls to avoid? Are there any deal breakers for you? For example, I've heard that exporting and importing Cassandra data is very difficult, which makes me wonder whether that will hinder syncing production data to a development environment.

BTW, it's very hard to find good tutorials on Cassandra. The only one I have, http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-model, is still pretty basic.

Thanks.

Recommended answer

For me, the main decision is whether to use the OrderedPartitioner or the RandomPartitioner.

If you use the RandomPartitioner, range scans are not possible. This means that you must know the exact key for any activity, INCLUDING CLEANING UP OLD DATA.

So if you've got a lot of churn, then unless you have some magic way of knowing exactly which keys you've inserted data for, with the random partitioner you can easily "lose" data. That causes a disk space leak, which will eventually consume all storage.

On the other hand, you can ask the ordered partitioner "what keys do I have in Column Family X between A and B?" and it'll tell you. You can then clean them up.

However, there is a downside as well. As Cassandra doesn't do automatic load balancing, if you use the ordered partitioner, in all likelihood all your data will end up in just one or two nodes and none in the others, which means you'll waste resources.

I don't have an easy answer for this, except that in some cases you can get the "best of both worlds" by putting a short hash value (of something you can enumerate easily from other data sources) at the beginning of your keys. For example, a 16-bit hash of the user ID gives you 4 hex digits, followed by whatever key you really wanted to use.

Then, if you have a list of recently deleted users, you can just hash their IDs and range scan to clean up anything related to them.
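As a rough illustration of the hash-prefix idea, here is a minimal Python sketch. It does not talk to Cassandra at all: a sorted list stands in for an order-preserving key space, and the function names, the `:` separator, and the choice of MD5 are all assumptions made for this example. It also assumes user IDs are ASCII and never contain the separator.

```python
import bisect
import hashlib

SEP = ":"  # hypothetical separator; assumed never to appear in user IDs


def prefix_for(user_id):
    """Return a 16-bit hash of the user ID as 4 hex digits."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return digest[:2].hex()


def make_key(user_id, real_key):
    """Build a row key: hash prefix, then user ID, then the key you really wanted."""
    return prefix_for(user_id) + SEP + user_id + SEP + real_key


def user_key_range(user_id):
    """Return (start, end) bounds covering every key that belongs to one user."""
    start = prefix_for(user_id) + SEP + user_id + SEP
    # The end bound replaces the trailing separator with the next character,
    # so it sorts just after every key that starts with `start`.
    end = start[:-1] + chr(ord(SEP) + 1)
    return start, end


def range_scan(sorted_keys, start, end):
    """Simulate an ordered range scan over keys in [start, end)."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]
```

To clean up after a deleted user, you would compute `user_key_range(user_id)` and delete whatever the scan returns. The prefix spreads users across the key space, while all keys for one user stay contiguous.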

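The next tricky bit is secondary indexes. Cassandra doesn't have any, so if you need to look up X by Y, you need to insert the data under both keys, or keep a pointer. Likewise, these pointers may need to be cleaned up when the thing they point to no longer exists, but there is no easy way to query for things on that basis, so your application needs to remember them.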
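A minimal sketch of the "pointer" approach, with two plain Python dicts standing in for two column families. The names (`users_by_id`, `user_id_by_email`) and the email lookup are hypothetical and chosen only for illustration:

```python
# Two dicts stand in for two column families (an assumption for this sketch).
users_by_id = {}       # primary rows, keyed by user ID
user_id_by_email = {}  # pointer rows, keyed by email, holding a user ID


def insert_user(user_id, email, data):
    """Write the primary row and the pointer row together."""
    row = dict(data)
    row["email"] = email
    users_by_id[user_id] = row
    user_id_by_email[email] = user_id


def find_by_email(email):
    """Follow the pointer; returns None if the pointer is missing or dangling."""
    user_id = user_id_by_email.get(email)
    if user_id is None:
        return None
    return users_by_id.get(user_id)


def delete_user(user_id):
    """Delete the primary row and the pointer the application remembers for it."""
    row = users_by_id.pop(user_id, None)
    if row is not None:
        user_id_by_email.pop(row["email"], None)
```

The application has to do both writes and both deletes itself. If `delete_user` skipped the second step, the pointer would be left dangling with nothing in the store to flag it.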

And application bugs may leave orphaned keys that you've forgotten about, and you'll have no way of easily detecting them, unless you write some garbage collector which periodically scans every single key in the db (this is going to take a while - but you can do it in chunks) to check for ones which aren't needed any more.
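The garbage collector described above could look roughly like the following sketch. It is an in-memory stand-in, not Cassandra client code: `pointers` and `targets` are plain dicts, and both function names and the `chunk_size` parameter are assumptions for this example.

```python
def iter_key_chunks(sorted_keys, chunk_size):
    """Yield successive slices of the key list, so the scan runs in chunks."""
    for i in range(0, len(sorted_keys), chunk_size):
        yield sorted_keys[i:i + chunk_size]


def sweep_orphans(pointers, targets, chunk_size=1000):
    """Remove every pointer whose target row no longer exists.

    Returns the list of pointer keys that were removed.
    """
    removed = []
    # sorted() takes a snapshot of the keys, so deleting while sweeping is safe.
    for chunk in iter_key_chunks(sorted(pointers), chunk_size):
        for key in chunk:
            if pointers[key] not in targets:
                del pointers[key]
                removed.append(key)
    return removed
```

Against a real cluster, each chunk would correspond to one bounded range scan, with the last key of a chunk used as the start of the next one.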

None of this is based on real usage, just what I've figured out during research. We don't use Cassandra in production.

Cassandra now does have secondary indexes in trunk.
