Is the IN relation in Cassandra bad for queries?


Question


Consider the following SELECT in CQL:

SELECT * FROM tickets WHERE ID IN (1, 2, 3, 4);

Given that ID is the partition key, is using the IN relation better than issuing multiple queries, or is there no difference?

Solution

I remember seeing someone answer this question on the Cassandra user mailing list a short while back, but I cannot find the exact message right now. Coincidentally, Cassandra Evangelist Rebecca Mills just posted an article that addresses this issue (Things you should be doing when using Cassandra drivers; see points #13 and #22). The short answer is that yes, there is a difference: in some cases, multiple parallel queries will be faster than a single query using IN. The underlying reason can be found in the DataStax SELECT documentation.
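To make "multiple parallel queries" concrete, here is a minimal sketch in Python. It assumes `session` is a session object from the DataStax Python driver (which exposes `prepare` and `execute_async`). The function name `fetch_tickets_parallel` and the lowercase `id` column are illustrative assumptions, not something from the original question.

```python
def fetch_tickets_parallel(session, ticket_ids):
    """Issue one single-partition query per ID, with all requests in flight at once."""
    # Prepare once so every request reuses the same parsed statement.
    select_one = session.prepare("SELECT * FROM tickets WHERE id = ?")

    # execute_async returns a future immediately instead of blocking, so the
    # requests overlap rather than running one after another.
    futures = [
        session.execute_async(select_one, (ticket_id,))
        for ticket_id in ticket_ids
    ]

    rows = []
    for future in futures:
        # result() blocks until that one query finishes (or raises its error).
        rows.extend(future.result())
    return rows
```

Each request here touches a single partition, so no one coordinator has to gather rows for every key in the list. Whether this actually beats IN depends on the cluster, the driver's load-balancing policy, and the number of keys, so it is worth measuring both.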

When not to use IN

...Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range.

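Based on that, this becomes more of a problem as your cluster gets larger: the more nodes there are, the more likely it is that the keys in the IN list live on different replica sets, and a single coordinator has to wait for all of them before it can respond.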
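The node counts in the quoted passage can be approximated with a little arithmetic. The sketch below is a rough upper bound, assuming the worst case in which every key in the IN list lands on a completely different set of replicas. The function names are made up for illustration.

```python
def quorum(replication_factor):
    # A quorum is a majority of the replicas: floor(RF / 2) + 1.
    return replication_factor // 2 + 1


def max_replicas_contacted(num_keys, replication_factor, cluster_size):
    # Worst case: no two keys share a replica, so each key adds a full
    # quorum of nodes, capped by the number of nodes in the cluster.
    return min(cluster_size, num_keys * quorum(replication_factor))


print(max_replicas_contacted(1, 3, 30))  # 2  (single-key query)
print(max_replicas_contacted(4, 3, 30))  # 8  (IN (1, 2, 3, 4))
```

With a replication factor of 3, a single key needs two replicas, which matches the documentation. The four-key query from the question could touch up to eight nodes under these assumptions, and ten keys would reach the documentation's figure of 20.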

Therefore, the best way to solve this problem (and not have to use IN at all) would be to rethink your data model for this query. Without knowing too much about your schema, perhaps there are attributes (column values) that ticket IDs 1, 2, 3, and 4 share. You could partition on something like a level or group (if the tickets are for a particular venue), or maybe even an event ID, instead.

Basically, while using a unique, high-cardinality identifier to partition your data sounds like a good idea, it actually makes your data harder to query later on (in Cassandra). If you can come up with a different column to partition your data on, that would certainly help in this case. Regardless, creating a new column family (table) designed specifically for this query is going to be a better approach than using either IN or multiple queries.
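As a sketch of what such a query-specific table might look like, the Python snippet below holds a hypothetical schema and the single-partition query that would replace the IN. The table name `tickets_by_event` and the columns `event_id`, `ticket_id`, and `level` are all assumptions made for illustration, since the real schema is not known. `session` is again assumed to be a DataStax Python driver session.

```python
# Hypothetical schema: table and column names are illustrative only.
CREATE_TICKETS_BY_EVENT = """
CREATE TABLE IF NOT EXISTS tickets_by_event (
    event_id  int,
    ticket_id int,
    level     text,
    PRIMARY KEY ((event_id), ticket_id)
)
"""

# All tickets for one event live in one partition, so no IN is needed.
SELECT_TICKETS_FOR_EVENT = "SELECT * FROM tickets_by_event WHERE event_id = ?"


def create_table(session):
    # Run the DDL once, e.g. at deployment time.
    session.execute(CREATE_TICKETS_BY_EVENT)


def fetch_tickets_for_event(session, event_id):
    # One query against one partition key returns every ticket in that event.
    statement = session.prepare(SELECT_TICKETS_FOR_EVENT)
    return list(session.execute(statement, (event_id,)))
```

The trade-off is that the application has to write each ticket to this table as well as to any table keyed by ticket ID, and a very large event would make for a large partition.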
