Is the IN relation in Cassandra bad for queries?
Problem description
Given the following example SELECT in CQL:
SELECT * FROM tickets WHERE ID IN (1,2,3,4)
Given that ID is the partition key, is using the IN relation better than running multiple queries, or is there no difference?
I remember seeing someone answer this question on the Cassandra user mailing list a short while back, but I cannot find the exact message right now. Ironically, Cassandra Evangelist Rebecca Mills just posted an article that addresses this issue (Things you should be doing when using Cassandra drivers; see points #13 and #22). But the answer is "yes": in some cases, multiple parallel queries will be faster than using IN. The underlying reason can be found in the DataStax SELECT documentation.
When not to use IN
...Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range.
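The fan-out described in that passage can be illustrated with a toy model. This is only a sketch under loud assumptions: the ring is split into equal ranges with no vnodes, a key's replicas are its primary node plus the next nodes clockwise, and the coordinator always contacts the first quorum of replicas. Real clusters place and pick replicas differently, and the function names here are hypothetical.

```python
import hashlib


def replicas_for_key(key, num_nodes=30, rf=3):
    """Toy placement: hash the key onto one of num_nodes equal ranges,
    then take that node and the next rf - 1 nodes clockwise."""
    digest = hashlib.md5(str(key).encode()).digest()
    primary = int.from_bytes(digest[:8], "big") % num_nodes
    return [(primary + i) % num_nodes for i in range(rf)]


def nodes_contacted(keys, num_nodes=30, rf=3):
    """Distinct nodes touched when every key is read at quorum."""
    quorum = rf // 2 + 1  # 2 of 3 replicas for RF = 3
    contacted = set()
    for key in keys:
        contacted.update(replicas_for_key(key, num_nodes, rf)[:quorum])
    return contacted


if __name__ == "__main__":
    print(len(nodes_contacted([1])))           # one key: exactly 2 nodes
    print(len(nodes_contacted([1, 2, 3, 4])))  # four keys: anywhere from 2 to 8
```

In this model a single key always touches two nodes, while each extra key in the IN list can add up to two more, so the count grows with the number of keys until it is capped by the cluster size.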
So based on that, it would seem that this becomes more of a problem as your cluster gets larger.
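For comparison, the "multiple parallel queries" alternative might look roughly like the following with the DataStax Python driver (cassandra-driver). This is a minimal sketch, not a tuned implementation: the function name is hypothetical, the `tickets` table and `id` column come from the question, and the session is assumed to be created elsewhere.

```python
def fetch_tickets_parallel(session, ticket_ids):
    """Issue one single-partition query per ID and gather the rows.

    `session` is assumed to be a cassandra-driver Session, e.g.:
        from cassandra.cluster import Cluster
        session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical keyspace
    """
    # Prepare once; each execution then targets a single partition.
    select = session.prepare("SELECT * FROM tickets WHERE id = ?")

    # execute_async returns immediately, so all requests are in flight together.
    futures = [session.execute_async(select, (ticket_id,)) for ticket_id in ticket_ids]

    rows = []
    for future in futures:
        # result() blocks until that particular query has completed.
        rows.extend(future.result())
    return rows
```

Each request here is a single-partition read, so a token-aware driver can route it straight to a replica for that key rather than making one coordinator fan out for the whole IN list.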
Therefore, the best way to solve this problem (and not have to use IN at all) would be to rethink your data model for this query. Without knowing too much about your schema, perhaps there are attributes (column values) that are shared by ticket IDs 1, 2, 3, and 4. Maybe you could partition on something like level or group (if tickets are for a particular venue), or maybe even an event ID, instead.
Basically, while using a unique, high-cardinality identifier to partition your data sounds like a good idea, it actually makes it harder to query your data (in Cassandra) later on. If you could come up with a different column to partition your data on, that would certainly help you in this case. Regardless, creating a new, specific column family (table) to handle queries for those rows is going to be a better approach than using IN or multiple queries.
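As a rough illustration of such a query-specific table, the sketch below assumes tickets can be grouped by an event. The table name `tickets_by_event`, its columns, and the helper function are all hypothetical, since the original schema is not shown; the session is again assumed to be a cassandra-driver Session.

```python
# Hypothetical table: event_id is the partition key, ticket_id a clustering column,
# so all tickets for one event live in the same partition.
CREATE_TICKETS_BY_EVENT = """
CREATE TABLE IF NOT EXISTS tickets_by_event (
    event_id  int,
    ticket_id int,
    level     text,
    PRIMARY KEY ((event_id), ticket_id)
)
"""


def tickets_for_event(session, event_id):
    """Read every ticket for one event with a single-partition query."""
    # Non-prepared statements in the Python driver use %s placeholders.
    result = session.execute(
        "SELECT * FROM tickets_by_event WHERE event_id = %s", (event_id,)
    )
    return list(result)
```

With this layout, fetching tickets 1 through 4 becomes one read of one partition, provided they belong to the same event, instead of an IN over four partitions.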