Cassandra: choosing a Partition Key

Question

I'm undecided whether it's better, performance-wise, to use a very commonly shared column value (like Country) as partition key for a compound primary key or a rather unique column value (like Last_Name).

Looking at Cassandra 1.2's documentation about indexes I get this:

"When to use an index: Cassandra's built-in indexes are best on a table having many rows that contain the indexed value. The more unique values that exist in a particular column, the more overhead you will have, on average, to query and maintain the index. For example, suppose you had a user table with a billion users and wanted to look up users by the state they lived in. Many users will share the same column value for state (such as CA, NY, TX, etc.). This would be a good candidate for an index."

"When not to use an index: Do not use an index to query a huge volume of records for a small number of results. For example, if you create an index on a column that has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion users, looking up users by their email address (a value that is typically unique for each user) instead of by their state, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load."

Looking at the examples from CQL's SELECT for "Querying compound primary keys and sorting results", I see something like a UUID being used as partition key, which would indicate that it's preferable to use something rather unique?

Answer

The indexing documentation you quoted refers to secondary indexes. In Cassandra, primary and secondary indexes are different things. For a secondary index it would indeed be bad to have very unique values; for the components of a primary key, however, it depends on which component we are looking at. The primary key has these components:

PRIMARY KEY(partitioning key, clustering key_1 ... clustering key_n)

The partitioning key is used to distribute data across different nodes. If you want your nodes to be balanced (i.e. data well distributed across the nodes), you want your partitioning key to be as random as possible. That is why the example you found uses UUIDs.
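To make the balancing argument concrete, here is a rough Python simulation. It is only a sketch under stated assumptions: the 4-node cluster, the 10,000 made-up users, and the "hash modulo node count" placement are all hypothetical simplifications (a real Cassandra partitioner assigns hashed tokens to token ranges rather than taking a modulo). It compares a widely shared value such as Country with a UUID-like value as partition key.

```python
import hashlib
import uuid
from collections import Counter

NUM_NODES = 4  # hypothetical cluster size, chosen only for illustration


def node_for(partition_key: str, num_nodes: int = NUM_NODES) -> int:
    """Pick a node by hashing the partition key.

    Simplification: a real partitioner maps hashed tokens to token
    ranges owned by nodes; modulo is used here to keep the sketch short.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def rows_per_node(partition_keys):
    """Count how many rows land on each node for the given keys."""
    counts = Counter(node_for(key) for key in partition_keys)
    return [counts.get(node, 0) for node in range(NUM_NODES)]


# 10,000 made-up users: only three distinct countries, unevenly populated.
countries = ["US"] * 7000 + ["DE"] * 2000 + ["FR"] * 1000

# The same number of users keyed by a deterministic UUID-like id.
user_ids = [str(uuid.uuid5(uuid.NAMESPACE_DNS, f"user-{i}")) for i in range(10000)]

by_country = rows_per_node(countries)
by_uuid = rows_per_node(user_ids)

print("rows per node, partitioned by country:", by_country)
print("rows per node, partitioned by user id:", by_uuid)
```

With country as the partition key, all rows for one country hash to the same node, so at most three nodes receive data and one of them holds at least 70% of it. With the UUID-like key, the rows spread roughly evenly over all four nodes.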

The clustering key is used for ordering within a partition, so that querying columns with a particular clustering key can be more efficient. That is where you want your values not to be unique, and where there would be a performance hit if unique rows were frequent.
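As a toy model of the ordering idea (not Cassandra's actual storage engine), the Python sketch below keeps the rows of each partition sorted by clustering key, so that a range of clustering keys can be located with a binary search instead of a full scan. The class name `ToyTable`, its methods, and the sample data are all hypothetical.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Toy model: partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        rows = self._partitions[partition_key]
        # The key list is rebuilt on every call only to keep the sketch short.
        keys = [ck for ck, _ in rows]
        i = bisect.bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, value)  # same primary key: overwrite
        else:
            rows.insert(i, (clustering_key, value))  # keep sorted order

    def slice(self, partition_key, start, end):
        """Return rows of one partition with start <= clustering key <= end."""
        rows = self._partitions.get(partition_key, [])
        keys = [ck for ck, _ in rows]
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_right(keys, end)
        return rows[lo:hi]


table = ToyTable()
table.insert("US", "Smith", "row 1")
table.insert("US", "Adams", "row 2")
table.insert("US", "Jones", "row 3")
table.insert("DE", "Meyer", "row 4")

# Range query inside a single partition, returned in clustering order.
us_a_to_k = table.slice("US", "A", "K")
print(us_a_to_k)
```

Here Country plays the role of the partition key and Last_Name the clustering key, matching the columns from the question; the slice touches only the "US" partition and returns its rows already sorted.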

The CQL docs (cassandra.apache.org/doc/cql3/CQL.html#createTablepartitionClustering) have a good explanation of what is going on.
