High availability and performance considerations with secondary indexes in Cassandra


Question

I have the following setup: a 5-node Cassandra cluster with RF = 3. I created a secondary index on a column of the table 'user'.

1) From my reading on secondary indexes (https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive), I understand that a secondary index is stored on the local node. Does that mean that in a five-node cluster the secondary index is available on only one node? If not, with RF = 3 for the user table, on how many nodes will the secondary index table be available?

2) How do the following two queries differ in execution?

    CREATE TABLE user (
        user_group int PRIMARY KEY,
        user_name text,
        user_phone varint
    );

    CREATE INDEX username_idx ON user (user_name);

In this table setup,

Query 1 : SELECT * FROM user WHERE user_name = 'test';

Query 2 : SELECT * FROM user WHERE user_group = 1 AND user_name = 'test';

How many nodes (in the 5-node cluster) will each of the two queries above pass through during execution, and how do the two queries differ in performance?

Edit:

Say I have a table like below,

CREATE TABLE nodestat (
    uniqueId text,
    totalCapacity int,
    physicalUsage int,
    flashMode text,
    timestamp timestamp,
    primary key (uniqueId, timestamp)) 
    with clustering order by (timestamp desc);

CREATE CUSTOM INDEX nodeIp_idx ON nodestat(flashMode)

Query 3 : select * from nodestat where uniqueId = 'test' AND flashMode = 'yes'

In this case I always have only one partition in the table. How does the secondary index search differ from a secondary index search without the partition key? How efficient is it?

Solution

Regarding your Question 1:

Does it mean that in the five node cluster only in one node the secondary index will be available?

The secondary index is available on every node of the cluster. Each node builds it over the data stored on that node, and it is local to that node; that is, it is aware only of the primary keys on that particular node. You can think of the secondary index as a lookup table with references to the primary keys on that node.

So every node builds its own secondary index (in your case all 5), but no node is aware of the others' references.

If not in the RF =3 for user table, In how many nodes the Secondary Index table will be available?

There is no replication factor for secondary indexes, since each index is local to its node. Because your data is already replicated with RF = 3, each row is indexed on every node that holds a replica of it, so a given row appears in the local indexes of 3 nodes.

Regarding your Question 2:
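To make the "one local index per node" idea concrete, here is a minimal Python sketch. It is a toy model, not Cassandra's implementation: the modulo-based `replicas` placement, the sample rows, and all variable names are assumptions made for illustration (real Cassandra places replicas using hashed tokens on a ring).

```python
NODES = 5  # cluster size from the question
RF = 3     # replication factor from the question

def replicas(user_group):
    """Toy placement (assumption): first replica by modulo, then the next RF-1 nodes."""
    first = user_group % NODES
    return [(first + i) % NODES for i in range(RF)]

# Hypothetical rows of the `user` table: user_group -> user_name
rows = {1: "test", 2: "alice", 3: "test", 4: "bob"}

# One independent index per node: user_name -> set of partition keys stored locally
local_index = [dict() for _ in range(NODES)]
for user_group, user_name in rows.items():
    for node in replicas(user_group):
        local_index[node].setdefault(user_name, set()).add(user_group)

# How many nodes hold an index at all, and which nodes index the row user_group = 1
nodes_with_index = sum(1 for idx in local_index if idx)
nodes_indexing_group_1 = [
    n for n, idx in enumerate(local_index) if 1 in idx.get("test", set())
]

print(nodes_with_index)
print(nodes_indexing_group_1)
```

With this sample data all 5 nodes end up with a non-empty index, and the row with user_group = 1 is indexed on exactly 3 of them, one per replica.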

Query 1 : SELECT * FROM user WHERE user_name = 'test';

This query performs a scatter-gather across the nodes in the cluster. Since secondary indexes are local to each node, every node (in your case all 5) has to execute the query -> perform a secondary index lookup to find the matching partition keys -> then return the actual results to the coordinator.

As the table grows, this query will often time out. In extreme cases it can bring down a node (just like "select *" without a partition key). Hence secondary indexes and this type of query (without a partition key) are generally discouraged in Cassandra, and it is better to avoid them.

Query 2 : SELECT * FROM user WHERE user_group = 1 AND user_name = 'test';

This query performs better than the previous one because it filters on the partition key. The table definition above has no clustering column, so there is only one row per partition and the query simply filters on the primary key. Hence the secondary index does not add much here. Overall, it is not a scatter-gather query and therefore performs much better.

Edited to explain Query 3:
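The difference between the two queries can be sketched with a small Python model. This is only an illustration under stated assumptions: the modulo placement, the sample rows, and the function names `query_by_name` and `query_by_group_and_name` are hypothetical, Query 1 is modeled as contacting every node (as described above), and Query 2 is modeled as reading from a single replica, which roughly corresponds to consistency level ONE.

```python
NODES = 5  # cluster size from the question
RF = 3     # replication factor from the question

def replicas(user_group):
    """Toy placement (assumption): first replica by modulo, then the next RF-1 nodes."""
    first = user_group % NODES
    return [(first + i) % NODES for i in range(RF)]

# Hypothetical rows of the `user` table: user_group -> user_name
rows = {1: "test", 2: "alice", 3: "test", 4: "bob"}

# node_data[n] holds the rows stored on node n
node_data = [dict() for _ in range(NODES)]
for group, name in rows.items():
    for n in replicas(group):
        node_data[n][group] = name

def query_by_name(user_name):
    """Query 1: no partition key, so every node's local data has to be consulted."""
    contacted = list(range(NODES))
    found = set()
    for n in contacted:
        found |= {g for g, name in node_data[n].items() if name == user_name}
    return sorted(found), len(contacted)

def query_by_group_and_name(user_group, user_name):
    """Query 2: the partition key routes the read to one replica of that partition."""
    node = replicas(user_group)[0]
    name = node_data[node].get(user_group)
    found = [user_group] if name == user_name else []
    return found, 1

print(query_by_name("test"))
print(query_by_group_and_name(1, "test"))
```

Each function returns the matching partition keys together with the number of nodes it had to contact, which is the quantity the question asks about.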

Query 3 : select * from nodestat where uniqueId = 'test' AND flashMode = 'yes'

In this query the secondary index is used in conjunction with the partition key. The index helps when thousands of clustering-column values (rows) exist for a given partition key and we want to narrow down the result set quickly. Remember that the secondary index stores the entire primary key (partition key + clustering column reference). So for a wide partition, this secondary index proves useful when used alongside the partition key.

For example, in your case, say there is only one partition, uniqueId = 'test'. Within that partition there may be 10000 different timestamp values (the clustering column), and therefore 10000 rows, each with its own "flashMode" value. The secondary index helps narrow those 10000 rows down to the ones in partition 'test' whose "flashMode" is "yes".
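The wide-partition case can be sketched in Python as follows. This is a simplified model rather than Cassandra's on-disk layout: the start date, the one-minute spacing, and the choice that one row in every hundred has flashMode "yes" are made-up sample data, and the `index` dictionary only imitates the idea that an index entry carries the full primary key.

```python
from datetime import datetime, timedelta

partition_key = "test"
start = datetime(2024, 1, 1)  # arbitrary start time (assumption)

# One wide partition: 10000 rows keyed by the clustering column `timestamp`.
# Sample data (assumption): every 100th row has flashMode "yes".
partition = {
    start + timedelta(minutes=i): ("yes" if i % 100 == 0 else "no")
    for i in range(10_000)
}

# Index model: flashMode -> list of full primary keys (partition key, timestamp)
index = {}
for ts, flash_mode in partition.items():
    index.setdefault(flash_mode, []).append((partition_key, ts))

# Query 3: take the index entries for "yes", keep those in partition 'test',
# then read only those rows instead of scanning all 10000.
hits = [ts for (pk, ts) in index.get("yes", []) if pk == "test"]
matching_rows = [(ts, partition[ts]) for ts in hits]

print(len(partition), len(matching_rows))
```

The point of the sketch is that the lookup touches only the index entries for "yes" in one partition, rather than every row of the partition.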
