High Availability & Performance Considerations with Secondary Indexes in Cassandra


Question

I have a 5-node Cassandra cluster with RF = 3, and I created a secondary index on a column of the table 'user'.

1) From my reading on secondary indexes at https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive, I understood that secondary indexes are stored on the local node. Does that mean that in the five-node cluster the secondary index will be available on only one node? If not, given RF = 3 for the user table, on how many nodes will the secondary index table be available?

2) How do the following two queries differ in execution?

CREATE TABLE user (
    user_group int PRIMARY KEY,
    user_name text,
    user_phone varint
);

CREATE INDEX username_idx ON user (user_name);

With this table setup:

Query 1 : SELECT * FROM user WHERE user_name = 'test';

Query 2 : SELECT * FROM user WHERE user_group = 1 AND user_name = 'test';

How many nodes (in the 5-node cluster) will each of the two queries above pass through during execution, and how do the two queries differ in performance?

Edit:

Say I have a table like the one below:

CREATE TABLE nodestat (
    uniqueId text,
    totalCapacity int,
    physicalUsage int,
    flashMode text,
    timestamp timestamp,
    primary key (uniqueId, timestamp)) 
    with clustering order by (timestamp desc);

CREATE INDEX nodeIp_idx ON nodestat (flashMode);

Query 3 : select * from nodestat where uniqueId = 'test' AND flashMode = 'yes'

In this case I always have only one partition in the table, so how does the secondary index search differ from a secondary index search without the partition key? How efficient is it?

Recommended answer

Regarding your Question 1:

Does it mean that in the five-node cluster the secondary index will be available on only one node?

The secondary index is available on every node of the cluster. Each node builds it from the data that node holds, and it is local to that node. That is, it is aware only of the primary keys stored on that particular node. You can think of the secondary index as a lookup table with references to the primary keys on that node.

So every node builds its own secondary index (in your case all 5), but each node is unaware of the others' references.
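The idea of a node-local index can be sketched with a toy model. This is only an illustration, not how Cassandra is implemented: the `replicas` function and its modulo placement are assumptions standing in for token hashing, and the sample rows are made up.

```python
from collections import defaultdict

NODES = 5
RF = 3

def replicas(partition_key: int) -> list[int]:
    """Toy placement (assumption): pick a node by modulo, then the next RF-1 nodes."""
    first = partition_key % NODES  # stand-in for Cassandra's token hashing
    return [(first + i) % NODES for i in range(RF)]

# Made-up rows of the `user` table: user_group -> user_name
rows = {1: "test", 2: "alice", 3: "test", 7: "bob"}

# Each node indexes only the rows it stores, so every node has its own index.
local_index = [defaultdict(set) for _ in range(NODES)]
for user_group, user_name in rows.items():
    for node in replicas(user_group):
        local_index[node][user_name].add(user_group)

for node, index in enumerate(local_index):
    print(f"node {node}: {dict(index)}")
```

In this model each row is indexed on exactly RF nodes, once per replica, and no node's index knows about rows stored elsewhere.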

If not, given RF = 3 for the user table, on how many nodes will the secondary index table be available?

There is no replication factor for secondary indexes, since each index is local to its node. Because your data is already replicated with RF = 3, the secondary index on each node simply indexes whatever replicas that node holds.

Regarding your Question 2:

Query 1 : SELECT * FROM user WHERE user_name = 'test';

This query performs a scatter-gather across all nodes in the cluster. Since secondary indexes are local to each node, every node (in your case all 5) has to execute the query, perform a secondary index lookup to find the matching partition keys, and then send the actual results back to the coordinator.

As the table grows, this query often times out. In extreme cases it can bring down a node (just like a "select *" without a partition key). Hence secondary indexes and this type of query (without a partition key) are generally discouraged in Cassandra and are better avoided.

Query 2 : SELECT * FROM user WHERE user_group = 1 AND user_name = 'test';

This query will perform better than the previous one because it filters on the partition key. The table definition above has no clustering column, so there is only one row per partition and the query effectively just filters on the primary key. Hence the secondary index adds little here. Overall, it is not a scatter-gather query and therefore performs much better.
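The difference in fan-out between the two queries can be sketched as follows. This is a toy model under stated assumptions: `replicas` and `nodes_for_query` are hypothetical helpers, and modulo placement stands in for Cassandra's token ring.

```python
NODES = 5
RF = 3

def replicas(partition_key):
    """Toy placement (assumption): modulo instead of real token hashing."""
    first = partition_key % NODES
    return [(first + i) % NODES for i in range(RF)]

def nodes_for_query(partition_key=None):
    """Return the nodes that may hold matching rows in this toy model."""
    if partition_key is None:
        # Query 1: only the indexed column is restricted, so matches
        # could live on any node and every local index must be consulted.
        return list(range(NODES))
    # Query 2: the partition key pins the query to that partition's replicas.
    return replicas(partition_key)

print(nodes_for_query())                 # candidate nodes for Query 1
print(nodes_for_query(partition_key=1))  # candidate nodes for Query 2
```

For Query 2 the coordinator only needs to read from as many of those replicas as the consistency level requires, which is why the query stays cheap as the cluster grows.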

Edit: explanation of Query 3

Query 3 : select * from nodestat where uniqueId = 'test' AND flashMode = 'yes'

In this query the secondary index is used in conjunction with the partition key. The index helps when thousands of clustering-column values (rows) exist for a given partition key and we want to narrow down the result set quickly. Remember that the secondary index stores the entire primary key (partition key + clustering column reference). So for a wide partition, this secondary index proves useful when used alongside the partition key.

For example, in your case, say there is only one partition, uniqueId = 'test'. Within that partition, say there are 10000 different timestamp values (the clustering column), so there are potentially 10000 rows, each with its own "flashMode" value. The secondary index helps narrow those 10000 rows in partition 'test' down to the ones whose "flashMode" is "yes".
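A minimal sketch of that narrowing, assuming a simplified in-memory index. The data is made up (every 100th row has flashMode = 'yes'), and the dictionary layout is illustrative only, not Cassandra's on-disk format.

```python
from collections import defaultdict

# One wide partition: uniqueId = 'test', 10000 rows keyed by timestamp.
# Assumption for illustration: every 100th row has flashMode = 'yes'.
partition = {
    ("test", ts): {"flashMode": "yes" if ts % 100 == 0 else "no"}
    for ts in range(10_000)
}

# Local index: indexed value -> full primary keys (partition key + clustering key).
index = defaultdict(list)
for primary_key, row in partition.items():
    index[row["flashMode"]].append(primary_key)

# Query 3: uniqueId = 'test' AND flashMode = 'yes'
# The index yields the candidate primary keys directly instead of
# scanning all 10000 rows of the partition.
matches = [pk for pk in index["yes"] if pk[0] == "test"]
print(len(matches), "of", len(partition), "rows match")
```

Because the index entry carries the full primary key, the matching rows can be fetched by key rather than by walking the whole partition.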
