How do secondary indexes work in Cassandra?


Problem description

Suppose I have a column family:

CREATE TABLE update_audit (
  scopeid bigint,
  formid bigint,
  time timestamp,
  record_link_id bigint,
  ipaddress text,
  user_zuid bigint,
  value text,
  PRIMARY KEY ((scopeid, formid), time)
  ) WITH CLUSTERING ORDER BY (time DESC);

With two secondary indexes, where record_link_id is a high-cardinality column:

CREATE INDEX update_audit_id_idx ON update_audit (record_link_id);

CREATE INDEX update_audit_user_zuid_idx ON update_audit (user_zuid);

As far as I know, Cassandra will create two hidden column families like so:

CREATE TABLE update_audit_id_idx(
    record_link_id bigint,
    scopeid bigint,
    formid bigint,
    time timestamp,
    PRIMARY KEY ((record_link_id), scopeid, formid, time)
);

CREATE TABLE update_audit_user_zuid_idx(
    user_zuid bigint,
    scopeid bigint,
    formid bigint,
    time timestamp,
    PRIMARY KEY ((user_zuid), scopeid, formid, time)
);

Cassandra secondary indexes are implemented as local indexes rather than being distributed like normal tables. Each node only stores an index for the data it stores.

Consider the following query:

select * from update_audit where scopeid=35 and formid=78005 and record_link_id=9897;

  1. How will this query execute 'under the hood' in Cassandra?
  2. How will a high-cardinality column index (record_link_id) affect its performance?
  3. Will Cassandra touch all nodes for the above query? Why?
  4. Which criteria will be executed first, base table partition_key or secondary index partition_key? How will Cassandra intersect these two results?

Solution

select * from update_audit where scopeid=35 and formid=78005 and record_link_id=9897;

How will the above query work internally in Cassandra?

Essentially, all data for the partition scopeid=35 and formid=78005 will be returned, and then filtered by the record_link_id index. Cassandra will look up the record_link_id entry for 9897 and attempt to match its entries against the rows returned where scopeid=35 and formid=78005. The intersection of the rows for the partition key and the index key will be returned.

How will the high-cardinality column (record_link_id) index affect the performance of the above query?

High-cardinality indexes essentially create a row for (almost) each entry in the main table. Performance is affected, because Cassandra is designed to perform sequential reads for query results. An index query essentially forces Cassandra to perform random reads. As cardinality of your indexed value increases, so does the time it takes to find the queried value.

Will Cassandra touch all nodes for the above query? Why?

No. It should only touch a node that is responsible for the scopeid=35 and formid=78005 partition. Indexes are likewise stored locally and contain only entries that are valid for the local node.
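
One way to see the difference on a test cluster is to trace the query with and without the partition key. This is a minimal sketch, assuming you are in cqlsh against the update_audit table and indexes from the question; TRACING ON is a cqlsh shell command rather than a CQL statement, and the trace output lists the source node of each step.

```cql
-- cqlsh shell command: print a trace for each following statement
TRACING ON;

-- Partition key supplied: the coordinator can target the replicas that
-- own (scopeid=35, formid=78005) and use their local index entries.
SELECT * FROM update_audit
 WHERE scopeid = 35 AND formid = 78005 AND record_link_id = 9897;

-- No partition key: nothing identifies the owning replicas, so the local
-- indexes on nodes across the cluster have to be consulted.
SELECT * FROM update_audit
 WHERE record_link_id = 9897;

TRACING OFF;
```

Comparing the set of source nodes in the two traces should show the first query staying on the replicas for one partition and the second one fanning out.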

Creating an index over high-cardinality columns will be the fastest and best data model

The problem here is that this approach does not scale, and it will be slow if update_audit is a large dataset. MVP Richard Low has a great article on secondary indexes (The Sweet Spot For Cassandra Secondary Indexing), and particularly on this point:

If your table was significantly larger than memory, a query would be very slow even to return just a few thousand results. Returning potentially millions of users would be disastrous even though it would appear to be an efficient query.

...

In practice, this means indexing is most useful for returning tens, maybe hundreds of results. Bear this in mind when you next consider using a secondary index.

Now, your approach of first restricting by a specific partition will help (as your partition should certainly fit into memory). But I feel the better-performing choice here would be to make record_link_id a clustering key, instead of relying on a secondary index.
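
A minimal sketch of that alternative follows. The table name update_audit_by_record is hypothetical, and the columns are simply copied from the question's update_audit table; treat it as one possible layout rather than a definitive model.

```cql
-- Hypothetical variant of update_audit: record_link_id becomes the first
-- clustering column, so no secondary index is needed to look it up.
CREATE TABLE update_audit_by_record (
  scopeid bigint,
  formid bigint,
  record_link_id bigint,
  time timestamp,
  ipaddress text,
  user_zuid bigint,
  value text,
  PRIMARY KEY ((scopeid, formid), record_link_id, time)
) WITH CLUSTERING ORDER BY (record_link_id ASC, time DESC);

-- Same lookup as in the question, now served by the primary key alone.
SELECT * FROM update_audit_by_record
 WHERE scopeid = 35 AND formid = 78005 AND record_link_id = 9897;
```

The trade-off is that rows inside a partition are now ordered by record_link_id first and by time only within each record_link_id, so a query for the newest entries across the whole partition is no longer a simple slice. If both access patterns are needed, keeping two tables (one per query) is a common way to handle it.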

Edit

How does an index on a low-cardinality column scale when there are millions of users, even when we provide the primary key?

It will depend on how wide your rows are. The tricky thing about extremely low-cardinality indexes is that the percentage of rows returned is usually greater. For instance, consider a wide-row users table. You restrict by the partition key in your query, but there are still 10,000 rows returned. If your index is on something like gender, your query will have to filter out about half of those rows, which won't perform well.

Secondary indexes tend to work best on (for lack of a better description) "middle of the road" cardinality. Using the above example of a wide-row users table, an index on country or state should perform much better than an index on gender (assuming that most of those users don't all live in the same country or state).
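
To make the example concrete, here is a small sketch of such a table. Every name in it (users_by_group, group_id, user_id, the index name, and the sample values) is hypothetical and only stands in for the "wide-row users table" described above.

```cql
-- Hypothetical wide-row users table: many users per group_id partition.
CREATE TABLE users_by_group (
  group_id bigint,
  user_id bigint,
  gender text,
  country text,
  PRIMARY KEY ((group_id), user_id)
);

-- Index on a "middle of the road" cardinality column.
CREATE INDEX users_by_group_country_idx ON users_by_group (country);

-- Restricted to one partition, then narrowed by the indexed column.
-- With users spread over many countries this keeps a small share of the
-- partition's rows; an index on gender would keep roughly half of them.
SELECT * FROM users_by_group
 WHERE group_id = 1 AND country = 'NZ';
```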

Edit 20180913

Regarding your answer to the first question, "How will the above query work internally in Cassandra?", do you know what the behavior is when the query uses pagination?

Consider the following diagram, taken from the Java Driver documentation (v3.6):

Basically, paging will cause the query to break itself up and return to the cluster for the next iteration of results. It would be less likely to time out, but performance will trend downward, proportional to the size of the total result set and the number of nodes in the cluster.

TL;DR: The more nodes the requested results are spread over, the longer the query will take.
