How do secondary indexes work in Cassandra?


Question


    Suppose I have a column family:

    CREATE TABLE update_audit (
      scopeid bigint,
      formid bigint,
      time timestamp,
      record_link_id bigint,
      ipaddress text,
      user_zuid bigint,
      value text,
      PRIMARY KEY ((scopeid, formid), time)
    ) WITH CLUSTERING ORDER BY (time DESC);
    

    With two secondary indexes, where record_link_id is a high-cardinality column:

    CREATE INDEX update_audit_id_idx ON update_audit (record_link_id);
    
    CREATE INDEX update_audit_user_zuid_idx ON update_audit (user_zuid);
    

    According to my knowledge Cassandra will create two hidden column families like so:

    CREATE TABLE update_audit_id_idx(
        record_link_id bigint,
        scopeid bigint,
        formid bigint,
        time timestamp,
        PRIMARY KEY ((record_link_id), scopeid, formid, time)
    );
    
    CREATE TABLE update_audit_user_zuid_idx(
        user_zuid bigint,
        scopeid bigint,
        formid bigint,
        time timestamp,
        PRIMARY KEY ((user_zuid), scopeid, formid, time)
    );
    

    Cassandra secondary indexes are implemented as local indexes rather than being distributed like normal tables. Each node only stores an index for the data it stores.

    Consider the following query:

    select * from update_audit where scopeid=35 and formid=78005 and record_link_id=9897;
    

    1. How will this query execute 'under the hood' in Cassandra?
    2. How will a high-cardinality column index (record_link_id) affect its performance?
    3. Will Cassandra touch all nodes for the above query? Why?
    4. Which criterion will be evaluated first: the base table partition key or the secondary index partition key? How will Cassandra intersect these two results?

    Solution

    select * from update_audit where scopeid=35 and formid=78005 and record_link_id=9897;
    

    How will the above query work internally in Cassandra?

    Essentially, all data for the partition scopeid=35 and formid=78005 will be returned, and then filtered by the record_link_id index. Cassandra will look up the index entry for record_link_id 9897 and match its entries against the rows returned for scopeid=35 and formid=78005. The intersection of the rows for the partition key and the index key is what gets returned.

    How will a high-cardinality column index (record_link_id) affect performance for the above query?

    High-cardinality indexes essentially create a row for (almost) each entry in the main table. Performance is affected because Cassandra is designed to perform sequential reads for query results, while an index query essentially forces Cassandra to perform random reads. As the cardinality of your indexed value increases, so does the time it takes to find the queried value.

    Will Cassandra touch all nodes for the above query? Why?

    No. It should only touch a node that is responsible for the scopeid=35 and formid=78005 partition. Indexes are likewise stored locally and only contain entries that are valid for the local node.
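
    One way to check which nodes actually take part is cqlsh's request tracing. The session below is a minimal sketch, assuming a cqlsh prompt connected to the cluster with the keyspace already selected. The second, index-only query is added here purely for contrast and is not part of the original question.

    ```cql
    -- TRACING is a cqlsh command, not a CQL statement.
    TRACING ON;

    -- Partition key plus indexed column: the request can be routed
    -- to the replicas that own the (35, 78005) partition.
    SELECT * FROM update_audit
    WHERE scopeid = 35 AND formid = 78005 AND record_link_id = 9897;

    -- Indexed column only (added for contrast): there is no partition
    -- key to route on, so the coordinator has to ask other nodes as well.
    SELECT * FROM update_audit
    WHERE record_link_id = 9897;

    TRACING OFF;
    ```

    The trace printed after each query lists a source node for every step, so comparing the two traces shows how far each request fanned out.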

    Creating an index over high-cardinality columns will be the fastest and best data model

    The problem here is that this approach does not scale, and will be slow if update_audit is a large dataset. MVP Richard Low has a great article on secondary indexes (The Sweet Spot For Cassandra Secondary Indexing), and particularly on this point:

    If your table was significantly larger than memory, a query would be very slow even to return just a few thousand results. Returning potentially millions of users would be disastrous even though it would appear to be an efficient query.

    ...

    In practice, this means indexing is most useful for returning tens, maybe hundreds of results. Bear this in mind when you next consider using a secondary index.

    Now, your approach of first restricting by a specific partition will help (as your partition should certainly fit into memory). But I feel the better-performing choice here would be to make record_link_id a clustering key, instead of relying on a secondary index.
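
    The clustering-key alternative could look roughly like the sketch below. The table name update_audit_by_record is an assumption made for illustration; the columns are the ones from the question.

    ```cql
    -- Hypothetical alternative table: record_link_id becomes the first
    -- clustering column, so no secondary index is needed for this lookup.
    CREATE TABLE update_audit_by_record (
        scopeid bigint,
        formid bigint,
        record_link_id bigint,
        time timestamp,
        ipaddress text,
        user_zuid bigint,
        value text,
        PRIMARY KEY ((scopeid, formid), record_link_id, time)
    ) WITH CLUSTERING ORDER BY (record_link_id ASC, time DESC);

    -- Same predicate as the original query, now served by the primary key alone.
    SELECT * FROM update_audit_by_record
    WHERE scopeid = 35 AND formid = 78005 AND record_link_id = 9897;
    ```

    The trade-off is that rows within a partition are now ordered by record_link_id first and by time only within each record_link_id, so a query that needs the whole partition in time order would still want the original table layout.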

    Edit

    How does an index on a low-cardinality column scale when there are millions of users, even when we provide the primary key?

    It will depend on how wide your rows are. The tricky thing about extremely low-cardinality indexes is that the percentage of rows returned is usually greater. For instance, consider a wide-row users table. You restrict by the partition key in your query, but there are still 10,000 rows returned. If your index is on something like gender, your query will have to filter out about half of those rows, which won't perform well.

    Secondary indexes tend to work best on (for lack of a better description) "middle of the road" cardinality. Using the above example of a wide-row users table, an index on country or state should perform much better than an index on gender (assuming that those users don't all live in the same country or state).
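
    To make the comparison concrete, here is a sketch of such a wide-row users table. Every name in it (users_by_org, org_id, user_id, the index names, and the sample values) is hypothetical and chosen only for illustration.

    ```cql
    -- Hypothetical wide-row table: many users stored under one partition key.
    CREATE TABLE users_by_org (
        org_id bigint,
        user_id bigint,
        gender text,
        country text,
        PRIMARY KEY ((org_id), user_id)
    );

    -- Very low cardinality: any given value matches a large share of the partition.
    CREATE INDEX users_by_org_gender_idx ON users_by_org (gender);

    -- "Middle of the road" cardinality: a given value matches a much smaller share.
    CREATE INDEX users_by_org_country_idx ON users_by_org (country);

    -- Restricting by partition key and by the more selective indexed column.
    SELECT * FROM users_by_org
    WHERE org_id = 1 AND country = 'NZ';
    ```

    With 10,000 users in the partition, the gender predicate would still leave roughly half of them, while the country predicate narrows the result far more, provided the users are spread across many countries.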
