Allow filtering, data modeling in CQL


Question

I'm currently using and researching data modeling practices in Cassandra. So far, I understand that you need to model your data around the queries you will execute. However, multiple select requirements make data modeling harder, or even impossible, to handle with 1 table. So, when you can't handle these requirements with 1 table, you need to insert into 2-3 tables. In other words, you need to make multiple inserts for 1 operation.

Currently, I'm dealing with a data model for a campaign structure. I have a campaign table in Cassandra defined with the following CQL:

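CREATE TABLE campaign_users
(
    created_at timeuuid,
    campaign_id int,
    uid bigint,
    updated_at timestamp,
    PRIMARY KEY (campaign_id, uid)
);

-- CQL has no inline INDEX clause; a secondary index is created separately.
-- Assumed intent: a secondary index on created_at.
CREATE INDEX ON campaign_users (created_at);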

In this model, I need to be able to make incremental exports given only a timestamp. Cassandra has an ALLOW FILTERING clause that enables select queries like this against secondary indexes. So, my CQL statement for the incremental export is the following:

select campaign_id, uid 
from campaign_users
where created_at > minTimeuuid('2013-08-14 12:26:06+0000') allow filtering;

However, if ALLOW FILTERING is used, there is a warning saying that the statement may have unpredictable performance. So, is relying on ALLOW FILTERING good practice? What are the alternatives?

Recommended answer

The ALLOW FILTERING warning appears because Cassandra is internally skipping over data, rather than using an index and seeking. This is unpredictable because you don't know how much data Cassandra is going to skip over per row returned. In the worst case, you could be scanning through all your data to return zero rows. This is in contrast to operations without ALLOW FILTERING (apart from SELECT COUNT queries), where the amount of data read scales linearly with the amount of data returned.

This is OK if you're returning most of the data, because the data skipped over doesn't cost very much. But if you are skipping over most of your data, a lot of work is wasted.

The alternative is to include time in the first component of your primary key (the partition key), in buckets. For example, you could have day buckets and issue your query once for each day that contains data you need. This method guarantees that most of the data Cassandra reads is data that you want. The drawback is that all data for a bucket (e.g. a day) needs to fit in one partition. You can work around this by sharding the partition somehow, e.g. by including some aspect of the uid in the partition key.
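The bucketing idea can be sketched roughly as follows. This is only an illustration under assumptions, not a definitive schema: the table name `campaign_users_by_day`, the `day` and `shard` columns, the shard count of 4, and the helper functions are all hypothetical and do not come from the original post. The sketch builds one query per (day, shard) partition, each of which restricts the full partition key and ranges over the first clustering column, so no ALLOW FILTERING should be needed.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

NUM_SHARDS = 4  # hypothetical: number of partitions each day is split into

# Hypothetical bucketed table; every name here is illustrative.
CREATE_TABLE = """
CREATE TABLE campaign_users_by_day (
    day text,             -- day bucket, e.g. '2013-08-14'
    shard int,            -- derived from uid at write time
    created_at timeuuid,
    campaign_id int,
    uid bigint,
    PRIMARY KEY ((day, shard), created_at, campaign_id, uid)
);
"""

# '?' bind markers assume the statement is prepared by whatever driver is used.
EXPORT_QUERY = (
    "SELECT campaign_id, uid FROM campaign_users_by_day "
    "WHERE day = ? AND shard = ? AND created_at > minTimeuuid(?)"
)


def shard_for(uid: int) -> int:
    """Pick the shard for a row at write time from its uid."""
    return uid % NUM_SHARDS


def day_buckets(since: datetime, until: datetime) -> List[str]:
    """Return every UTC day bucket touched by [since, until].

    Both arguments are expected to be timezone-aware datetimes.
    """
    first = since.astimezone(timezone.utc).date()
    last = until.astimezone(timezone.utc).date()
    span = (last - first).days
    return [(first + timedelta(days=i)).isoformat() for i in range(span + 1)]


def export_queries(since: datetime, until: datetime) -> List[Tuple[str, tuple]]:
    """Build one (cql, params) pair per (day, shard) partition to read."""
    return [
        (EXPORT_QUERY, (day, shard, since))
        for day in day_buckets(since, until)
        for shard in range(NUM_SHARDS)
    ]
```

An incremental export would then run each returned pair against the cluster and merge the results. The cost is more queries per export and a duplicate write alongside the original `campaign_users` table, which is the trade-off the question already describes.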
