ALLOW FILTERING, data modeling in CQL


Problem description


I'm currently using and researching data modeling practices in Cassandra. So far, I understand that you need to model your data based on the queries you execute. However, multiple select requirements make data modeling harder, or even impossible, to handle with one table. So, when you can't satisfy these requirements with one table, you need to insert into 2-3 tables. In other words, you need to make multiple inserts for one operation.

Currently, I'm dealing with a data model for a campaign structure. I have a campaign table in Cassandra with the following CQL:

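CREATE TABLE campaign_users
(
    created_at timeuuid,
    campaign_id int,
    uid bigint,
    updated_at timestamp,
    PRIMARY KEY (campaign_id, uid)
);

-- CQL has no inline INDEX clause, and a built-in secondary index covers a
-- single column, so the index is declared separately here, on created_at only
-- (campaign_id is already the partition key).
CREATE INDEX ON campaign_users (created_at);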

In this model, I need to be able to make incremental exports given only a timestamp. Cassandra has an ALLOW FILTERING option that enables select queries like this on secondary indexes. So, my CQL statement for the incremental export is the following:

select campaign_id, uid 
from campaign_users
where created_at > minTimeuuid('2013-08-14 12:26:06+0000') allow filtering;

However, if ALLOW FILTERING is used, there is a warning saying that the statement may have unpredictable performance. So, is it good practice to rely on ALLOW FILTERING? What are the alternatives?

Solution

The ALLOW FILTERING warning appears because Cassandra is internally skipping over data, rather than using an index and seeking. This is unpredictable because you don't know how much data Cassandra is going to skip over per row returned. In the worst case, you could be scanning through all your data to return zero rows. This is in contrast to operations without ALLOW FILTERING (apart from SELECT COUNT queries), where the amount of data read scales linearly with the amount of data returned.

This is OK if you're returning most of the data, because the data skipped over doesn't cost very much. But if you are skipping over most of your data, a lot of work will be wasted.
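To make the cost difference concrete, here is a toy Python sketch. It is not Cassandra and does not model its storage engine; the in-memory `ROWS` list, the `BY_DAY` grouping, and both function names are made up for illustration. It only counts how many rows each approach has to read to return the same result.

```python
from collections import defaultdict

# Toy in-memory model, not Cassandra: 100 days x 10 users = 1000 rows.
ROWS = [(day, uid) for day in range(100) for uid in range(10)]

# The same rows grouped by day, standing in for a lookup by key.
BY_DAY = defaultdict(list)
for day, uid in ROWS:
    BY_DAY[day].append((day, uid))


def filtered_scan(rows, min_day):
    """Read every row and skip the ones that do not match."""
    rows_read = 0
    matches = []
    for day, uid in rows:
        rows_read += 1
        if day > min_day:
            matches.append((day, uid))
    return matches, rows_read


def keyed_lookup(by_day, days):
    """Read only the groups named by key; nothing is skipped."""
    rows_read = 0
    matches = []
    for day in days:
        group = by_day.get(day, [])
        rows_read += len(group)
        matches.extend(group)
    return matches, rows_read


if __name__ == "__main__":
    print(filtered_scan(ROWS, 98)[1])     # rows read by the scan
    print(keyed_lookup(BY_DAY, [99])[1])  # rows read by the keyed lookup
```

In this toy data set both calls return the same ten rows, but the scan touches all 1000 rows while the keyed lookup touches only the ten it returns. With `min_day` set to 99 the scan still reads everything and returns nothing, which is the worst case described above.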

The alternative is to include time in the first component of your primary key, in buckets. For example, you could have day buckets and repeat your query for each day that contains data you need. This method guarantees that most of the data Cassandra reads is data that you want. The problem is that all data for a bucket (e.g. a day) needs to fit in one partition. You can fix this by sharding the partition somehow, e.g. by including some aspect of the uid in the partition key.
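A minimal sketch of the bucketing idea, under several assumptions that are not in the answer itself: a hypothetical table `campaign_users_by_day`, UTC day buckets stored as text, and a shard derived as `uid % NUM_SHARDS` with an arbitrary shard count of 4. The Python code only builds the CQL text and the bind values for each (day, shard) partition; it does not talk to a cluster. The `%s` placeholders assume the style used by the DataStax Python driver for non-prepared statements.

```python
from datetime import datetime, timedelta, timezone

NUM_SHARDS = 4  # assumption: sized so that one (day, shard) partition stays small

# Hypothetical table: the day bucket and the shard form the partition key, and
# created_at is a clustering column, so a range on it can seek within a partition.
CREATE_TABLE = """
CREATE TABLE campaign_users_by_day (
    day text,
    shard int,
    created_at timeuuid,
    campaign_id int,
    uid bigint,
    PRIMARY KEY ((day, shard), created_at, campaign_id, uid)
);
"""

SELECT = (
    "SELECT campaign_id, uid FROM campaign_users_by_day "
    "WHERE day = %s AND shard = %s AND created_at > minTimeuuid(%s)"
)


def shard_for(uid):
    """Shard a row is written to (assumed scheme: uid modulo NUM_SHARDS)."""
    return uid % NUM_SHARDS


def day_buckets(since, until):
    """UTC day buckets, as ISO date strings, covering since..until inclusive."""
    first = since.astimezone(timezone.utc).date()
    last = until.astimezone(timezone.utc).date()
    return [(first + timedelta(days=i)).isoformat()
            for i in range((last - first).days + 1)]


def export_queries(since, until):
    """One (query, params) pair per (day, shard) partition to read."""
    ts = since.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+0000")
    return [(SELECT, (day, shard, ts))
            for day in day_buckets(since, until)
            for shard in range(NUM_SHARDS)]


if __name__ == "__main__":
    since = datetime(2013, 8, 14, 12, 26, 6, tzinfo=timezone.utc)
    until = datetime(2013, 8, 16, tzinfo=timezone.utc)
    for query, params in export_queries(since, until):
        print(params)
```

Each query names a full partition key, so no ALLOW FILTERING is needed. The trade-off is that an export over N days issues N times NUM_SHARDS queries, and every write must compute the same day and shard values that the reads expect.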
