Cassandra: get all records in a time range

Question

I have to work with a column family that has (user_id, timestamp) as its key. In my query I would like to fetch all records in a given time range, independent of the user_id. This is the exact table schema:

CREATE TABLE userlog (
  user_id text,
  ts timestamp,
  action text,
  app_type text,
  channel_name text,
  channel_session_id text,
  pid text,
  region_id text,
  PRIMARY KEY (user_id, ts)
);

I tried running

SELECT * FROM userlog  WHERE ts >= '2013-01-01 00:00:00+0200' AND  ts <= '2013-08-13 23:59:00+0200' ALLOW FILTERING;

which works fine on my local Cassandra installation, which contains only a small data set, but fails with

Request did not complete within rpc_timeout.

on the production system containing all the data.

Is there a query, preferably in CQL, that runs smoothly with the given column family, or do we have to change the design?

Recommended answer

The timeout occurs because Cassandra takes longer than the configured timeout (10 seconds by default) to return the data. For your query, Cassandra attempts to fetch the entire result set before returning anything. With more than a few records, this can easily exceed the timeout.

For queries that produce a lot of data you need to page, e.g.

SELECT * FROM userlog WHERE ts >= '2013-01-01 00:00:00+0200' AND  ts <= '2013-08-13 23:59:00+0200' AND token(user_id) > previous_token LIMIT 100 ALLOW FILTERING;

where previous_token stands for the token of the last user_id returned by the previous page. You will also need to page on ts to guarantee that you get all the records for that last user_id.
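The manual paging described above could look roughly like the following sequence of queries. This is only a sketch: the user_id value 'u42' and the timestamp '2013-03-05 14:22:10+0200' are hypothetical placeholders for the last row of the previous page, and the page size of 100 is arbitrary.

```cql
-- Page 1: no token restriction yet.
SELECT * FROM userlog
WHERE ts >= '2013-01-01 00:00:00+0200' AND ts <= '2013-08-13 23:59:00+0200'
LIMIT 100 ALLOW FILTERING;

-- Assume the last row of page 1 was (user_id = 'u42', ts = '2013-03-05 14:22:10+0200').
-- First collect the rest of that user's rows in the range; LIMIT may have cut the partition short.
SELECT * FROM userlog
WHERE user_id = 'u42'
  AND ts > '2013-03-05 14:22:10+0200' AND ts <= '2013-08-13 23:59:00+0200';

-- Next page: continue with the partitions whose token follows that of 'u42'.
SELECT * FROM userlog
WHERE token(user_id) > token('u42')
  AND ts >= '2013-01-01 00:00:00+0200' AND ts <= '2013-08-13 23:59:00+0200'
LIMIT 100 ALLOW FILTERING;
```

The client repeats the last two steps, each time substituting the user_id and ts of the final row it received, until a page comes back with fewer than 100 rows.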

Alternatively, in Cassandra 2.0.0 (just released), paging is done transparently, so your original query should work without timing out and without manual paging.

The ALLOW FILTERING means Cassandra reads through all your data but returns only the rows within the specified range. This is efficient only if the range covers most of the data. If you wanted to find records within, e.g., a 5-minute time window, this would be very inefficient.
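For contrast, a time-range query that also fixes the partition key does not need ALLOW FILTERING with this schema, because ts is the clustering column within each user_id partition. A minimal sketch, assuming a hypothetical user_id value 'u42' and an arbitrary 5-minute window:

```cql
-- Restricted to one partition, so Cassandra reads only that user's rows in ts order.
SELECT * FROM userlog
WHERE user_id = 'u42'
  AND ts >= '2013-08-13 12:00:00+0200'
  AND ts <  '2013-08-13 12:05:00+0200';
```

This only helps when the user_id is known in advance. It does not answer the original question of fetching a time range across all users.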
