Select 2000 most recent log entries in a Cassandra table using CQL (latest version)


Problem description

How do you query and filter by timeuuid, i.e. assuming you have a table with

create table mystuff(uuid timeuuid primary key, stuff text);

i.e. how do you do this:

select uuid, unixTimestampOf(uuid), stuff
from mystuff
order by uuid desc
limit 2000

I also want to be able to fetch the next older 2000 and so on, but that's a different problem. The error is:

Bad Request: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.

And just in case it matters, the real table is actually this:

CREATE TABLE audit_event (
  uuid timeuuid PRIMARY KEY,
  event_time bigint,
  ip text,
  level text,
  message text,
  person_uuid timeuuid
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};


Recommended answer

I would suggest designing your table a bit differently.

At the moment, each entry in your audit_event table gets its own uuid as the partition key, so internally Cassandra creates many short rows. Querying such rows is inefficient, and they are also ordered by the hash of the key, which is effectively random (unless you use the Byte Ordered Partitioner, which you should avoid for good reasons).

However, Cassandra is pretty good at sorting columns. If (back to your example) you declared your table like this:

CREATE TABLE mystuff(
  yymmddhh varchar, 
  created timeuuid,  
  stuff text, 
  PRIMARY KEY(yymmddhh, created)
);

Internally, Cassandra would create one row per hour: the row key would be the hour of a day, the column names would be the actual created timestamps, and the data would be the stuff. That makes it efficient to query.
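As a minimal sketch of how a client could derive that hourly row key, assuming the `yymmddhh` format used in the examples here (two-digit year, month, day, hour) and a hypothetical helper name `hour_bucket`:

```python
from datetime import datetime, timezone

def hour_bucket(moment: datetime) -> str:
    """Return the yymmddhh row key for the hour that contains `moment`."""
    return moment.strftime("%y%m%d%H")

# Any moment inside 16:00-16:59 on 2013-08-16 maps to the same row key.
example = datetime(2013, 8, 16, 16, 5, 0, tzinfo=timezone.utc)
print(hour_bucket(example))
```

Whichever time zone you pick for the key, writers and readers have to agree on it, otherwise queries will look in the wrong hour.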

Consider that you have the following data (to keep it simple I won't go to 2k records, but the idea is the same):

insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '90');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '91');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '92');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '93');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '94');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '95');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '96');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '97');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '98');

Now let's say that we want to select the last two entries (let's assume for the moment that we know the "latest" row key to be '13081616'). You can do it by executing a query like this:

SELECT * FROM mystuff WHERE yymmddhh = '13081616' ORDER BY created DESC LIMIT 2 ;

It should give you something like this:

 yymmddhh | created                              | stuff
----------+--------------------------------------+-------
 13081616 | 547fe280-067e-11e3-8751-97db6b0653ce |    98
 13081616 | 547f4640-067e-11e3-8751-97db6b0653ce |    97

To get the next 2 rows, take the last value from the created column and use it in the next query:

SELECT * FROM mystuff WHERE  yymmddhh = '13081616' 
AND created < 547f4640-067e-11e3-8751-97db6b0653ce 
ORDER BY created DESC LIMIT 2 ;
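The `created <` restriction pages backwards because a timeuuid carries its creation time, and Cassandra orders timeuuid values by that time. As a rough client-side illustration (the helper name `timeuuid_to_datetime` is made up for this sketch), Python's standard `uuid` module can read the embedded timestamp from the two values shown above:

```python
import uuid
from datetime import datetime, timedelta, timezone

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns units.
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def timeuuid_to_datetime(value: uuid.UUID) -> datetime:
    """Convert a version 1 (time-based) UUID to a UTC datetime."""
    unix_100ns = value.time - UUID_EPOCH_OFFSET
    unix_epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return unix_epoch + timedelta(microseconds=unix_100ns // 10)

newer = uuid.UUID("547fe280-067e-11e3-8751-97db6b0653ce")
older = uuid.UUID("547f4640-067e-11e3-8751-97db6b0653ce")

# The row that was returned last is the older one, so it becomes the paging cursor.
print(timeuuid_to_datetime(older) < timeuuid_to_datetime(newer))
```

This is roughly the value that `unixTimestampOf(uuid)` from the question returns on the server side, expressed as a datetime.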

If you receive fewer rows than expected, you should change your row key to another hour.

So far I've assumed that we know the row key with which we want to query the data. If you log a lot of information, I'd say that's not a problem: you can just take the current time and issue a query with the hour set to the current hour. If we run out of rows, we can subtract one hour and issue another query.
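The "subtract one hour and query again" loop could look roughly like the sketch below. It is only an illustration: `TABLE` and `fetch_newest` are hypothetical in-memory stand-ins for the mystuff table and the `SELECT ... ORDER BY created DESC LIMIT n` query, and the `max_hours` cap is an added assumption so the loop cannot run forever when there are fewer rows than requested.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for mystuff: row key -> values, oldest first.
TABLE = {
    "13081615": ["90", "91", "92", "93", "94"],
    "13081616": ["95", "96", "97", "98"],
}

def fetch_newest(bucket, limit):
    """Stand-in for: SELECT ... WHERE yymmddhh = ? ORDER BY created DESC LIMIT ?"""
    return list(reversed(TABLE.get(bucket, [])))[:limit]

def most_recent(start, wanted, max_hours=24):
    """Collect up to `wanted` rows, newest first, stepping back one hour at a time."""
    rows = []
    moment = start
    for _ in range(max_hours):
        if len(rows) >= wanted:
            break
        bucket = moment.strftime("%y%m%d%H")
        rows.extend(fetch_newest(bucket, wanted - len(rows)))
        moment -= timedelta(hours=1)
    return rows

print(most_recent(datetime(2013, 8, 16, 16, 30), 6))
```

With real data each call would be one CQL query per hour, so sparse logs mean many empty round trips; that is the case the metadata table is meant to handle.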

However, if you don't know where your data lies, or if it's not distributed evenly, you can create a metadata table in which you store information about the row keys:

CREATE TABLE mystuff_metadata(
  yyyy varchar, 
  yymmddhh varchar, 
  PRIMARY KEY(yyyy, yymmddhh)
) WITH COMPACT STORAGE;

The row keys would be organized by year, so to get the latest row key from the current year you'd have to issue a query:

SELECT yymmddhh 
FROM  mystuff_metadata where yyyy = '2013' 
ORDER BY yymmddhh DESC LIMIT 1;

Your audit software would have to add an entry to that table on start and then on each hour change (for example, before inserting data into mystuff).
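One way the audit software might track the hour change is sketched here. The class `BucketRegistrar` and its `record_bucket` callback are hypothetical; in a real client the callback would run an `INSERT INTO mystuff_metadata(yyyy, yymmddhh)` statement, which is not shown.

```python
from datetime import datetime

class BucketRegistrar:
    """Registers a row key in the metadata table the first time it is seen."""

    def __init__(self, record_bucket):
        self._record_bucket = record_bucket  # callable(yyyy, yymmddhh)
        self._last_bucket = None             # None on start, so the first call registers

    def before_insert(self, moment):
        """Call before each insert into mystuff; returns the row key to use."""
        bucket = moment.strftime("%y%m%d%H")
        if bucket != self._last_bucket:
            self._record_bucket(moment.strftime("%Y"), bucket)
            self._last_bucket = bucket
        return bucket

recorded = []
registrar = BucketRegistrar(lambda yyyy, bucket: recorded.append((yyyy, bucket)))
registrar.before_insert(datetime(2013, 8, 16, 15, 58))
registrar.before_insert(datetime(2013, 8, 16, 15, 59))
registrar.before_insert(datetime(2013, 8, 16, 16, 0))
print(recorded)
```

Registering the same row key again after a restart is harmless, since writing an existing primary key in Cassandra simply overwrites the row.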
