Select 2000 most recent log entries in a Cassandra table using CQL (Latest version)

Problem description

How do you query and filter by timeuuid, i.e. assuming you have a table like this:

create table mystuff(uuid timeuuid primary key, stuff text);

That is, how do you do this:

select uuid, unixTimestampOf(uuid), stuff
from mystuff
order by uuid desc
limit 2000

I also want to be able to fetch the next older 2000 and so on, but that's a different problem. The error is:

Bad Request: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.

And just in case it matters, the real table is actually this:

CREATE TABLE audit_event (
  uuid timeuuid PRIMARY KEY,
  event_time bigint,
  ip text,
  level text,
  message text,
  person_uuid timeuuid
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};

Recommended answer

I would recommend that you design your table a bit differently. It would be rather hard to achieve what you're asking for with your current design.

At the moment each entry in the audit_event table gets its own uuid as the row key, so internally Cassandra will create many short rows. Querying such rows is inefficient, and additionally they are ordered randomly (unless you use the Byte Ordered Partitioner, which you should avoid for good reasons).

However, Cassandra is pretty good at sorting columns. If (back to your example) you declared your table like this:

CREATE TABLE mystuff(
  yymmddhh varchar, 
  created timeuuid,  
  stuff text, 
  PRIMARY KEY(yymmddhh, created)
);

Internally, Cassandra would create one row per hour, where the row key is the hour of a day, the column names are the actual created timestamps, and the data is the stuff. That makes it efficient to query.
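As a rough illustration of how a client could derive that row key, here is a minimal Python sketch. The helper name bucket_key and the choice of UTC are assumptions made for this example; the answer does not say which time zone the yymmddhh value uses.

```python
from datetime import datetime, timezone

def bucket_key(ts: datetime) -> str:
    """Return the hourly row key (yymmddhh) for an aware timestamp.

    Assumption: buckets are computed in UTC so every writer agrees on the key.
    """
    return ts.astimezone(timezone.utc).strftime("%y%m%d%H")

example = datetime(2013, 8, 16, 15, 30, tzinfo=timezone.utc)
print(bucket_key(example))  # 13081615
```

Every entry written during the same hour ends up under the same key, which is what lets Cassandra keep them sorted by created within one row.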

Consider that you have the following data (to keep it simple I won't go up to 2k records, but the idea is the same):

insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '90');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '91');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '92');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '93');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '94');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '95');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '96');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '97');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '98');

Now let's say that we want to select the last two entries (let's assume for the moment that we know the "latest" row key is '13081616'). You can do it by executing a query like this:

SELECT * FROM mystuff WHERE yymmddhh = '13081616' ORDER BY created DESC LIMIT 2 ;

which should give you something like this:

 yymmddhh | created                              | stuff
----------+--------------------------------------+-------
 13081616 | 547fe280-067e-11e3-8751-97db6b0653ce |    98
 13081616 | 547f4640-067e-11e3-8751-97db6b0653ce |    97
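The created values are version 1 (time-based) UUIDs, so the time of each entry can be recovered on the client as well, similar to what unixTimestampOf does in the question. The following is a minimal Python sketch using only the standard uuid module; the function name timeuuid_to_unix_ms is made up for this example.

```python
import uuid
from datetime import datetime, timezone

# Number of 100 ns ticks between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def timeuuid_to_unix_ms(value: str) -> int:
    """Return the Unix time in milliseconds embedded in a version 1 UUID."""
    u = uuid.UUID(value)
    if u.version != 1:
        raise ValueError("not a version 1 (time-based) UUID")
    # u.time is the 60-bit timestamp in 100 ns ticks since the UUID epoch.
    return (u.time - UUID_EPOCH_OFFSET) // 10_000

newer = timeuuid_to_unix_ms("547fe280-067e-11e3-8751-97db6b0653ce")
older = timeuuid_to_unix_ms("547f4640-067e-11e3-8751-97db6b0653ce")
print(datetime.fromtimestamp(older / 1000, tz=timezone.utc))
print(newer - older)  # milliseconds between the two rows
```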

To get the next 2 rows, you have to take the last value from the created column and use it in the next query:

SELECT * FROM mystuff WHERE  yymmddhh = '13081616' 
AND created < 547f4640-067e-11e3-8751-97db6b0653ce 
ORDER BY created DESC LIMIT 2 ;

If you receive fewer rows than expected, you should change your row key to another hour.

So far I've assumed that we know the row key with which we want to query the data. If you log a lot of information, I'd say that's not a problem: you can just take the current time and issue a query with the hour set to the current hour. If you run out of rows, you can subtract one hour and issue another query.
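The "start at the current hour and step back" loop could look roughly like the Python sketch below. It is only an illustration: fetch is a hypothetical callable standing in for the SELECT ... WHERE yymmddhh = ? ORDER BY created DESC LIMIT ? query, the in-memory make_fake_fetch replaces a real cluster, plain integers stand in for timeuuids, and max_buckets is an assumed safety limit so the loop stops on sparse data.

```python
from datetime import datetime, timedelta, timezone

def bucket_key(ts):
    # Assumption: hourly buckets are computed in UTC.
    return ts.astimezone(timezone.utc).strftime("%y%m%d%H")

def recent_rows(fetch, start, wanted, max_buckets=168):
    """Collect up to `wanted` newest rows, walking back one hour at a time."""
    rows = []
    hour = start
    for _ in range(max_buckets):
        if len(rows) >= wanted:
            break
        # Ask the current bucket only for the rows that are still missing.
        rows.extend(fetch(bucket_key(hour), wanted - len(rows)))
        hour -= timedelta(hours=1)
    return rows

def make_fake_fetch(data):
    """In-memory stand-in for the per-bucket query (newest first, limited)."""
    def fetch(bucket, limit):
        return sorted(data.get(bucket, []), reverse=True)[:limit]
    return fetch

data = {
    "13081615": [(1, "90"), (2, "91"), (3, "92"), (4, "93"), (5, "94")],
    "13081616": [(6, "95"), (7, "96"), (8, "97"), (9, "98")],
}
start = datetime(2013, 8, 16, 16, 45, tzinfo=timezone.utc)
latest = recent_rows(make_fake_fetch(data), start, wanted=6)
print([stuff for _, stuff in latest])  # ['98', '97', '96', '95', '94', '93']
```

To continue with the next older page, a real implementation would also pass the last created value seen into the query (the created < ... condition shown earlier) instead of always starting from the top of the bucket.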

However, if you don't know where your data lies, or if it's not distributed evenly, you can create a metadata table in which you store information about the row keys:

CREATE TABLE mystuff_metadata(
  yyyy varchar, 
  yymmddhh varchar, 
  PRIMARY KEY(yyyy, yymmddhh)
) WITH COMPACT STORAGE;

The row keys would be organized by year, so to get the latest row key for the current year you'd have to issue a query:

SELECT yymmddhh 
FROM  mystuff_metadata where yyyy = '2013' 
ORDER BY yymmddhh DESC LIMIT 1;

Your audit software would have to make an entry in that table on start and then on each hour change (for example, before inserting data into mystuff).
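One way the audit software might track hour changes is sketched below in Python. The class name BucketRegistrar and the register callback are hypothetical; in a real setup the callback would run something like an INSERT into mystuff_metadata, which is not shown here. UTC is again an assumption.

```python
from datetime import datetime, timezone

class BucketRegistrar:
    """Registers each hourly row key once, before the first insert into it."""

    def __init__(self, register):
        # register(yyyy, yymmddhh) is a hypothetical callback that would
        # write one row to the metadata table.
        self._register = register
        self._last_key = None  # None on start, so the first insert registers

    def before_insert(self, ts: datetime) -> str:
        utc = ts.astimezone(timezone.utc)  # assumption: buckets are in UTC
        key = utc.strftime("%y%m%d%H")
        if key != self._last_key:
            self._register(utc.strftime("%Y"), key)
            self._last_key = key
        return key

registered = []
registrar = BucketRegistrar(lambda yyyy, key: registered.append((yyyy, key)))
registrar.before_insert(datetime(2013, 8, 16, 15, 10, tzinfo=timezone.utc))
registrar.before_insert(datetime(2013, 8, 16, 15, 50, tzinfo=timezone.utc))
registrar.before_insert(datetime(2013, 8, 16, 16, 5, tzinfo=timezone.utc))
print(registered)  # [('2013', '13081615'), ('2013', '13081616')]
```

Re-registering the same key after a restart is harmless, since writing the same primary key again simply overwrites the existing entry.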
