Cassandra: selecting first entry for each value of an indexed column


Question

I have a table of events and would like to extract the first timestamp (column unixtime) for each user. Is there a way to do this with a single Cassandra query?

The schema is as follows:

CREATE TABLE events (
 id VARCHAR,
 unixtime bigint,
 u bigint,
 type VARCHAR,
 payload map<text, text>, 
 PRIMARY KEY(id)
);

CREATE INDEX events_u
  ON events (u);

CREATE INDEX events_unixtime
  ON events (unixtime);

CREATE INDEX events_type
  ON events (type);


Answer

According to your schema, each user will have a single timestamp. If you want one event per entry, consider:

PRIMARY KEY (id, unixtime).

Assuming that is your schema, the entries for a user will be stored in ascending unixtime order. Be careful, though: if it's an unbounded event stream and users have lots of events, the partition for the id will grow and grow. It's recommended to keep partition sizes to tens or hundreds of megabytes. If you anticipate larger, you'll need to start some form of bucketing.
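As a sketch of what that looks like (assuming the id column identifies the user, which the original schema does not make explicit), the revised table stores each user's events in one partition sorted by unixtime, so the earliest event is simply the first row of the partition:

    -- Sketch: id as partition key, unixtime as clustering column.
    -- Rows within a partition are stored in ascending unixtime order.
    CREATE TABLE events (
     id VARCHAR,
     unixtime bigint,
     u bigint,
     type VARCHAR,
     payload map<text, text>,
     PRIMARY KEY (id, unixtime)
    );

    -- First (earliest) event for one user: a single-partition read.
    SELECT unixtime, type
      FROM events
     WHERE id = 'some-user-id'   -- placeholder value
     LIMIT 1;

Because the clustering order is ascending on unixtime, LIMIT 1 returns the earliest timestamp without scanning the rest of the partition.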

Now, on to your query. In a word: no. If you don't hit a partition (by specifying the partition key), your query becomes a cluster-wide operation. With little data it'll work, but with lots of data you'll get timeouts. If you do have the data in its current form, then I recommend you use the Cassandra Spark connector and Apache Spark to do your query. An added benefit of the Spark connector is that if you have Cassandra nodes as Spark worker nodes, then due to locality you can efficiently hit a secondary index without specifying the partition key (which would normally cause a cluster-wide query with timeout issues, etc.). You could even use Spark to get the required data and store it into another Cassandra table for fast querying.
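A minimal sketch of that Spark approach, in Scala with the spark-cassandra-connector: it reads the events table, keeps the minimum unixtime per user (u), and writes the result to a second table for fast lookups. The keyspace name "ks", the connection host, and the target table first_events are assumptions for illustration; this requires a Spark cluster with the connector on the classpath and is not runnable standalone.

    // Sketch only. Assumes keyspace "ks" and a pre-created target table:
    //   CREATE TABLE ks.first_events (u bigint PRIMARY KEY, unixtime bigint);
    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    object FirstEventPerUser {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("first-event-per-user")
          .set("spark.cassandra.connection.host", "127.0.0.1") // assumption
        val sc = new SparkContext(conf)

        sc.cassandraTable("ks", "events")
          .map(row => (row.getLong("u"), row.getLong("unixtime")))
          .reduceByKey(math.min)   // earliest unixtime per user
          .saveToCassandra("ks", "first_events", SomeColumns("u", "unixtime"))

        sc.stop()
      }
    }

Because the work is distributed across the workers, the full-table scan that would time out as a single CQL query is spread over the cluster, and subsequent reads hit the small first_events table by partition key.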
