Cassandra: List 10 most recently modified records


Question

I'm having trouble modeling my data so that I can efficiently query Cassandra for the 10 (or any other number of) most recently modified records. Each record has a last_modified_date column that the application sets when inserting or updating the record.

I've excluded the data columns from this example code.

Main data table (contains only one row per record):

CREATE TABLE record (
    record_id int,
    last_modified_by text,
    last_modified_date timestamp,
    PRIMARY KEY (record_id)
);

Solution 1 (Fail)

I tried creating a separate table that uses a clustering order.

Table (one row for each record; only inserting the last modified date):

CREATE TABLE record_by_last_modified_index (
    record_id int,
    last_modified_by text,
    last_modified_date timestamp,
    PRIMARY KEY (record_id, last_modified_date)
) WITH CLUSTERING ORDER BY (last_modified_date DESC);

Query:

SELECT * FROM record_by_last_modified_index LIMIT 10;

This solution does not work because the clustering order only applies to the ordering of rows that share the same partition key. Since each row has a different partition key (record_id), the query results don't include the expected records.
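As a rough way to see why the LIMIT query returns unexpected rows, the sketch below selects the partition token next to each row. It runs against the record_by_last_modified_index table defined above and assumes the cluster uses a hash-based partitioner such as the default one, so rows come back in token order rather than in date order.

```cql
-- Rows are returned in order of token(record_id), not last_modified_date.
-- The DESC clustering order only sorts rows inside a single record_id partition.
SELECT token(record_id), record_id, last_modified_date
FROM record_by_last_modified_index
LIMIT 10;
```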

Another solution I have tried is to simply query Cassandra for all record_id and last_modified_date values, sort them and pick the first 10 records in my application. This is clearly inefficient and won't scale well.

One last solution I considered is to use the same partition key for all records and rely on clustering order to keep the records sorted correctly. The problem with that solution is that the data will not be distributed across the nodes, since all of the records would live in a single partition. That seems like a non-starter to me.

Answer

I think what you're trying to do is more of a relational database model and is somewhat of an anti-pattern in Cassandra.

Cassandra only sorts things based on clustering columns, but the sort order isn't expected to change. This is because when memtables are written to disk as SSTables (Sorted String Tables), the SSTables are immutable and can't be re-sorted efficiently. This is why you aren't allowed to update the value of a clustering column.

If you want to re-sort the clustered rows, the only way I know is to delete the old row and insert a new one in a batch. To make that even more inefficient, you would probably need to first do a read to figure out what the last_modified_date was for the record_id so that you could delete it.
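A minimal sketch of that read-then-replace sequence, using the record_by_last_modified_index table from the question. The record_id value 42, the user name 'alice', and the two timestamps are made-up illustration values, and the old timestamp in the DELETE has to be whatever the preceding SELECT returned.

```cql
-- Step 1: read the current clustering value so the old row can be addressed.
SELECT last_modified_date
FROM record_by_last_modified_index
WHERE record_id = 42;

-- Step 2: remove the old row and write the new one in a single batch.
BEGIN BATCH
  DELETE FROM record_by_last_modified_index
    WHERE record_id = 42 AND last_modified_date = '2015-08-14 19:41:10+0000';
  INSERT INTO record_by_last_modified_index (record_id, last_modified_by, last_modified_date)
    VALUES (42, 'alice', '2015-08-14 19:43:13+0000');
APPLY BATCH;
```

Note that the read and the batch are separate operations, so two concurrent updates to the same record could each leave a row behind.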

So I'd look for a different approach, such as writing each update as a new clustered row and leaving the old ones in place (possibly cleaning them up over time using a TTL). That way your newest updates would always be on top when you did a LIMIT query.
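One possible shape for such an append-only table is sketched below. The table name record_changes, the partitioning column bucket, and the 30-day TTL are all assumptions made for illustration, not part of the original question. toTimestamp(now()) is available from Cassandra 2.2 on.

```cql
-- Hypothetical append-only change log: every update adds a row, nothing is rewritten.
CREATE TABLE record_changes (
    bucket int,                     -- hypothetical partitioning column
    last_modified_date timestamp,
    record_id int,
    last_modified_by text,
    PRIMARY KEY (bucket, last_modified_date, record_id)
) WITH CLUSTERING ORDER BY (last_modified_date DESC, record_id ASC)
  AND default_time_to_live = 2592000;  -- 30 days, an arbitrary choice

-- Each modification is just another insert.
INSERT INTO record_changes (bucket, last_modified_date, record_id, last_modified_by)
VALUES (1, toTimestamp(now()), 200, 'alice');

-- Newest rows come first within the partition.
SELECT record_id, last_modified_date
FROM record_changes
WHERE bucket = 1
LIMIT 10;
```

Because old rows stay around until they expire, a record that was modified several times can appear more than once in the result, so the application would need to drop the duplicates and may have to read more than 10 rows to end up with 10 distinct records.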

In terms of partitioning, you will need to break your data into a few categories to spread the data over your nodes. That means you won't get global sorting of your table, but only within categories, which is due to the distributed model. If you really need global sorting, then perhaps look at something like pairing Cassandra with Spark. Sorting is super expensive in time and resources, so think carefully if you really need it.

Update:

Thinking about this some more, you should be able to do this in Cassandra 3.0 using materialized views. The view would take care of the messy delete and insert for you, to re-order the clustered rows. So here's what it looks like in the 3.0 alpha release:

First create the base table:

CREATE TABLE record_ids (
    record_type int,
    last_modified_date timestamp,
    record_id int,
    PRIMARY KEY(record_type, record_id));

Then create a view of that table, using last_modified_date as a clustering column:

CREATE MATERIALIZED VIEW last_modified AS
    SELECT record_type FROM record_ids
    WHERE record_type IS NOT NULL AND last_modified_date IS NOT NULL AND record_id IS NOT NULL
    PRIMARY KEY (record_type, last_modified_date, record_id)
    WITH CLUSTERING ORDER BY (last_modified_date DESC);

Now insert some records:

insert into record_ids (record_type, last_modified_date, record_id) VALUES ( 1, dateof(now()), 100);
insert into record_ids (record_type, last_modified_date, record_id) VALUES ( 1, dateof(now()), 200);
insert into record_ids (record_type, last_modified_date, record_id) VALUES ( 1, dateof(now()), 300);

SELECT * FROM record_ids;

 record_type | record_id | last_modified_date
-------------+-----------+--------------------------
           1 |       100 | 2015-08-14 19:41:10+0000
           1 |       200 | 2015-08-14 19:41:25+0000
           1 |       300 | 2015-08-14 19:41:41+0000

SELECT * FROM last_modified;

 record_type | last_modified_date       | record_id
-------------+--------------------------+-----------
           1 | 2015-08-14 19:41:41+0000 |       300
           1 | 2015-08-14 19:41:25+0000 |       200
           1 | 2015-08-14 19:41:10+0000 |       100

Now we update a record in the base table, and should see it move to the top of the list in the view:

UPDATE record_ids SET last_modified_date = dateof(now()) 
WHERE record_type=1 AND record_id=200;

So in the base table, we see the timestamp for record_id=200 was updated:

SELECT * FROM record_ids;

 record_type | record_id | last_modified_date
-------------+-----------+--------------------------
           1 |       100 | 2015-08-14 19:41:10+0000
           1 |       200 | 2015-08-14 19:43:13+0000
           1 |       300 | 2015-08-14 19:41:41+0000

And in the view, we see:

SELECT * FROM last_modified;

 record_type | last_modified_date       | record_id
-------------+--------------------------+-----------
           1 | 2015-08-14 19:43:13+0000 |       200
           1 | 2015-08-14 19:41:41+0000 |       300
           1 | 2015-08-14 19:41:10+0000 |       100

So you can see that record_id=200 moved up in the view, and if you run a LIMIT N query on that view for a given record_type, you get the N most recently modified rows in that partition.
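For the example data above, such a query might look like the following sketch. The value record_type = 1 is the single partition used in the example, and 10 stands in for N.

```cql
-- The 10 most recently modified records within the record_type = 1 partition.
SELECT record_id, last_modified_date
FROM last_modified
WHERE record_type = 1
LIMIT 10;
```

Restricting the query to one record_type matters: the view's descending order on last_modified_date holds inside each partition, not across partitions.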
