Cassandra: List 10 most recently modified records


Question

I'm having trouble modeling my data so that I can efficiently query Cassandra for the 10 most recently modified records (or any number, really). Each record has a last_modified_date column that the application sets when it inserts or updates the record.

I've excluded the data columns from this example code.

Main data table (contains only one row per record):

CREATE TABLE record (
    record_id int,
    last_modified_by text,
    last_modified_date timestamp,
    PRIMARY KEY (record_id)
);

Solution 1 (Fail)

I tried to create a separate table, which used a clustering key order.

Table (one row for each record; only inserting the last modified date):

CREATE TABLE record_by_last_modified_index (
    record_id int,
    last_modified_by text,
    last_modified_date timestamp,
    PRIMARY KEY (record_id, last_modified_date)
) WITH CLUSTERING ORDER BY (last_modified_date DESC);

Query:

SELECT * FROM record_by_last_modified_index LIMIT 10;

This solution does not work, since the clustering order only applies to the ordering of rows that share the same partition key. Each row here has a different partition key (record_id), so the query returns rows in partition order rather than the 10 most recently modified records.
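
To illustrate the point, here is a small sketch against the table above. The record_id value 42 is made up, and token() is included only to show the order in which partitions come back:

```cql
-- Ordered by last_modified_date DESC, but only within the single
-- partition record_id = 42 (a hypothetical id).
SELECT * FROM record_by_last_modified_index WHERE record_id = 42 LIMIT 10;

-- No partition key given: rows are returned in token order of record_id,
-- so the first 10 rows are not the 10 newest overall.
SELECT record_id, token(record_id), last_modified_date
FROM record_by_last_modified_index LIMIT 10;
```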

Solution 2 (Inefficient)

Another solution I have tried is to simply query Cassandra for all record_id and last_modified_date values, sort them and pick the first 10 records in my application. This is clearly inefficient and won't scale well.

Solution 3

One last solution I considered is to use the same partition key for all records and rely on the clustering order to keep them sorted. The problem with that solution is that the data would not be distributed across the nodes, since every record would live in the same partition. That seems like a non-starter to me.

Answer

I think what you're trying to do is more of a relational database model and is somewhat of an anti-pattern in Cassandra.

Cassandra only sorts things based on clustering columns, but the sort order isn't expected to change. This is because when memtables are written to disk as SSTables (Sorted String Tables), the SSTables are immutable and can't be re-sorted efficiently. This is why you aren't allowed to update the value of a clustering column.

If you want to re-sort the clustered rows, the only way I know is to delete the old row and insert a new one in a batch. To make that even more inefficient, you would probably need to first do a read to figure out what the last_modified_date was for the record_id so that you could delete it.
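
A rough sketch of that read, delete and insert sequence might look like the following. The table records_by_modified is hypothetical (it is not part of the question's schema), and the timestamps are purely illustrative:

```cql
-- Hypothetical index table, clustered by modification time within a category.
CREATE TABLE records_by_modified (
    record_type int,
    last_modified_date timestamp,
    record_id int,
    PRIMARY KEY (record_type, last_modified_date, record_id)
) WITH CLUSTERING ORDER BY (last_modified_date DESC, record_id ASC);

-- 1. Read the current timestamp from the main table so the old row can be addressed.
SELECT last_modified_date FROM record WHERE record_id = 200;

-- 2. Remove the old clustered row and write the new one in a single batch.
BEGIN BATCH
    DELETE FROM records_by_modified
    WHERE record_type = 1
      AND last_modified_date = '2015-08-14 19:41:25+0000'
      AND record_id = 200;
    INSERT INTO records_by_modified (record_type, last_modified_date, record_id)
    VALUES (1, '2015-08-14 19:43:13+0000', 200);
APPLY BATCH;
```

Both statements in the batch touch the same partition (record_type = 1), but the extra read before every write is what makes this approach unattractive.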

So I'd look for a different approach, such as just writing the updates as new clustered rows and leaving the old ones there (possibly cleaning them up over time using a TTL). That way your newest updates would always be on top when you run a LIMIT query.
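
As a minimal sketch of that append-only idea, assuming a hypothetical table record_updates and an arbitrary 30-day TTL:

```cql
-- Hypothetical table: every update becomes a new row; old rows expire via TTL.
CREATE TABLE record_updates (
    record_type int,
    last_modified_date timestamp,
    record_id int,
    PRIMARY KEY (record_type, last_modified_date, record_id)
) WITH CLUSTERING ORDER BY (last_modified_date DESC, record_id ASC)
  AND default_time_to_live = 2592000;  -- 30 days, an assumed retention period

-- Each modification is a plain insert; nothing is read or deleted first.
INSERT INTO record_updates (record_type, last_modified_date, record_id)
VALUES (1, toTimestamp(now()), 200);

-- Newest rows come first within the partition.
SELECT record_id, last_modified_date
FROM record_updates WHERE record_type = 1 LIMIT 10;
```

Note that the same record_id can show up more than once in the result if it was modified several times, so the application would need to de-duplicate (and possibly fetch more than 10 rows to end up with 10 distinct records).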

In terms of partitioning, you will need to break your data into a few categories to spread the data over your nodes. That means you won't get global sorting of your table, but only within categories, which is due to the distributed model. If you really need global sorting, then perhaps look at something like pairing Cassandra with Spark. Sorting is super expensive in time and resources, so think carefully if you really need it.
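
One way this could look, as a sketch only: the bucket column and the choice of four buckets are assumptions, with the application computing something like record_id % 4 on every write.

```cql
-- Hypothetical table: "bucket" spreads the rows over four partitions.
CREATE TABLE records_by_bucket (
    bucket int,
    last_modified_date timestamp,
    record_id int,
    PRIMARY KEY (bucket, last_modified_date, record_id)
) WITH CLUSTERING ORDER BY (last_modified_date DESC, record_id ASC);

-- One query per bucket; each returns that bucket's 10 newest rows.
SELECT record_id, last_modified_date FROM records_by_bucket WHERE bucket = 0 LIMIT 10;
SELECT record_id, last_modified_date FROM records_by_bucket WHERE bucket = 1 LIMIT 10;
SELECT record_id, last_modified_date FROM records_by_bucket WHERE bucket = 2 LIMIT 10;
SELECT record_id, last_modified_date FROM records_by_bucket WHERE bucket = 3 LIMIT 10;
```

The application would then merge the (at most) 40 rows by last_modified_date and keep the 10 newest. Each bucket is sorted, but the overall order only exists after that client-side merge.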

Update:

Thinking about this some more, you should be able to do this in Cassandra 3.0 using materialized views. The view would take care of the messy delete and insert for you, to re-order the clustered rows. So here's what it looks like in the 3.0 alpha release:

First create the base table:

CREATE TABLE record_ids (
    record_type int,
    last_modified_date timestamp,
    record_id int,
    PRIMARY KEY(record_type, record_id));

Then create a view of that table, using last_modified_date as a clustering column:

CREATE MATERIALIZED VIEW last_modified AS
    SELECT record_type FROM record_ids
    WHERE record_type IS NOT NULL AND last_modified_date IS NOT NULL AND record_id IS NOT NULL
    PRIMARY KEY (record_type, last_modified_date, record_id)
    WITH CLUSTERING ORDER BY (last_modified_date DESC);

Now insert some records:

insert into record_ids (record_type, last_modified_date, record_id) VALUES ( 1, dateof(now()), 100);
insert into record_ids (record_type, last_modified_date, record_id) VALUES ( 1, dateof(now()), 200);
insert into record_ids (record_type, last_modified_date, record_id) VALUES ( 1, dateof(now()), 300);

SELECT * FROM record_ids;

 record_type | record_id | last_modified_date
-------------+-----------+--------------------------
           1 |       100 | 2015-08-14 19:41:10+0000
           1 |       200 | 2015-08-14 19:41:25+0000
           1 |       300 | 2015-08-14 19:41:41+0000

SELECT * FROM last_modified;

 record_type | last_modified_date       | record_id
-------------+--------------------------+-----------
           1 | 2015-08-14 19:41:41+0000 |       300
           1 | 2015-08-14 19:41:25+0000 |       200
           1 | 2015-08-14 19:41:10+0000 |       100

Now we update a record in the base table, and should see it move to the top of the list in the view:

UPDATE record_ids SET last_modified_date = dateof(now()) 
WHERE record_type=1 AND record_id=200;

So in the base table, we see the timestamp for record_id=200 was updated:

SELECT * FROM record_ids;

 record_type | record_id | last_modified_date
-------------+-----------+--------------------------
           1 |       100 | 2015-08-14 19:41:10+0000
           1 |       200 | 2015-08-14 19:43:13+0000
           1 |       300 | 2015-08-14 19:41:41+0000

And in the view, we see:

SELECT * FROM last_modified;

 record_type | last_modified_date       | record_id
-------------+--------------------------+-----------
           1 | 2015-08-14 19:43:13+0000 |       200
           1 | 2015-08-14 19:41:41+0000 |       300
           1 | 2015-08-14 19:41:10+0000 |       100

So you can see that record_id=200 moved up in the view, and if you run a LIMIT N query against that view for a given record_type, you get the N most recently modified rows in that partition.
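
For example, a query along these lines (with record_type = 1 taken from the sample data above, and 10 standing in for N) should return the newest rows first:

```cql
-- The view is clustered by last_modified_date DESC within each record_type,
-- so LIMIT keeps the most recently modified rows of that partition.
SELECT record_id, last_modified_date
FROM last_modified
WHERE record_type = 1
LIMIT 10;
```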
