Cassandra sorting the results by non-clustering key


Problem description

Our use case with Cassandra is to show the 10 most recent visitors of a blog post. The Cassandra table definition is as follows:

CREATE TABLE blogs_by_visitor (
    blogposturl text,
    visitor text,
    visited_ts timestamp,
    PRIMARY KEY (blogposturl, visitor)
);

To show the 10 most recent visitors for a given blog post, the query needs an explicit ORDER BY on the timestamp, descending. Because visited_ts is not a clustering column, Cassandra does not allow this. We kept visited_ts out of the clustering key to avoid recording repeat (read: duplicate) visitors. The primary key is designed so that a repeat visit upserts the latest timestamp for that visitor.

In the RDBMS world, the query would look like the following, and a secondary index could be created on the blogposturl and timestamp columns.

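-- Oracle-style: sort in the subquery first, because ROWNUM is applied before ORDER BY
SELECT visitor FROM (
    SELECT visitor FROM blog_table
    WHERE blogposturl = ?
    ORDER BY timestamp DESC
) WHERE ROWNUM <= 10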

The workaround our Cassandra application currently uses is to fetch the results and then sort them by timestamp on the application side. But what if a particular blog post becomes popular and has more than 100,000 visitors? The query becomes really slow for those posts.
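For reference, the application-side step described above can be sketched roughly as follows. This is a minimal illustration, not our actual code: the function name top_recent_visitors and the sample rows are made up for the example, and the rows are assumed to arrive as (visitor, visited_ts) pairs. Using heapq.nlargest avoids sorting the full result, but every row of the partition still has to be read from Cassandra first, which is the real cost for popular posts.

```python
import heapq
from datetime import datetime, timedelta


def top_recent_visitors(rows, n=10):
    """Return the n most recently seen visitors from (visitor, visited_ts) rows."""
    # nlargest keeps only n candidates in memory instead of sorting everything,
    # and returns them ordered from newest to oldest.
    newest = heapq.nlargest(n, rows, key=lambda row: row[1])
    return [visitor for visitor, _ in newest]


# Hypothetical sample data: 100,000 visitors, one visit per second.
start = datetime(2024, 1, 1)
rows = [(f"user{i}", start + timedelta(seconds=i)) for i in range(100_000)]

top10 = top_recent_visitors(rows)
print(top10)
```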

I don't think a secondary index would be useful here, since I don't need to filter on visited_ts. I only need to sort by it, which a secondary index can't do.

Any ideas on how we could model the table differently?

The actual table has additional columns; they are omitted here for simplicity.

Recommended answer

Jobs of this type are usually handled by Apache Spark or Hadoop: a scheduled job computes the unique visitors ordered by timestamp for each URL and stores the result in Cassandra.

Or you can create a materialized view on top of blogs_by_visitor. The base table guarantees unique visitors, and the materialized view orders the results by the visited_ts timestamp.

Create the materialized view:

CREATE MATERIALIZED VIEW unique_visitor AS
    SELECT *
    FROM blogs_by_visitor
    WHERE blogposturl IS NOT NULL AND visitor IS NOT NULL AND visited_ts IS NOT NULL
    PRIMARY KEY (blogposturl, visited_ts, visitor)
    WITH CLUSTERING ORDER BY (visited_ts DESC, visitor ASC);

Now you can simply select the 10 most recent unique visitors of a blog post:

SELECT * FROM unique_visitor WHERE blogposturl = ? LIMIT 10;

Note that I haven't specified a sort order in the SELECT query, because the materialized view's schema already specifies the default sort order, visited_ts DESC.

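Note: the above schema will cause a large number of unexpected tombstones to be generated in the materialized view. Every repeat visit changes visited_ts, which is part of the view's primary key, so the old view row has to be deleted and a new one written.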
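To make the tombstone concern concrete, here is a toy model in Python. It is only an illustration of the bookkeeping, not how Cassandra implements views internally; the class ToyView and its method names are invented for this sketch. The base table is keyed by (blogposturl, visitor), the view by (blogposturl, visited_ts, visitor), and each repeat visit with a new timestamp removes the old view row and counts one tombstone.

```python
from datetime import datetime, timedelta


class ToyView:
    """Toy model of a base table plus a view keyed by timestamp (hypothetical)."""

    def __init__(self):
        self.base = {}       # (blogposturl, visitor) -> visited_ts
        self.view = set()    # (blogposturl, visited_ts, visitor)
        self.tombstones = 0  # view rows deleted because their key changed

    def record_visit(self, url, visitor, ts):
        old_ts = self.base.get((url, visitor))
        if old_ts is not None and old_ts != ts:
            # visited_ts is part of the view key, so the old view row is deleted.
            self.view.discard((url, old_ts, visitor))
            self.tombstones += 1
        self.base[(url, visitor)] = ts
        self.view.add((url, ts, visitor))

    def latest(self, url, n=10):
        rows = sorted((r for r in self.view if r[0] == url),
                      key=lambda r: r[1], reverse=True)
        return [visitor for _, _, visitor in rows[:n]]


t0 = datetime(2024, 1, 1)
toy = ToyView()
toy.record_visit("/post", "alice", t0)
toy.record_visit("/post", "bob", t0 + timedelta(minutes=1))
toy.record_visit("/post", "alice", t0 + timedelta(minutes=2))
toy.record_visit("/post", "alice", t0 + timedelta(minutes=3))
print(toy.latest("/post"), toy.tombstones)
```

In this model the tombstone count grows with the number of repeat visits rather than with the number of unique visitors, so a post with many returning readers is the worst case.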

Or you could change your table schema as shown below:

CREATE TABLE blogs_by_visitor (
     blogposturl text,
     year int,
     month int,
     day int,
     visitor text,
     visited_ts timestamp,
     PRIMARY KEY ((blogposturl, year, month, day), visitor)
);

Now each partition holds only a small amount of data, so you can sort all the visitors in a single partition by visited_ts on the client side. If you think the number of visitors in a day could be huge, add the hour to the partition key as well.
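A rough sketch of the client-side logic for the day-bucketed schema follows. Everything here is an assumption for illustration: the helper names (day_bucket, recent_buckets, top_recent), the max_days cutoff, and the in-memory dict standing in for the result of one query per partition. One consequence of this schema is that the same visitor can appear in several day partitions, so the client has to deduplicate while walking the buckets from newest to oldest.

```python
from datetime import datetime, timedelta


def day_bucket(url, ts):
    """Partition key (blogposturl, year, month, day) for a timestamp."""
    return (url, ts.year, ts.month, ts.day)


def recent_buckets(url, now, days):
    """Partition keys for today and the previous days, newest first."""
    return [day_bucket(url, now - timedelta(days=d)) for d in range(days)]


def top_recent(partitions, url, now, n=10, max_days=7):
    """partitions maps a bucket to {visitor: visited_ts}; one lookup per bucket."""
    seen = {}
    for bucket in recent_buckets(url, now, max_days):
        for visitor, ts in partitions.get(bucket, {}).items():
            # Buckets are walked newest first, so the first sighting is the latest visit.
            seen.setdefault(visitor, ts)
        if len(seen) >= n:
            break  # older buckets can only contain older visits
    ranked = sorted(seen.items(), key=lambda kv: kv[1], reverse=True)
    return [visitor for visitor, _ in ranked[:n]]


# Hypothetical data for two day partitions of one post.
url = "/post"
now = datetime(2024, 3, 10, 12, 0)
partitions = {
    day_bucket(url, now): {
        "alice": now - timedelta(hours=1),
        "bob": now - timedelta(hours=2),
    },
    day_bucket(url, now - timedelta(days=1)): {
        "alice": now - timedelta(days=1, hours=3),
        "carol": now - timedelta(days=1),
    },
}
result = top_recent(partitions, url, now, n=3)
print(result)
```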
