Cassandra sorting the results by non-clustering key

Views: 84 Published: 2020/9/29 21:00:02 Tags: cassandra datastax cassandra-3.0


Problem description

Our use case with Cassandra is to show the 10 most recent visitors of a blog post. The table definition is as follows:

CREATE TABLE blogs_by_visitor (
             blogposturl text,
             visitor text,
             visited_ts timestamp,
             PRIMARY KEY (blogposturl, visitor)
           );

To show the 10 most recent visitors for a given blog post, the query needs an explicit ORDER BY on the timestamp in descending order. Since visited_ts is not a clustering column, Cassandra does not allow this. visited_ts was left out of the clustering columns to avoid recording repeat (that is, duplicate) visitors: the primary key is designed so that a repeat visit upserts the latest timestamp for that visitor.

In the RDBMS world, the query would look like the following, and a secondary index could be created on the blogposturl and timestamp columns.

-- ROWNUM is assigned before ORDER BY runs, so sort in a subquery first
SELECT visitor FROM (
    SELECT visitor
    FROM blog_table
    WHERE blogposturl = ?
    ORDER BY timestamp DESC
)
WHERE ROWNUM <= 10;

The alternative we currently follow in our Cassandra application is to fetch all the results and then sort them by timestamp on the application side. But what if a particular blog post becomes so popular that it has more than 100,000 visitors? The query becomes really slow for those posts.
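
For reference, the application-side approach described here could look roughly like the sketch below. It assumes the rows for one blogposturl have already been fetched as (visitor, visited_ts) pairs; the function name top_recent_visitors and the sample data are made up for illustration. heapq.nlargest avoids a full sort, but every row of the partition still has to be read, which is why this gets slow for very popular posts.

```python
import heapq
from datetime import datetime, timedelta

def top_recent_visitors(rows, limit=10):
    """Return the `limit` most recently seen visitors, newest first.

    `rows` is any iterable of (visitor, visited_ts) pairs for one blog post.
    """
    newest = heapq.nlargest(limit, rows, key=lambda row: row[1])
    return [visitor for visitor, _ in newest]

# Made-up sample data: 100,000 visitors, one visit per second.
start = datetime(2020, 1, 1)
rows = [(f"user{i}", start + timedelta(seconds=i)) for i in range(100_000)]
print(top_recent_visitors(rows)[:3])  # ['user99999', 'user99998', 'user99997']
```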

I don't think a secondary index would be useful here, since I don't need to filter on visited_ts; I only need to sort on it, which a secondary index does not make possible.

Any ideas on how we could model the table differently?

The actual table has additional columns; they are omitted here for simplicity.

Recommended answer

These types of jobs are typically done with Apache Spark or Hadoop: a scheduled job computes the unique visitors ordered by timestamp for each URL and stores the result back into Cassandra.
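
As a rough illustration of what such a scheduled job would compute, here is a plain-Python sketch (not Spark or Hadoop code). It assumes the job receives (blogposturl, visitor, visited_ts) tuples; the function name top_visitors_per_url and the sample rows are hypothetical, and writing the result back to Cassandra is left out.

```python
from collections import defaultdict
from datetime import datetime

def top_visitors_per_url(rows, limit=10):
    """For each URL, keep each visitor's latest visit and return the newest `limit`."""
    latest = defaultdict(dict)  # url -> {visitor: latest visited_ts}
    for url, visitor, ts in rows:
        seen = latest[url]
        if visitor not in seen or ts > seen[visitor]:
            seen[visitor] = ts
    return {
        url: sorted(seen.items(), key=lambda item: item[1], reverse=True)[:limit]
        for url, seen in latest.items()
    }

# Made-up sample rows: alice visits /a twice, only her latest visit counts.
rows = [
    ("/a", "alice", datetime(2020, 9, 1, 10)),
    ("/a", "bob", datetime(2020, 9, 1, 11)),
    ("/a", "alice", datetime(2020, 9, 1, 12)),
    ("/b", "carol", datetime(2020, 9, 1, 9)),
]
top = top_visitors_per_url(rows)
print([visitor for visitor, _ in top["/a"]])  # ['alice', 'bob']
```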

Or you can create a materialized view on top of blogs_by_visitor. The base table ensures unique visitors, and the materialized view orders the results by the visited_ts timestamp.

Create the materialized view:
CREATE MATERIALIZED VIEW unique_visitor AS
    SELECT *
    FROM blogs_by_visitor
    WHERE blogposturl IS NOT NULL AND visitor IS NOT NULL AND visited_ts IS NOT NULL
    PRIMARY KEY (blogposturl, visited_ts, visitor)
    WITH CLUSTERING ORDER BY (visited_ts DESC, visitor ASC);

Now you can simply select the 10 most recent unique visitors of a blog post:
SELECT * FROM unique_visitor WHERE blogposturl = ? LIMIT 10;

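Note that I haven't specified a sort order in the SELECT query, because the materialized view schema already defines the default clustering order as visited_ts DESC.

Please note: the schema above will generate a large number of unexpected tombstones in the materialized view. Each repeat visit changes visited_ts, which is part of the view's primary key, so the old view row has to be deleted before the new one is written.

Or you could change your table schema as shown below: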
CREATE TABLE blogs_by_visitor (
     blogposturl text,
     year int,
     month int,
     day int,
     visitor text,
     visited_ts timestamp,
     PRIMARY KEY ((blogposturl, year, month, day), visitor)
);

Now each partition holds only a small amount of data, so you can sort all the visitors in a single partition by visited_ts on the client side. If you think the number of visitors in a day could be huge, add the hour to the partition key as well.
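
A minimal sketch of how a client might read such day buckets is shown below. An in-memory dict stands in for Cassandra, and the names bucket_key, record, recent_visitors and the max_days cut-off are all hypothetical. Note one consequence of this schema: a visitor who returns on a different day lands in more than one partition, so the client has to de-duplicate while walking the buckets backwards.

```python
from datetime import date, datetime, timedelta

def bucket_key(blogposturl, ts):
    """Partition key (blogposturl, year, month, day) for a visit timestamp."""
    return (blogposturl, ts.year, ts.month, ts.day)

def record(partitions, blogposturl, visitor, ts):
    """Upsert a visit: one row per visitor within each day partition."""
    partitions.setdefault(bucket_key(blogposturl, ts), {})[visitor] = ts

def recent_visitors(partitions, blogposturl, today, limit=10, max_days=30):
    """Walk day buckets backwards from `today` until `limit` unique visitors are found."""
    result, seen = [], set()
    for offset in range(max_days):
        day = today - timedelta(days=offset)
        rows = partitions.get(bucket_key(blogposturl, day), {})
        # Sort only this one small partition, newest first.
        for visitor, _ in sorted(rows.items(), key=lambda r: r[1], reverse=True):
            if visitor not in seen:
                seen.add(visitor)
                result.append(visitor)
                if len(result) == limit:
                    return result
    return result

# Made-up sample data: alice visits on two different days.
partitions = {}
record(partitions, "/post", "alice", datetime(2020, 9, 28, 10, 0))
record(partitions, "/post", "carol", datetime(2020, 9, 28, 23, 0))
record(partitions, "/post", "bob", datetime(2020, 9, 29, 9, 0))
record(partitions, "/post", "alice", datetime(2020, 9, 29, 12, 0))
print(recent_visitors(partitions, "/post", date(2020, 9, 29), limit=3))
# ['alice', 'bob', 'carol']
```

The trade-off is that a quiet blog post may need several partition reads to collect 10 visitors, while a busy one is usually answered from the newest bucket alone.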
