Query a table in different ways or orderings in Cassandra

I've recently started to play around with Cassandra. My understanding is that in a Cassandra table you define 2 keys, which can be either single column or composites:

  1. The Partitioning Key: determines how to distribute data across nodes
  2. The Clustering Key: determines the order in which records that share the same partitioning key (i.e. within the same node) are written. This is also the order in which the records will be read.

Data from a table will always be sorted in the same order, which is the order of the clustering key column(s). So a table must be designed for a specific query.

But what if I need to perform 2 different queries on the data from a table? What is the best way to solve this when using Cassandra?

Example Scenario

Let's say I have a simple table containing posts that users have written:

CREATE TABLE posts (
  username varchar,
  creation timestamp,
  content varchar,
  PRIMARY KEY ((username), creation)
);

This table was "designed" to perform the following query, which works very well for me:

SELECT * FROM posts WHERE username='luke' [ORDER BY creation DESC];

Queries

But what if I need to get all posts regardless of the username, in order of time:

Query (1): SELECT * FROM posts ORDER BY creation;

Or get the posts in alphabetical order of the content:

Query (2): SELECT * FROM posts WHERE username='luke' ORDER BY content;

I know that it's not possible given the table I created, but what are the alternatives and best practices to solve this?

Solution Ideas

Here are a few ideas spawned from my imagination (just to show that at least I tried):

  • Querying with the IN clause to select posts from many users. This could help in Query (1). When using the IN clause, you can fetch globally sorted results if you disable paging. But using the IN clause quickly leads to bad performance when the number of usernames grows.
  • Maintaining full copies of the table for each query, each copy using its own PRIMARY KEY adapted to the query it is trying to serve.
  • Having a main table with a UUID as partitioning key. Then creating smaller copies of the table for each query, which only contain the (key) columns useful for their own sort order, and the UUID for each row of the main table. The smaller tables would serve only as "sorting indexes" to query a list of UUID as result, which can then be fetched using the main table.

I'm new to NoSQL, and I just want to know what the correct/durable/efficient way of doing this is.

Answer

Question 1:

Depending on the range of times you're interested in, I bet you could model this with time buckets.

You can do this by making the partition key (PK) a year, year-month, or year-month-day (or a finer time interval), depending on your use case.

The basic idea is that you bucket changes into whatever interval suits your use case. For example:

  • If you often need to search these posts over months in the past, then you may want to use the year as the PK.
  • If you usually need to search the posts over several days in the past, then you may want to use a year-month as the PK.
  • If you usually need to search the posts from yesterday or the last couple of days, then you may want to use a year-month-day as your PK.

I'll give a fleshed-out example with yyyy-mm-dd as the PK:

The table will now be:

CREATE TABLE posts_by_creation (
  creation_year int,
  creation_month int,
  creation_day int,
  creation timeuuid,
  username text,  -- using text instead of varchar, they're essentially the same
  content text,
  PRIMARY KEY ((creation_year,creation_month,creation_day), creation)
);

I changed creation to be a timeuuid to guarantee a unique row for each post creation event. If we used just a timestamp you could theoretically overwrite an existing post creation record in here.
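To make the overwrite risk concrete, here is a small Python sketch that models one partition as a dictionary keyed by the clustering column. It only simulates upsert behaviour (an INSERT that reuses an existing primary key replaces the row) and does not talk to Cassandra. The `insert` helper and the sample values are hypothetical, and Python's `uuid.uuid1()` stands in for a timeuuid.

```python
import uuid


def insert(partition: dict, clustering_key, row: dict) -> None:
    """Mimic an upsert: a row with an existing clustering key is replaced."""
    partition[clustering_key] = row


# Clustering on a plain timestamp: two posts created in the same millisecond
# share a key, so the second insert silently replaces the first.
by_timestamp = {}
same_millisecond = 1459598400000  # hypothetical epoch-millisecond value
insert(by_timestamp, same_millisecond, {"content": "content update1"})
insert(by_timestamp, same_millisecond, {"content": "content update2"})

# Clustering on a time-based UUID: each call yields a distinct key,
# so both posts are kept.
by_timeuuid = {}
insert(by_timeuuid, uuid.uuid1(), {"content": "content update1"})
insert(by_timeuuid, uuid.uuid1(), {"content": "content update2"})

print(len(by_timestamp), len(by_timeuuid))
```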

Now we can insert rows, setting the Partition Key (PK) columns creation_year, creation_month, and creation_day from the current creation time:

INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update1');
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update2');

now() is a CQL function that generates a timeUUID. You would probably want to generate this in the application instead, parse out the yyyy-mm-dd for the PK, and then insert the timeUUID in the clustering column.
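As a rough sketch of doing this on the application side, the Python below generates a version 1 (time-based) UUID with the standard library and derives the year, month, and day bucket from the timestamp embedded in it. The helper names `timeuuid_to_datetime` and `bucket_for` are made up for illustration, and the bucket is computed in UTC, which is an assumption; use whichever time zone your queries are phrased in.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Number of 100-nanosecond intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(u: uuid.UUID) -> datetime:
    """Return the UTC timestamp embedded in a version 1 UUID."""
    if u.version != 1:
        raise ValueError("expected a version 1 (time-based) UUID")
    seconds, remainder = divmod(u.time - UUID_EPOCH_OFFSET, 10_000_000)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return base + timedelta(microseconds=remainder // 10)


def bucket_for(u: uuid.UUID) -> tuple:
    """Return (creation_year, creation_month, creation_day) for a timeuuid."""
    ts = timeuuid_to_datetime(u)
    return ts.year, ts.month, ts.day


creation = uuid.uuid1()               # generated in the application, not by now()
year, month, day = bucket_for(creation)
print(year, month, day, creation)
```

The three bucket values and `creation` can then be bound to the INSERT statement in place of the literals and now().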

As a use case for this table, let's say you wanted to see all of the changes today. Your CQL would look like:

SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2;

Or if you wanted to find all of the changes today after 5pm (UTC-06:00):

SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2 AND creation >= minTimeuuid('2016-04-02 17:00-0600');

minTimeuuid() is another CQL function. It creates the smallest possible timeUUID for the given time, which guarantees that you get all of the changes from that time onward.

Depending on the time spans, you may need to query a few different partition keys, but that shouldn't be hard to implement. You would also want to change the creation column to a timeuuid in your other table.
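Querying several partition keys for a time span could look roughly like the Python sketch below. It only enumerates the daily buckets and pairs each one with a parameterized query; `day_buckets` is a hypothetical helper, the `%s` placeholder style is an assumption about your driver, and actually executing the statements is left to whichever client you use.

```python
from datetime import date, timedelta


def day_buckets(start: date, end: date):
    """Yield (year, month, day) for every day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current.year, current.month, current.day
        current += timedelta(days=1)


QUERY = (
    "SELECT * FROM posts_by_creation "
    "WHERE creation_year = %s AND creation_month = %s AND creation_day = %s"
)

# One (statement, parameters) pair per daily partition in the range.
statements = [
    (QUERY, bucket)
    for bucket in day_buckets(date(2016, 3, 30), date(2016, 4, 2))
]

for statement, params in statements:
    print(statement, params)
```

Each partition comes back in clustering order, so concatenating the results bucket by bucket keeps the overall time ordering.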

Question 2:

You'll have to create another table or use materialized views to support this new query pattern, just like you thought.

Lastly, if you're not on Cassandra 3.x+ or don't want to use materialized views, you can use atomic batches to ensure data consistency across your several denormalized tables (that's what they were designed for). So in your case it would be a BATCH statement with 3 inserts of the same data into the 3 different tables that support your query patterns.
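A sketch of what such a batch might look like, assembled as a CQL string in Python. Note that `posts_by_content` is a hypothetical third table for Query (2) (for example with PRIMARY KEY ((username), content, creation)); it is not defined anywhere above. The `?` bind markers assume a prepared statement, `build_post_batch` is an illustrative helper, and `creation` is assumed to be a timeuuid in all three tables.

```python
import uuid


def build_post_batch(username, creation, content, year, month, day):
    """Return (cql, params) for one batch that writes a post to all three tables."""
    inserts = [
        ("INSERT INTO posts (username, creation, content) VALUES (?, ?, ?)",
         [username, creation, content]),
        ("INSERT INTO posts_by_creation "
         "(creation_year, creation_month, creation_day, creation, username, content) "
         "VALUES (?, ?, ?, ?, ?, ?)",
         [year, month, day, creation, username, content]),
        # Hypothetical table serving Query (2); adjust to your real schema.
        ("INSERT INTO posts_by_content (username, content, creation) VALUES (?, ?, ?)",
         [username, content, creation]),
    ]
    body = "".join(f"  {statement};\n" for statement, _ in inserts)
    cql = "BEGIN BATCH\n" + body + "APPLY BATCH;"
    params = [value for _, values in inserts for value in values]
    return cql, params


cql, params = build_post_batch(
    "fromanator", uuid.uuid1(), "content update1", 2016, 4, 2
)
print(cql)
```

Binding `params` in order to the prepared batch writes the same post to every table in one request.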
