Data model in Cassandra and proper deletion strategy


Problem description

I have the following table in Cassandra:

CREATE TABLE article ( 
id text, 
price int, 
validFrom timestamp,     
PRIMARY KEY (id, validFrom)
) WITH CLUSTERING ORDER BY (validFrom DESC);

It holds articles and their historical price information (validFrom is the timestamp of a new price). Article prices change often. I want to query for:

  1. The historical prices of a specific article.
  2. The last price of an article.

From my understanding, I can solve both problems with the following query: select id, price from article where id = X and validFrom < Y limit 1; The query restricts on the article id, which is the partition key. Since the clustering order is based on the validFrom timestamp in descending order, Cassandra can perform this query efficiently. Am I getting this right?
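As a rough sketch of how the two access patterns map onto this table, the queries could look like the following. The id 'X' and the timestamp literal are placeholders, not values from the actual data set.

```cql
-- Price history of one article; rows come back newest first,
-- following the table's clustering order (validfrom DESC)
SELECT validfrom, price FROM article WHERE id = 'X';

-- Last price of an article: the first row of the partition
SELECT price FROM article WHERE id = 'X' LIMIT 1;

-- Price that was in effect just before a given point in time
SELECT id, price FROM article
WHERE id = 'X' AND validfrom < '2015-12-31' LIMIT 1;
```

All three statements restrict the partition key, so each one reads from a single partition.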

What is the best approach to deleting old data (housekeeping)? Let's assume I want to delete all articles with validFrom > 20150101 and validFrom < 20151231. Since I don't have the primary key of those rows, this would be inefficient, even if I use an index on validFrom, right? How can I achieve this?

Recommended answer

You can use external tools for that:

  • Spark with the Spark Cassandra Connector (even in local mode). The code could look as follows (note that I'm using validfrom as the column name, not validFrom, because the name is not quoted in your schema and Cassandra therefore stores it in lowercase):
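import com.datastax.spark.connector._

// sc is the SparkContext (predefined in spark-shell)
// Select only the primary key columns of the rows that fall into the time range
val data = sc.cassandraTable("test", "article")
   .where("validfrom >= '2020-07-28T11:50:00Z' AND validfrom < '2020-07-28T12:50:00Z'")
   .select("id", "validfrom")
// Delete exactly those rows, matching on the full primary key
data.deleteFromCassandra("test", "article", keyColumns = SomeColumns("id", "validfrom"))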

  • DSBulk: use it to find the matching entries and write them to a file (output.csv in my case), then delete them:
bin/dsbulk unload -url output.csv \
  -query "SELECT id, validfrom FROM test.article WHERE token(id) > :start AND token(id) <= :end AND validfrom >= '2020-07-28T11:50:00Z' AND validfrom < '2020-07-28T12:50:00Z' ALLOW FILTERING"
bin/dsbulk load -query "DELETE FROM test.article WHERE id = :id AND validfrom = :validfrom" \
  -url output.csv
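If the ids of the affected articles are already known, a plain CQL range delete may be enough without any external tool. This is only a sketch under assumptions: it requires Cassandra 3.0 or later, where DELETE accepts a range on a clustering column as long as the partition key is fully specified. The id 'X' is a placeholder, and the dates are the ones from the question written as timestamp literals.

```cql
-- Delete one article's prices inside the time window
-- (assumes Cassandra 3.0+; 'X' is a placeholder id)
DELETE FROM test.article
WHERE id = 'X'
  AND validfrom > '2015-01-01'
  AND validfrom < '2015-12-31';
```

This has to be repeated per article id, which is why the tools above are the better fit when the ids are not known in advance.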
      
