根据时间范围删除Cassandra表中的记录 [英] Delete records in Cassandra table based on time range

查看:148
本文介绍了根据时间范围删除Cassandra表中的记录的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个带模式的Cassandra表:

I have a Cassandra table with schema:

CREATE TABLE IF NOT EXISTS TestTable(
    documentId text,
    sequenceNo bigint,
    messageData blob,
    clientId text
    PRIMARY KEY(documentId, sequenceNo))
WITH CLUSTERING ORDER BY(sequenceNo DESC);

是否可以删除在给定时间范围内插入的记录?我知道在内部,Cassandra必须使用一些时间戳来跟踪每条记录的插入时间,而TTL等功能会使用该时间.

Is there a way to delete the records which were inserted between a given time range? I know internally Cassandra must be using some timestamp to track the insertion time of each record, which would be used by features like TTL.

由于在给定的架构中没有用于插入时间戳的显式列,是否有使用隐式时间戳的方法还是有更好的方法?

Since there is no explicit column for insertion timestamp in the given schema, is there a way to use the implicit timestamp or is there any better approach?

插入后永远不会更新记录.

There is never any update to the records after insertion.

推荐答案

这是一个有趣的问题...

It's an interesting question...

所有不属于主键的列都具有所谓的WriteTime,可以使用CQL的 writetime(column_name)函数(警告:不适用于集合列,并且为UDT返回null!).但是,由于CQL中没有嵌套查询,因此您将需要编写程序来获取数据,按WriteTime筛选出条目并删除WriteTime早于阈值的条目.(请注意 writetime 的值以微秒为单位,而不是CQL的 timestamp 类型以毫秒为单位).

All columns that aren't part of the primary key have so-called WriteTime that could be retrieved using the writetime(column_name) function of CQL (warning: it doesn't work with collection columns, and return null for UDTs!). But because we don't have nested queries in the CQL, you will need to write a program to fetch data, filter out entries by WriteTime, and delete entries where WriteTime is older than your threshold. (note that value of writetime is in microseconds, not milliseconds as in CQL's timestamp type).

最简单的方法是使用

The easiest way is to use Spark Cassandra Connector's RDD API, something like this:

val timestamp = someDate.toInstant.getEpochSecond * 1000L
val oldData = sc.cassandraTable(srcKeyspace, srcTable)
      .select("prk1", "prk2", "reg_col".writeTime as "writetime")
      .filter(row => row.getLong("writetime") < timestamp)
oldData.deleteFromCassandra(srcKeyspace, srcTable, 
      keyColumns = SomeColumns("prk1", "prk2"))

其中: prk1 prk2 ,...是主键的所有组件( documentId sequenceNo (根据您的情况),以及 reg_col -表中不是集合或UDT的任何常规"列(例如, clientId ).重要的是, select deleteFromCassandra 中的主键列的列表必须相同.

where: prk1, prk2, ... are all components of the primary key (documentId and sequenceNo in your case), and reg_col - any of the "regular" columns of the table that isn't collection or UDT (for example, clientId). It's important that list of the primary key columns in select and deleteFromCassandra was the same.

这篇关于根据时间范围删除Cassandra表中的记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆