根据时间范围删除Cassandra表中的记录 [英] Delete records in Cassandra table based on time range
问题描述
我有一个带模式的Cassandra表:
I have a Cassandra table with schema:
CREATE TABLE IF NOT EXISTS TestTable(
documentId text,
sequenceNo bigint,
messageData blob,
clientId text
PRIMARY KEY(documentId, sequenceNo))
WITH CLUSTERING ORDER BY(sequenceNo DESC);
是否可以删除在给定时间范围内插入的记录?我知道在内部,Cassandra必须使用一些时间戳来跟踪每条记录的插入时间,而TTL等功能会使用该时间.
Is there a way to delete the records which were inserted between a given time range? I know internally Cassandra must be using some timestamp to track the insertion time of each record, which would be used by features like TTL.
由于在给定的架构中没有用于插入时间戳的显式列,是否有使用隐式时间戳的方法还是有更好的方法?
Since there is no explicit column for insertion timestamp in the given schema, is there a way to use the implicit timestamp or is there any better approach?
插入后永远不会更新记录.
There is never any update to the records after insertion.
推荐答案
这是一个有趣的问题...
It's an interesting question...
所有不属于主键的列都具有所谓的WriteTime,可以使用CQL的 writetime(column_name)
函数(警告:不适用于集合列,并且为UDT返回null!).但是,由于CQL中没有嵌套查询,因此您将需要编写程序来获取数据,按WriteTime筛选出条目并删除WriteTime早于阈值的条目.(请注意, writetime
的值以微秒为单位,而不是CQL的 timestamp
类型以毫秒为单位).
All columns that aren't part of the primary key have so-called WriteTime that could be retrieved using the writetime(column_name)
function of CQL (warning: it doesn't work with collection columns, and return null for UDTs!). But because we don't have nested queries in the CQL, you will need to write a program to fetch data, filter out entries by WriteTime, and delete entries where WriteTime is older than your threshold. (note that value of writetime
is in microseconds, not milliseconds as in CQL's timestamp
type).
The easiest way is to use Spark Cassandra Connector's RDD API, something like this:
val timestamp = someDate.toInstant.getEpochSecond * 1000L
val oldData = sc.cassandraTable(srcKeyspace, srcTable)
.select("prk1", "prk2", "reg_col".writeTime as "writetime")
.filter(row => row.getLong("writetime") < timestamp)
oldData.deleteFromCassandra(srcKeyspace, srcTable,
keyColumns = SomeColumns("prk1", "prk2"))
其中: prk1
, prk2
,...是主键的所有组件( documentId
和 sequenceNo
(根据您的情况),以及 reg_col
-表中不是集合或UDT的任何常规"列(例如, clientId
).重要的是, select
和 deleteFromCassandra
中的主键列的列表必须相同.
where: prk1
, prk2
, ... are all components of the primary key (documentId
and sequenceNo
in your case), and reg_col
- any of the "regular" columns of the table that isn't collection or UDT (for example, clientId
). It's important that list of the primary key columns in select
and deleteFromCassandra
was the same.
这篇关于根据时间范围删除Cassandra表中的记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!