Spark writing to Cassandra with varying TTL
Question
In Java Spark, I have a dataframe that has a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to.
I want to write the dataframe to a Cassandra DB. The data must be written with a TTL that depends on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured.
Currently I am writing to Cassandra with Spark using a constant TTL, with the following code:
df.write().format("org.apache.spark.sql.cassandra")
    .options(new HashMap<String, String>() {
        {
            put("keyspace", "key_space_name");
            put("table", "table_name");
            // Constant for now; should depend on the bucket_timestamp column
            put("spark.cassandra.output.ttl", Long.toString(CONST_TTL));
        }
    }).mode(SaveMode.Overwrite).save();
One possible way I thought about is to filter the data for each possible bucket_timestamp, calculate the TTL, and write the filtered data to Cassandra. But this seems very inefficient and not the Spark way. Is there a way in Java Spark to provide a Spark column as the TTL option, so that the TTL differs for each row?
The solution should work with Java and Dataset<Row>: I came across some solutions that do this with RDDs in Scala, but didn't find one that uses Java and dataframes.
Thanks!
Recommended answer
For the DataFrame API there is no support for such functionality yet. There is a JIRA ticket for it, https://datastax-oss.atlassian.net/browse/SPARKC-416, which you can watch to get notified when it's implemented.
So the only choice you have is to use the RDD API, as described in @bartosz25's answer.
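Whichever write path is used, each row first needs its own TTL value. Below is a minimal plain-Java sketch of the formula from the question, ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp). It assumes that bucket_timestamp and the current time are both epoch seconds and that CONST_TTL is in seconds; the class name, method names, and the one-day constant are hypothetical and not part of any connector API. Since Cassandra treats a TTL of 0 as "no expiration" and rejects negative values, the sketch also flags rows whose computed TTL is not positive so they can be skipped rather than written.

```java
import java.time.Instant;

// Hypothetical helper, not part of Spark or the Cassandra connector.
final class RowTtl {
    // Assumed configuration value: base TTL of one day, in seconds.
    static final long CONST_TTL_SECONDS = 86_400L;

    /**
     * ROW_TTL = CONST_TTL - (currentTime - bucketTimestamp), all in seconds.
     * A result <= 0 means the row has already outlived its intended lifetime.
     */
    static long rowTtlSeconds(long constTtlSeconds, long nowEpochSeconds, long bucketEpochSeconds) {
        return constTtlSeconds - (nowEpochSeconds - bucketEpochSeconds);
    }

    /** Only rows with a strictly positive TTL should be written. */
    static boolean shouldWrite(long ttlSeconds) {
        return ttlSeconds > 0;
    }

    public static void main(String[] args) {
        long now = Instant.now().getEpochSecond();
        long bucket = now - 10_000L; // a bucket that started 10,000 seconds ago
        long ttl = rowTtlSeconds(CONST_TTL_SECONDS, now, bucket);
        System.out.println("ttl=" + ttl + " write=" + shouldWrite(ttl));
    }
}
```

The same arithmetic can be applied per row to produce a TTL value alongside the data before handing the rows to the RDD-based writer mentioned above.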