Spark writing to Cassandra with varying TTL


Question

In Java Spark, I have a dataframe with a 'bucket_timestamp' column, which holds the time of the bucket that each row belongs to.

I want to write the dataframe to a Cassandra DB. The data must be written with a TTL that depends on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured.

Currently I am writing to Cassandra from Spark with a constant TTL, using the following code:

df.write().format("org.apache.spark.sql.cassandra")
            .options(new HashMap<String, String>() {
                {
                    put("keyspace", "key_space_name");
                    put("table", "table_name");
                    // Constant TTL for every row; it should instead depend on the bucket_timestamp column
                    put("spark.cassandra.output.ttl", Long.toString(CONST_TTL));
                }
            }).mode(SaveMode.Overwrite).save();

One approach I considered is, for each possible bucket_timestamp, to filter the data by that timestamp, calculate the TTL, and write the filtered data to Cassandra. But this seems very inefficient and not the Spark way. Is there a way in Java Spark to provide a Spark column as the TTL option, so that the TTL differs for each row?

The solution should work with Java and Dataset<Row>: I came across some solutions that do this with RDDs in Scala, but did not find one that uses Java and dataframes.

Thanks!

Recommended answer

The DataFrame API does not support this functionality yet. There is a JIRA ticket for it, https://datastax-oss.atlassian.net/browse/SPARKC-416, which you can watch to be notified when it is implemented.

So the only choice you have is to use the RDD API, as described in @bartosz25's answer.
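
Whichever write path is used, the per-row value from the question, ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), has to be computed first. Below is a minimal sketch of that arithmetic in plain Java. The class and method names (RowTtl, secondsRemaining) are hypothetical, and it assumes that bucket_timestamp and the current time are epoch seconds and that CONST_TTL is in seconds; the question does not state the units. It does not call the connector.

```java
// Hypothetical helper, not part of Spark or the Cassandra connector.
final class RowTtl {
    private RowTtl() {}

    /**
     * ROW_TTL = CONST_TTL - (currentTime - bucketTimestamp), all in seconds.
     * Assumption: both timestamps are epoch seconds.
     * The result is zero or negative when the bucket is already older than CONST_TTL.
     */
    static long secondsRemaining(long constTtlSeconds,
                                 long bucketEpochSeconds,
                                 long nowEpochSeconds) {
        long ageSeconds = nowEpochSeconds - bucketEpochSeconds;
        return constTtlSeconds - ageSeconds;
    }

    public static void main(String[] args) {
        long constTtl = 24L * 60 * 60;   // one day, an illustrative CONST_TTL
        long bucket = 1_000_000L;        // illustrative bucket_timestamp
        long now = bucket + 3_600L;      // one hour after the bucket time
        System.out.println(secondsRemaining(constTtl, bucket, now)); // prints 82800
    }
}
```

Rows whose result is zero or negative are already past their intended lifetime. They are probably best filtered out before writing, because Cassandra treats a TTL of 0 as "no expiration" rather than "expire immediately".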
