Spark SQL and Cassandra JOIN


Problem description

My Cassandra schema contains a table with a partition key which is a timestamp, and a parameter column which is a clustering key.

Each partition contains 10k+ rows. This is logging data at a rate of 1 partition per second.

On the other hand, users can define "datasets", and I have another table which contains the "dataset name" as its partition key and, as a clustering column, a timestamp referring to the other table (so a "dataset" is a list of partition keys).
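Concretely, the two tables described above might look like this in CQL (a hypothetical sketch; the keyspace, table, and column names are assumptions, not taken from the question):

```sql
-- Hypothetical log table: one partition per second, 10k+ rows each.
CREATE TABLE ks.data (
    timestamp timestamp,   -- partition key
    parameter text,        -- clustering key
    value     double,
    PRIMARY KEY ((timestamp), parameter)
);

-- Hypothetical dataset table: a named list of partition keys of ks.data.
CREATE TABLE ks.datasets (
    name      text,        -- partition key: the dataset name
    timestamp timestamp,   -- clustering key: refers to a partition of ks.data
    PRIMARY KEY ((name), timestamp)
);
```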

Of course, what I would like to do looks like an anti-pattern for Cassandra, as I'd like to join two tables.

However, using Spark SQL I can run such a query and perform the JOIN:

SELECT * FROM datasets JOIN data
    ON data.timestamp = datasets.timestamp
WHERE datasets.name = 'my_dataset'

Now the question is: is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?

Answer

(Answer updated with regard to join optimization.)

is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?

No. In fact, since you provide the partition key for the datasets table, the Spark/Cassandra connector will perform predicate push down and execute the partition restriction directly in Cassandra with CQL. But there will be no predicate push down for the join operation itself unless you use the RDD API with joinWithCassandraTable().
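The RDD-level join mentioned above could be sketched as follows in Scala (a minimal sketch, not runnable without a live Spark + Cassandra cluster; the keyspace "ks", the table and column names, and the connection host are assumptions):

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object DatasetJoin {
  def main(args: Array[String]): Unit = {
    // Assumed connection settings; adjust to your cluster.
    val conf = new SparkConf()
      .setAppName("dataset-join")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read the single partition of "datasets" we care about.
    // The name restriction is pushed down to Cassandra as CQL.
    val timestamps = sc.cassandraTable("ks", "datasets")
      .where("name = ?", "my_dataset")
      .select("timestamp")

    // For each timestamp (a partition key of "data"), fetch only the
    // matching partition: joinWithCassandraTable issues targeted CQL
    // queries instead of scanning the whole "data" table.
    val joined = timestamps.joinWithCassandraTable("ks", "data")

    joined.collect().foreach(println)
  }
}
```

By default, joinWithCassandraTable joins on the partition key of the target table, so the left-hand rows only need to carry a column named like that key (here, timestamp).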

See here for all possible predicate push down situations: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/BasicCassandraPredicatePushDown.scala
