Spark SQL and Cassandra JOIN

Question

My Cassandra schema contains a table with a partition key which is a timestamp, and a parameter column which is a clustering key.

Each partition contains 10k+ rows. This is logging data at a rate of 1 partition per second.

On the other hand, users can define "datasets", and I have another table whose partition key is the "dataset name" and whose clustering column is a timestamp referring to the other table (so a "dataset" is a list of partition keys).

Of course what I would like to do looks like an anti-pattern for Cassandra as I'd like to join two tables.

However using Spark SQL I can run such a query and perform the JOIN.

SELECT * FROM datasets JOIN data
    WHERE data.timestamp = datasets.timestamp AND datasets.name = 'my_dataset'

Now the question is: is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?

Recommended answer

Edit: fixed the answer with regard to join optimization.

is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?

No. In fact, since you provide the partition key for the datasets table, the Spark/Cassandra connector will perform predicate push down and execute the partition restriction directly in Cassandra with CQL. But there will be no predicate push down for the join operation itself unless you use the RDD API with joinWithCassandraTable().
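
As a rough illustration of the RDD route mentioned above, here is a minimal sketch in Scala. It assumes the spark-cassandra-connector is on the classpath, and the keyspace name (my_keyspace), the connection host, and the DataKey case class are hypothetical names introduced only for this example. It also assumes the partition key column of the data table is called timestamp, as in the query from the question.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical key class: its field name must match the partition key
// column of the `data` table (assumed here to be `timestamp`).
case class DataKey(timestamp: java.util.Date)

object DatasetJoin {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("dataset-join")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed host
    val sc = new SparkContext(conf)

    // Step 1: read only the 'my_dataset' partition of `datasets`.
    // The where clause is sent to Cassandra as a CQL restriction.
    val keys = sc.cassandraTable[DataKey]("my_keyspace", "datasets")
      .select("timestamp")
      .where("name = ?", "my_dataset")

    // Step 2: for each key, query the matching partition of `data`
    // instead of scanning the whole table.
    val joined = keys.joinWithCassandraTable("my_keyspace", "data")

    // Each element pairs a key with one matching row of `data`.
    joined.take(10).foreach { case (key, row) =>
      println(s"${key.timestamp} -> $row")
    }

    sc.stop()
  }
}
```

With this approach the join is driven by the keys read from datasets, so only the partitions of data listed in the dataset should be requested from Cassandra.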

See here for all possible predicate push down situations: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/BasicCassandraPredicatePushDown.scala
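
To check what is pushed down for a specific query, one option is to print the query plan. The sketch below assumes Spark 2.x or later (SparkSession) with the connector's DataFrame source; the keyspace name and connection host are hypothetical. The exact wording of the plan output depends on the Spark and connector versions, so treat it as something to inspect rather than a fixed format.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object InspectPushdown {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("inspect-pushdown")
      .config("spark.cassandra.connection.host", "127.0.0.1") // assumed host
      .getOrCreate()

    // Load a Cassandra table as a DataFrame (keyspace name is hypothetical).
    def cassandraTable(name: String): DataFrame = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> name))
      .load()

    cassandraTable("datasets").createOrReplaceTempView("datasets")
    cassandraTable("data").createOrReplaceTempView("data")

    // Print the logical and physical plans; the scan nodes show which
    // filters, if any, were handed to the Cassandra source.
    spark.sql(
      """SELECT * FROM datasets JOIN data
        |  ON data.timestamp = datasets.timestamp
        |  WHERE datasets.name = 'my_dataset'""".stripMargin
    ).explain(true)

    spark.stop()
  }
}
```

Comparing the scan node for datasets with the one for data shows whether the name restriction reached Cassandra and whether the data table is restricted by timestamp or read in full.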
