Spark SQL and Cassandra JOIN


Problem description

My Cassandra schema contains a table with a partition key which is a timestamp, and a parameter column which is a clustering key.

Each partition contains 10k+ rows. This is logging data at a rate of 1 partition per second.

On the other hand, users can define "datasets", and I have another table which contains the "dataset name" as its partition key and, as a clustering column, a timestamp referring to the other table (so a "dataset" is a list of partition keys).
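Concretely, the two tables described above might look like this in CQL (a hypothetical sketch; the keyspace, table, and column names are assumptions, not taken from the question):

```sql
-- Hypothetical log table: one partition per second, 10k+ rows each.
CREATE TABLE ks.data (
    timestamp timestamp,   -- partition key
    parameter text,        -- clustering key
    value     double,
    PRIMARY KEY ((timestamp), parameter)
);

-- Hypothetical dataset table: a named list of partition keys of ks.data.
CREATE TABLE ks.datasets (
    name      text,        -- partition key: the dataset name
    timestamp timestamp,   -- clustering key: refers to a partition of ks.data
    PRIMARY KEY ((name), timestamp)
);
```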

Of course, what I would like to do looks like an anti-pattern for Cassandra, as I'd like to join two tables.

However, using Spark SQL I can run such a query and perform the JOIN:

SELECT * FROM datasets JOIN data
    ON data.timestamp = datasets.timestamp
WHERE datasets.name = 'my_dataset'

Now the question is: is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?

Answer

(Answer updated with regard to join optimization.)

is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?

No. In fact, since you provide the partition key for the datasets table, the Spark/Cassandra connector will perform predicate push down and execute the partition restriction directly in Cassandra with CQL. But there will be no predicate push down for the join operation itself unless you use the RDD API with joinWithCassandraTable().
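The RDD-level join mentioned above could be sketched as follows in Scala (a minimal sketch, not runnable without a live Spark + Cassandra cluster; the keyspace "ks", the table and column names, and the connection host are assumptions):

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object DatasetJoin {
  def main(args: Array[String]): Unit = {
    // Assumed connection settings; adjust to your cluster.
    val conf = new SparkConf()
      .setAppName("dataset-join")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read the single partition of "datasets" we care about.
    // The name restriction is pushed down to Cassandra as CQL.
    val timestamps = sc.cassandraTable("ks", "datasets")
      .where("name = ?", "my_dataset")
      .select("timestamp")

    // For each timestamp (a partition key of "data"), fetch only the
    // matching partition: joinWithCassandraTable issues targeted CQL
    // queries instead of scanning the whole "data" table.
    val joined = timestamps.joinWithCassandraTable("ks", "data")

    joined.collect().foreach(println)
  }
}
```

By default, joinWithCassandraTable joins on the partition key of the target table, so the left-hand rows only need to carry a column named like that key (here, timestamp).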

See here for all possible predicate push down situations: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/BasicCassandraPredicatePushDown.scala
