How can you pushdown predicates to Cassandra or limit requested data when using Pyspark / Dataframes?

Question

For example, on docs.datastax.com we mention:

table1 = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="ks").load()

and it's the only way I know, but let's say that I want to load only the last one million entries from this table. I don't want to load the whole table into memory every time, especially if this table has, for example, over 10 million entries.

Thanks!

Answer

While you can't load the data any faster, you can load portions of the data or terminate early. Spark DataFrames utilize Catalyst to optimize their underlying query plans, which enables them to take some shortcuts.

For example, calling limit will allow Spark to skip reading some portions from the underlying DataSource. This limits the amount of data read from Cassandra by preventing some tasks from being executed.
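
A minimal PySpark sketch of this (reusing the hypothetical ks.kv table from the question; note that limit bounds how many rows are read, but without an ordering it does not choose which rows you get):

df = (sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="kv", keyspace="ks")
      .load())

# limit() lets Spark skip reading some portions of the table instead of
# pulling every row into memory; an action then triggers the bounded read.
first_million = df.limit(1000000)
first_million.count()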

Calling filter, or adding filters, can be utilized by the underlying DataSource to help restrict the amount of information actually pulled from Cassandra. There are limitations on what can be pushed down, but this is all detailed in the documentation:

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md#pushing-down-clauses-to-cassandra
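
As an illustrative sketch of checking what actually gets pushed down (the column names are assumptions: say ts is a clustering column and v is a regular, unindexed column), explain() prints the physical plan, where pushed predicates appear under PushedFilters:

from pyspark.sql.functions import col

df = (sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="kv", keyspace="ks")
      .load())

# A range predicate on a clustering column can typically be pushed down;
# look for a PushedFilters entry such as [GreaterThan(ts,1000)] in the plan.
df.filter(col("ts") > 1000).explain()

# A predicate on a regular, unindexed column generally cannot be pushed
# down, so Spark reads the rows and applies this filter itself.
df.filter(col("v") > 1000).explain()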

Note that all of this is accomplished simply by making further API calls on your DataFrame once you load it. For example:

df = (sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="kv", keyspace="ks")
      .load())

df.show(10)  # Will compute only enough tasks to get 10 records and no more
df.filter(df.clusteringKey > 5).show()  # Will push the clustering predicate down to C*
