Apache Spark SQL is taking forever to count billion rows from Cassandra?


Question

I have the following code.

I invoke spark-shell as follows:

./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 --executor-memory 15G --executor-cores 12 --conf spark.cassandra.input.split.size_in_mb=67108864

Code:

scala> val df = spark.sql("SELECT test from hello") // Billion rows in hello and test column is 1KB

df: org.apache.spark.sql.DataFrame = [test: binary]

scala> df.count

[Stage 0:>   (0 + 2) / 13] // I don't know what these numbers mean precisely.

If I invoke spark-shell as follows:

./spark-shell --conf spark.cassandra.connection.host=170.99.99.134

Code:

val df = spark.sql("SELECT test from hello") // This has about billion rows

scala> df.count


[Stage 0:=>  (686 + 2) / 24686] // What are these numbers precisely?

Neither of these versions worked; Spark keeps running forever, and I have been waiting for more than 15 minutes with no response. Any ideas on what could be wrong and how to fix this?

I am using Spark 2.0.2 and spark-cassandra-connector_2.11-2.0.0-M3.jar.

Answer

Dataset.count is slow because it is not very smart when it comes to external data sources. It rewrites the query as (which is good):

SELECT COUNT(1) FROM table

but instead of pushing the COUNT down to the source, it just executes:

SELECT 1 FROM table

against the source (it will fetch a billion ones in your case) and then aggregates locally to get the final result. The numbers you see are task counters.
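You can observe this yourself by inspecting the physical plan. A minimal sketch, assuming the same session and table as above (Dataset.count is implemented internally via groupBy().count(), so explaining that query shows what count actually executes):

// Dataset.count boils down to groupBy().count(); explain() prints the
// physical plan, which shows a local aggregate sitting on top of a full
// scan of the Cassandra relation rather than a pushed-down COUNT.
val df = spark.sql("SELECT test FROM hello")
df.groupBy().count().explain()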

There is an optimized cassandraCount operation on CassandraRDD:

sc.cassandraTable(keyspace, table).cassandraCount

More about server-side operations can be found in the documentation.
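For completeness, a minimal self-contained sketch of the pushed-down count (the keyspace name here is an assumption; substitute your own):

import com.datastax.spark.connector._

// cassandraCount runs the count on the Cassandra side, so only the
// partial counts travel back over the network instead of the rows.
val rows = sc.cassandraTable("my_keyspace", "hello").cassandraCount()
println(s"Rows in hello: $rows")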
