Apache Spark SQL is taking forever to count billion rows from Cassandra?


Question

I have the following code.

I invoke spark-shell as follows:

./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 --executor-memory 15G --executor-cores 12 --conf spark.cassandra.input.split.size_in_mb=67108864

Code

scala> val df = spark.sql("SELECT test from hello") // Billion rows in hello and test column is 1KB

df: org.apache.spark.sql.DataFrame = [test: binary]

scala> df.count

[Stage 0:>   (0 + 2) / 13] // I don't know what these numbers mean precisely.

If I invoke spark-shell as follows:

./spark-shell --conf spark.cassandra.connection.host=170.99.99.134

Code

val df = spark.sql("SELECT test from hello") // This has about billion rows

scala> df.count


[Stage 0:=>  (686 + 2) / 24686] // What are these numbers precisely?

Neither version works: Spark keeps running forever, and I have been waiting for more than 15 minutes with no response. Any ideas on what could be wrong and how to fix this?

I am using Spark 2.0.2 and spark-cassandra-connector_2.11-2.0.0-M3.jar.

Answer

Dataset.count is slow here because it is not very smart about external data sources. It rewrites the query as (which is fine):

SELECT COUNT(1) FROM table

but instead of pushing the COUNT down to the source, it just executes:

SELECT 1 FROM table

against the source (in your case it will fetch a billion ones, one per row) and then aggregates locally to get the final result. The numbers you see are task counters.
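You can check this behavior yourself by printing the physical plan of an equivalent aggregation; a minimal sketch in spark-shell (assuming the same hello table from the question is registered):

```scala
// count is an action, so build the equivalent aggregation as a
// DataFrame and print its plan instead. In the physical plan you
// should see a full scan of the Cassandra source feeding a local
// HashAggregate -- i.e. the COUNT is not pushed down to Cassandra.
val df = spark.sql("SELECT test FROM hello")
df.groupBy().count().explain(true)
```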

There is an optimized cassandraCount operation on CassandraRDD:

sc.cassandraTable(keyspace, table).cassandraCount

More about server-side operations can be found in the documentation.
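Putting it together, a minimal sketch using the connector's RDD API (the keyspace name below is a placeholder, since the question does not give one):

```scala
import com.datastax.spark.connector._

val keyspace = "mykeyspace" // hypothetical; substitute your keyspace
val table = "hello"

// cassandraCount pushes the counting to the Cassandra side
// (roughly, a count per token range, summed in Spark), so the
// billion rows never travel over the network.
val n = sc.cassandraTable(keyspace, table).cassandraCount()
println(s"row count: $n")
```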

