Apache Spark SQL is taking forever to count a billion rows from Cassandra?
Question
I have the following code. I invoke spark-shell as follows:
./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 --executor-memory 15G --executor-cores 12 --conf spark.cassandra.input.split.size_in_mb=67108864
Code:
scala> val df = spark.sql("SELECT test from hello") // Billion rows in hello and test column is 1KB
df: org.apache.spark.sql.DataFrame = [test: binary]
scala> df.count
[Stage 0:> (0 + 2) / 13] // I don't know what these numbers mean precisely.
If I invoke spark-shell as follows:
./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
Code:
val df = spark.sql("SELECT test from hello") // This has about billion rows
scala> df.count
[Stage 0:=> (686 + 2) / 24686] // What are these numbers precisely?
Neither of these versions worked. Spark keeps running, and I have waited more than 15 minutes with no response. Any ideas on what could be wrong and how to fix this?
I am using Spark 2.0.2 and spark-cassandra-connector_2.11-2.0.0-M3.jar.
Recommended answer
Dataset.count is slow because it is not very smart when it comes to external data sources. It rewrites the query as follows (which is good):
SELECT COUNT(1) FROM table
but instead of pushing COUNT down, it just executes:
SELECT 1 FROM table
against the source (it will fetch a billion ones in your case) and then aggregates locally to get the final result. The numbers you see are task counters: (completed + running) / total tasks for the stage.
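To make the task counters concrete, here is a minimal sketch in plain Scala (no Spark needed) that reads the two progress lines from the question as "(completed + running) / total" tasks. The names StageProgress and ProgressLine are made up for this illustration and are not part of Spark.

```scala
// Illustrative helper, not a Spark API: models one console progress line,
// assumed to have the layout "(completed + running) / total" tasks.
final case class StageProgress(completed: Int, running: Int, total: Int) {
  // Fraction of the stage's tasks that have finished, between 0 and 1.
  def fractionDone: Double =
    if (total == 0) 0.0 else completed.toDouble / total
}

object ProgressLine {
  // Unanchored so the pattern can match anywhere inside the full line.
  private val Counters = """\((\d+)\s*\+\s*(\d+)\)\s*/\s*(\d+)""".r.unanchored

  def parse(line: String): Option[StageProgress] = line match {
    case Counters(c, r, t) => Some(StageProgress(c.toInt, r.toInt, t.toInt))
    case _                 => None
  }
}

// The two progress lines shown in the question.
val first  = ProgressLine.parse("[Stage 0:> (0 + 2) / 13]")
val second = ProgressLine.parse("[Stage 0:=> (686 + 2) / 24686]")

first.foreach(p => println(f"${p.completed} of ${p.total} tasks done (${p.fractionDone * 100}%.1f%%)"))
second.foreach(p => println(f"${p.completed} of ${p.total} tasks done (${p.fractionDone * 100}%.1f%%)"))
```

Read this way, the first run had finished none of its 13 tasks, and the second had finished 686 of 24686 (roughly 2.8%), with 2 tasks running in both cases.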
There is an optimized cassandraCount operation on CassandraRDD:
sc.cassandraTable(keyspace, table).cassandraCount
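As a rough sketch of how that call might look in the same spark-shell session: the snippet below assumes the connector jar is on the classpath and that sc is the shell's SparkContext. The keyspace name my_keyspace is a placeholder, since the question names only the table hello.

```scala
// Run inside spark-shell started with the Cassandra connector on the classpath.
import com.datastax.spark.connector._

// "my_keyspace" is a placeholder: the question does not name the keyspace.
val keyspace = "my_keyspace"
val table = "hello"

// cassandraCount has Cassandra do the counting, so only the counts,
// not a billion rows, travel back to Spark.
val total: Long = sc.cassandraTable(keyspace, table).cassandraCount()
println(s"$keyspace.$table has $total rows")
```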
More about server-side operations can be found in the documentation.