Apache Spark taking 5 to 6 minutes for a simple count of 1 billion rows from Cassandra
Problem Description
I am using the Spark Cassandra connector. It takes 5-6 minutes to fetch data from a Cassandra table. In the Spark logs I see many tasks and executors. The reason might be that Spark divides the process into many tasks!
Here is my code sample:
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public static void main(String[] args) {
    SparkConf conf = new SparkConf(true)
            .setMaster("local[4]")
            .setAppName("App_Name")
            .set("spark.cassandra.connection.host", "127.0.0.1");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Map each row of the dev.demo table to a Demo_Bean instance
    JavaRDD<Demo_Bean> empRDD = javaFunctions(sc)
            .cassandraTable("dev", "demo", mapRowTo(Demo_Bean.class));
    System.out.println("Row Count: " + empRDD.count());
}
Recommended Answer
After searching on Google, I found an issue in the latest spark-cassandra-connector. The parameter spark.cassandra.input.split.size_in_mb has a default value of 64 MB, but it is being interpreted as 64 bytes in the code. So try setting:

spark.cassandra.input.split.size_in_mb = 64 * 1024 * 1024 = 67108864
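To see why a 64-byte split size would explode the task count, here is a rough back-of-the-envelope calculation (plain Java, no Spark required; the 10 GB table size is a made-up assumption for illustration):

```java
public class SplitMath {
    public static void main(String[] args) {
        // Hypothetical table size: 10 GB in bytes
        long tableSizeBytes = 10L * 1024 * 1024 * 1024;

        long intendedSplit = 64L * 1024 * 1024; // 64 MB: what the setting is meant to be
        long buggySplit = 64L;                  // 64 bytes: what the buggy code actually uses

        System.out.println("splits at 64 MB:    " + tableSizeBytes / intendedSplit); // 160
        System.out.println("splits at 64 bytes: " + tableSizeBytes / buggySplit);    // 167772160
    }
}
```

With the mis-read 64-byte splits, even a modest table is carved into hundreds of millions of input partitions, which matches the "many tasks" seen in the logs; passing the value in bytes (67108864) restores the intended 64 MB splits.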
Here is an example:
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public static void main(String[] args) {
    SparkConf conf = new SparkConf(true)
            .setMaster("local[4]")
            .setAppName("App_Name")
            .set("spark.cassandra.connection.host", "127.0.0.1")
            // Work around the bug: pass the 64 MB split size in bytes
            .set("spark.cassandra.input.split.size_in_mb", "67108864");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<Demo_Bean> empRDD = javaFunctions(sc)
            .cassandraTable("dev", "demo", mapRowTo(Demo_Bean.class));
    System.out.println("Row Count: " + empRDD.count());
}