Spark job execution time increases exponentially with a very wide dataset and number of columns

Problem Description

I have created a fixed-width file import parser in Spark and performed a few execution tests on various datasets. It works fine up to 1,000 columns, but as the number of columns and the fixed-width record length increase, Spark job performance degrades rapidly. Execution takes a very long time at 20k columns with a fixed-width length of more than 100,000.

What are the possible reasons for this? How can I improve the performance?

One of the similar issues I found:

Recommended Answer

If you have a large number of columns, it is better to read/convert each record into an array and use the slice function to map it to individual columns. Using substring to extract individual columns will not be that efficient.

I used an Array[String] as an example by attaching it to a case class Record() in Scala. You can extend it to HDFS text files.

scala> case class Record(a1:String,a2:Int,a3:java.time.LocalDate)
defined class Record

scala>  val x = sc.parallelize(Array("abcd1232018-01-01","defg4562018-02-01"))
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> val y = x.map( a => Record( a.slice(0,4), a.slice(4,4+3).toInt,java.time.LocalDate.parse(a.slice(7,7+10))))
y: org.apache.spark.rdd.RDD[Record] = MapPartitionsRDD[4] at map at <console>:27

scala> y.collect()
res3: Array[Record] = Array(Record(abcd,123,2018-01-01), Record(defg,456,2018-02-01))

scala>
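The "extend it to HDFS text files" step is not shown above, so here is a minimal sketch of one way to do it. The path hdfs:///data/fixed_width.txt and the (name, start, length) layout list are illustrative assumptions, not taken from the original post. Driving the slicing from a layout table also means a very wide schema (for example 20k columns) can be generated programmatically instead of hard-coding every offset.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("FixedWidthSlice").getOrCreate()

// Hypothetical fixed-width layout: (column name, start offset, length).
val layout = Seq(("code", 0, 4), ("qty", 4, 3), ("day", 7, 10))

// Build a string schema once; individual columns can be cast afterwards.
val schema = StructType(layout.map { case (name, _, _) => StructField(name, StringType) })

// Read the raw fixed-width lines from HDFS (the path is an assumption).
val raw = spark.sparkContext.textFile("hdfs:///data/fixed_width.txt")

// Slice each record once into all of its columns, driven by the layout table,
// instead of calling substring separately for every column.
val rows = raw.map { line =>
  Row.fromSeq(layout.map { case (_, start, len) => line.slice(start, start + len) })
}

val df = spark.createDataFrame(rows, schema)
df.show()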
