Spark columnar performance
Question
I'm a relative beginner to Spark. I have a wide dataframe (1000 columns) to which I want to add columns based on whether the corresponding column has missing values.
So this:
+----+
| A |
+----+
| 1 |
+----+
|null|
+----+
| 3 |
+----+
becomes
+----+-------+
| A  | A_MIS |
+----+-------+
| 1  |   0   |
+----+-------+
|null|   1   |
+----+-------+
| 3  |   0   |
+----+-------+
This is part of a custom ML transformer, but the algorithm should be clear.
import org.apache.spark.sql.functions.{col, when}

override def transform(dataset: org.apache.spark.sql.Dataset[_]): org.apache.spark.sql.DataFrame = {
  var ds = dataset
  dataset.columns.foreach(c => {
    // Scan the column: only add an indicator if it actually contains nulls.
    if (dataset.filter(col(c).isNull).count() > 0) {
      ds = ds.withColumn(c + "_MIS", when(col(c).isNull, 1).otherwise(0))
    }
  })
  ds.toDF()
}
Loop over the columns; if a column has more than zero nulls, create a new indicator column for it.
The dataset passed in is cached (using the .cache method) and the relevant config settings are left at their defaults. This is running on a single laptop for now, and takes on the order of 40 minutes for the 1000 columns even with a minimal number of rows. I thought the problem was due to hitting a database, so I tried with a parquet file instead, with the same result. Looking at the jobs UI, it appears to be doing file scans in order to do the counts.
Is there a way I can improve this algorithm to get better performance, or tune the caching in some way? Increasing spark.sql.inMemoryColumnarStorage.batchSize just got me an OOM error.
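For context, the batch-size setting referred to above is normally applied when building the session, roughly as sketched here; the application name and the value 20000 are only illustrative placeholders, not recommendations:

import org.apache.spark.sql.SparkSession

// Illustrative only: where spark.sql.inMemoryColumnarStorage.batchSize is set.
val spark = SparkSession.builder()
  .appName("missing-indicators")          // placeholder name
  .master("local[*]")                     // single-laptop setup as in the question
  .config("spark.sql.inMemoryColumnarStorage.batchSize", "20000")  // placeholder value
  .getOrCreate()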
Answer
Remove the condition:
if (dataset.filter(col(c).isNull).count() > 0)
and leave only the inner expression. As the code is written, Spark has to perform a separate data scan for every column (#columns scans in total).
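A minimal sketch of that change, assuming the same transform signature as in the question (the indicator is now added unconditionally, so the per-column filter/count scans disappear):

import org.apache.spark.sql.functions.{col, when}

override def transform(dataset: org.apache.spark.sql.Dataset[_]): org.apache.spark.sql.DataFrame = {
  var ds = dataset.toDF()
  dataset.columns.foreach(c => {
    // No count() here: the indicator is added for every column without scanning the data.
    ds = ds.withColumn(c + "_MIS", when(col(c).isNull, 1).otherwise(0))
  })
  ds
}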
If you want to prune the columns, compute the statistics once, as outlined in Count number of non-NaN entries in each column of Spark dataframe with Pyspark, and use a single drop call.
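A sketch of that pruning approach, assuming a dataset value as in the question; the names allIndicators, nullCounts, toDrop and pruned are illustrative, and the single-pass aggregation idiom is adapted from the linked answer:

import org.apache.spark.sql.functions.{col, count, lit, when}

// Add an indicator for every column first (no per-column scans).
val allIndicators = dataset.columns.foldLeft(dataset.toDF()) { (df, c) =>
  df.withColumn(c + "_MIS", when(col(c).isNull, 1).otherwise(0))
}

// One aggregation over the data computes the null count of every original column.
val nullCounts = dataset
  .select(dataset.columns.map(c => count(when(col(c).isNull, lit(1))).alias(c)): _*)
  .head()

// Indicators for columns that contain no nulls are removed with a single drop call.
val toDrop = dataset.columns.filter(c => nullCounts.getAs[Long](c) == 0L).map(_ + "_MIS")
val pruned = allIndicators.drop(toDrop: _*)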