Spark Scala row-wise average by handling null
Question
I have a dataframe with a high volume of data and "n" number of columns.
df_avg_calc: org.apache.spark.sql.DataFrame = [col1: double, col2: double ... 4 more fields]
+------------------+-----------------+------------------+-----------------+-----+-----+
| col1| col2| col3| col4| col5| col6|
+------------------+-----------------+------------------+-----------------+-----+-----+
| null| null| null| null| null| null|
| 14.0| 5.0| 73.0| null| null| null|
| null| null| 28.25| null| null| null|
| null| null| null| null| null| null|
|33.723333333333336|59.78999999999999|39.474999999999994|82.09666666666666|101.0|53.43|
| 26.25| null| null| 2.0| null| null|
| null| null| null| null| null| null|
| 54.46| 89.475| null| null| null| null|
| null| 12.39| null| null| null| null|
| null| 58.0| 19.45| 1.0| 1.33|158.0|
+------------------+-----------------+------------------+-----------------+-----+-----+
I need to perform a row-wise average, keeping in mind not to consider cells with "null" for averaging.
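For instance, the second row above should yield (14.0 + 5.0 + 73.0) / 3 ≈ 30.67, since its three null cells are ignored in both the numerator and the denominator.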
This needs to be implemented in Spark/Scala. I have tried to explain the same in the attached image.
What I have tried so far:
import org.apache.spark.sql.functions.{col, lit}

val cols = df_raw.schema.fieldNames.filter(f => f.contains("colname"))
// reduce(_ + _) propagates null: a single null column makes the whole sum null
val rowMeans = df_raw.select(cols.map(f => col(f)).reduce(_ + _) / lit(cols.length) as "row_mean")
Here df_raw contains the columns that need to be aggregated (row-wise, of course); there are more than 80 of them. They arbitrarily hold data and nulls, and the null cells need to be left out of the denominator when calculating the average. The expression above works fine when all the columns contain data, but even a single null in any column returns null for that row; a null-aware sketch follows below.
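One way around the null propagation (a minimal sketch, assuming the dataframe is named df_avg_calc as in the printout above and that every column should be averaged; the names avgCols, sumExpr, nonNullCnt, and row_mean are illustrative) is to coalesce each null cell to 0 in the numerator and count only the non-null cells in the denominator:

import org.apache.spark.sql.functions.{coalesce, col, lit, when}

val avgCols = df_avg_calc.columns
// numerator: a null cell contributes 0.0 to the row sum
val sumExpr = avgCols.map(c => coalesce(col(c), lit(0.0))).reduce(_ + _)
// denominator: count only the non-null cells in each row
val nonNullCnt = avgCols.map(c => when(col(c).isNotNull, 1).otherwise(0)).reduce(_ + _)
// when(...) without otherwise(...) leaves all-null rows at null instead of dividing by zero
val withMean = df_avg_calc.withColumn("row_mean", when(nonNullCnt > 0, sumExpr / nonNullCnt))

Leaving out otherwise(...) is deliberate: rows like the first one in the sample, where every cell is null, keep a null row_mean rather than 0.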
Edit: