Spark: replace null values in dataframe with mean of column

Views: 1308  Published: 2019/1/9 20:08:21  Tags: java sql scala apache-spark


Question

How can I create a UDF that programmatically replaces the null values in each column of a Spark dataframe with that column's mean? For instance, in the sample data the null value in col1 would be replaced with ((2+4+6+8+5)/5) = 5.

Sample data:

col1    col2    col3
2       null    3
4       3       3
6       5       null
8       null    2
null    6       4
5       2       8

Desired output:

col1    col2    col3
2       4       3
4       3       3
6       5       4
8       4       2
5       6       4
5       2       8
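
The desired values can be checked by hand with a small plain-Scala sketch that does not use Spark. It assumes each column is modeled as a Seq[Option[Int]] with None standing in for null; the names MeanFillCheck, meanOfDefined and fillWithMean are hypothetical and exist only for this illustration.

```scala
// Plain Scala, no Spark: None stands in for a null cell.
object MeanFillCheck {
  // Mean of the non-null values in a column.
  // Assumes the column has at least one non-null value.
  def meanOfDefined(column: Seq[Option[Int]]): Double = {
    val defined = column.flatten
    defined.sum.toDouble / defined.size
  }

  // Replace every None with the mean of the defined values.
  def fillWithMean(column: Seq[Option[Int]]): Seq[Double] = {
    val mean = meanOfDefined(column)
    column.map(_.map(_.toDouble).getOrElse(mean))
  }

  // The three columns from the sample data above.
  val col1: Seq[Option[Int]] = Seq(Some(2), Some(4), Some(6), Some(8), None, Some(5))
  val col2: Seq[Option[Int]] = Seq(None, Some(3), Some(5), None, Some(6), Some(2))
  val col3: Seq[Option[Int]] = Seq(Some(3), Some(3), None, Some(2), Some(4), Some(8))

  def main(args: Array[String]): Unit =
    Seq(col1, col2, col3).foreach(c => println(fillWithMean(c).mkString(", ")))
}
```

The means come out as 5 for col1, 4 for col2 and 4 for col3, which matches the desired output table.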

Recommended answer

Generally speaking, there is no need for a UDF here. All you really need is an aggregated table:

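import spark.implicits._  // for toDF; assumes a SparkSession named `spark`, as in spark-shell

val df = Seq(
  (Some(2), None, Some(3)), (Some(4), Some(3), Some(3)),
  (Some(6), Some(5), None), (Some(8), None, Some(2)),
  (None, Some(6), Some(4)), (Some(5), Some(2), Some(8))
).toDF("col1", "col2", "col3").alias("df")

// One-row DataFrame with columns avg(col1), avg(col2), avg(col3); avg skips nulls
val means = df.agg(df.columns.map(c => (c -> "avg")).toMap)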

The means can then be broadcast, cross joined with the original data, and coalesced:

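import org.apache.spark.sql.functions.{broadcast, coalesce, col}
// Keep each value, falling back to the column mean when it is null
val exprs = df.columns.map(c => coalesce(col(c), col(s"avg($c)")).alias(c))

// Explicit crossJoin (Spark 2.1+); a bare join with no condition can be rejected on Spark 2.x
df.crossJoin(broadcast(means)).select(exprs: _*)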
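
As an alternative to the join above (not the approach taken in this answer), the means can be collected to the driver and passed to na.fill. The following is a minimal sketch under stated assumptions: every column is numeric, every column has at least one non-null value, and the object name FillWithColumnMeans and the helper fillWithMeans are hypothetical. The columns are cast to double first because, to my understanding, na.fill casts the replacement value to the column's type, so a fractional mean would be truncated in an integer column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col}

object FillWithColumnMeans {
  // Hypothetical helper: replaces nulls in every column with that column's mean.
  // Assumes all columns are numeric and each has at least one non-null value.
  def fillWithMeans(df: DataFrame): DataFrame = {
    // Cast to double so a fractional mean is not truncated when filling integer columns.
    val asDouble = df.select(df.columns.map(c => col(c).cast("double").alias(c)): _*)

    // One-row aggregate with the mean of every column; avg skips nulls.
    val meansRow = asDouble.select(asDouble.columns.map(c => avg(col(c))): _*).first()

    // Column name -> mean, held on the driver.
    val fillValues: Map[String, Any] =
      asDouble.columns.zipWithIndex.map { case (c, i) => c -> meansRow.getDouble(i) }.toMap

    asDouble.na.fill(fillValues)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("fill-with-means").getOrCreate()
    import spark.implicits._

    val df = Seq[(Option[Int], Option[Int], Option[Int])](
      (Some(2), None, Some(3)), (Some(4), Some(3), Some(3)),
      (Some(6), Some(5), None), (Some(8), None, Some(2)),
      (None, Some(6), Some(4)), (Some(5), Some(2), Some(8))
    ).toDF("col1", "col2", "col3")

    fillWithMeans(df).show()
    spark.stop()
  }
}
```

This variant runs one extra job to bring the means to the driver, whereas the broadcast cross join keeps everything in a single query plan. Which one reads better is largely a matter of taste for a handful of columns.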
