Spark: Get only columns that have one or more null values
Question

From a dataframe I want to get the names of the columns which contain at least one null value.
Consider the following dataframe (the nullable columns use Option so the tuples typecheck):

val dataset = sparkSession.createDataFrame(Seq(
  (7, null.asInstanceOf[String], Some(18), 1.0),
  (8, "CA", None, 0.0),
  (9, "NZ", Some(15), 0.0)
)).toDF("id", "country", "hour", "clicked")
I want to get the column names 'country' and 'hour'.
id  country  hour  clicked
7   null     18    1.0
8   CA       null  0.0
9   NZ       15    0.0
Answer

Here is one solution, but it's a bit awkward; I hope there is an easier way:
val cols = dataset.columns

val columnsToSelect = dataset
  // per column: add 1 for every null value; the column contains nulls iff the sum is > 0
  .select(cols.map(c => (sum(when(col(c).isNull, 1).otherwise(0)) > 0).alias(c)): _*)
  .head()                      // collect the single row produced by the aggregation
  .getValuesMap[Boolean](cols) // Map(columnName -> hasNulls)
  .filter { case (_, hasNulls) => hasNulls }
  .keys.toSeq                  // names of the columns that contain nulls
dataset
  .select(columnsToSelect.head, columnsToSelect.tail: _*)
  .show()
+-------+----+
|country|hour|
+-------+----+
| null| 18|
| CA|null|
| NZ| 15|
+-------+----+
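The collection step at the end of the recipe (turning the row of Boolean flags into a list of column names) can be illustrated without a Spark session. This is a minimal sketch in plain Scala, where the hand-built hasNulls map is a hypothetical stand-in for what getValuesMap[Boolean](cols) returns for the example dataframe:

```scala
// Hypothetical stand-in for row.getValuesMap[Boolean](cols): one flag per
// column, true when the aggregation found at least one null in that column.
val hasNulls = Map(
  "id"      -> false,
  "country" -> true,
  "hour"    -> true,
  "clicked" -> false
)

// Keep only the columns whose flag is true, as in the answer above,
// then sort for a deterministic order (Map iteration order is unspecified).
val columnsToSelect = hasNulls.filter { case (_, nulls) => nulls }.keys.toSeq.sorted

println(columnsToSelect.mkString(", "))  // country, hour
```

The same filter/keys pattern works on the real Row result; the sort is only there to make the output reproducible.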