Spark: Get only columns that have one or more null values


Question

From a DataFrame, I want to get the names of the columns that contain at least one null value.

Consider the DataFrame below:

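// Option types make the nullable columns explicit; a bare null next to an Int
// would make Scala infer Any, for which Spark cannot derive a schema.
val dataset = sparkSession.createDataFrame(Seq[(Int, Option[String], Option[Int], Double)](
  (7, None, Some(18), 1.0),
  (8, Some("CA"), None, 0.0),
  (9, Some("NZ"), Some(15), 0.0)
)).toDF("id", "country", "hour", "clicked")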

I want to get the column names 'country' and 'hour'.

id  country hour    clicked
7   null    18      1
8   "CA"    null    0
9   "NZ"    15      0

Recommended answer

This is one solution, but it's a bit awkward; I hope there is an easier way:

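import org.apache.spark.sql.functions.{col, sum, when}

val cols = dataset.columns

val columnsToSelect = dataset
  // per column: add 1 for each null row and 0 otherwise, then test whether the sum is positive
  .select(cols.map(c => (sum(when(col(c).isNull, 1).otherwise(0)) > 0).alias(c)): _*)
  .head() // collect the single row produced by the aggregation
  .getValuesMap[Boolean](cols) // map each column name to its "has nulls" flag
  .filter { case (_, hasNulls) => hasNulls } // keep only the columns flagged true
  .keys.toSeq // and get the names of those columns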


dataset
  .select(columnsToSelect.head,columnsToSelect.tail:_*)
  .show()
+-------+----+
|country|hour|
+-------+----+
|   null|  18|
|     CA|null|
|     NZ|  15|
+-------+----+
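
A somewhat shorter variant of the same idea, as a minimal sketch rather than a definitive implementation: count the nulls per column in one aggregation, then filter the original column list so the result keeps the DataFrame's column order. The object name `NullColumnsSketch`, the helper `columnsWithNulls`, and the local `SparkSession` setup are assumptions made for illustration; they do not come from the original answer. The sketch also assumes plain top-level column names (no dots or backticks).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, when}

object NullColumnsSketch {

  // Hypothetical helper: names of all columns containing at least one null,
  // returned in the DataFrame's own column order.
  def columnsWithNulls(df: DataFrame): Seq[String] = {
    val cols = df.columns.toSeq
    if (cols.isEmpty) {
      Seq.empty
    } else {
      // when(...) without otherwise(...) yields null for non-matching rows,
      // and count(...) skips nulls, so each aggregate is that column's null count.
      val nullCounts = df
        .select(cols.map(c => count(when(col(c).isNull, 1)).alias(c)): _*)
        .head()
      cols.filter(c => nullCounts.getAs[Long](c) > 0L)
    }
  }

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only (an assumption, not part of the question).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    val dataset = Seq[(Int, Option[String], Option[Int], Double)](
      (7, None, Some(18), 1.0),
      (8, Some("CA"), None, 0.0),
      (9, Some("NZ"), Some(15), 0.0)
    ).toDF("id", "country", "hour", "clicked")

    val names = columnsWithNulls(dataset)
    println(names.mkString(", ")) // country, hour

    // Selecting via Column objects also works when no column has nulls.
    dataset.select(names.map(col): _*).show()

    spark.stop()
  }
}
```

Filtering `cols` directly, instead of going through a `Map`, avoids depending on map iteration order when the DataFrame has many columns.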
