How to drop rows with too many NULL values?


Question

I want to do some preprocessing on my data, and I want to drop the rows that are sparse (relative to some threshold value).

For example, I have a DataFrame with 10 features, and a row with 8 null values; I want to drop that row.

I found some related topics, but I could not find any useful information for my purpose:

stackoverflow.com/questions/3473778/count-number-ofnulls

Examples like the one in the link above won't work for me, because I want to do this preprocessing automatically: I cannot write out the column names and handle each one accordingly.

So is there any way to do this delete operation without using the column names, in Apache Spark with Scala?

Answer

Test data:

case class Document(a: String, b: String, c: String)

val df = sc.parallelize(Seq(
  Document(null, null, null),
  Document("a", null, null),
  Document("a", "b", null),
  Document("a", "b", "c"),
  Document(null, null, "c")
)).toDF()

Using a UDF

Remixing the answer by David with my RDD version below, you can do it using a UDF that takes a Row:

def nullFilter = udf((x: Row) => Range(0, x.length).count(x.isNullAt(_)) < 2)
df.filter(nullFilter(struct(df.columns.map(df(_)): _*))).show
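The `< 2` threshold above is hard-coded. If, as in the question, you would rather drop rows by a null *fraction* (e.g. 8 nulls out of 10 features), you could derive the absolute threshold from the column count first. A minimal plain-Scala sketch, where `nullThreshold` and `maxNullFraction` are hypothetical names, not Spark API:

```scala
// Hypothetical helper: turn a maximum null fraction into an absolute
// null count, given the number of columns in the DataFrame.
def nullThreshold(numColumns: Int, maxNullFraction: Double): Int =
  math.ceil(numColumns * maxNullFraction).toInt

// For the question's example: 10 features, rows with 8 or more nulls
// should be dropped, i.e. keep rows with strictly fewer than 8 nulls.
val threshold = nullThreshold(10, 0.8)
```

You would then compare against `threshold` (e.g. `df.columns.length` for the column count) in place of the literal `2` inside the UDF.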

Using the RDD

You could turn it into an RDD, loop over the columns in each Row, and count how many are null:

sqlContext.createDataFrame(df.rdd.filter(x => Range(0, x.length).count(x.isNullAt(_)) < 2), df.schema).show
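Both versions share the same core predicate: count the null slots in a Row and keep the row only if the count is under the threshold. Stripped of Spark, that predicate can be sketched in plain Scala (`keepRow` is a hypothetical name used only for illustration):

```scala
// Keep a row only if it has strictly fewer than maxNulls null entries.
def keepRow(row: Seq[Any], maxNulls: Int): Boolean =
  row.count(_ == null) < maxNulls

// Mirrors the test data above as plain sequences.
val rows = Seq(
  Seq[Any](null, null, null), // 3 nulls -> dropped
  Seq[Any]("a", null, null),  // 2 nulls -> dropped
  Seq[Any]("a", "b", null),   // 1 null  -> kept
  Seq[Any]("a", "b", "c")     // 0 nulls -> kept
)
val kept = rows.filter(keepRow(_, 2))
```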
