How to drop all columns with null values in a PySpark DataFrame?
Problem description
I have a large dataset from which I would like to drop the columns that contain null values and return a new dataframe. How can I do that?
The following snippets only drop a single column, or drop the rows containing null:
df.where(col("dt_mvmt").isNull())   # doesn't work: I don't have all the column names, or there are thousands of columns
df.filter(df.dt_mvmt.isNotNull())   # same reason as above
df.na.drop()                        # drops rows that contain null, instead of columns that contain null
For example:

a | b | c
1 |   | 0
2 | 2 | 3

In the above case it should drop the whole column b, because one of its values is empty.
Recommended answer
Here is one possible approach for dropping all columns that contain NULL values; see here for the source of the code that counts NULL values per column.
import pandas as pd
import pyspark.sql.functions as F

# Sample data
df = pd.DataFrame({'x1': ['a', '1', '2'],
                   'x2': ['b', None, '2'],
                   'x3': ['c', '0', '3']})
df = sqlContext.createDataFrame(df)
df.show()
def drop_null_columns(df):
    """
    This function drops all columns which contain null values.
    :param df: A PySpark DataFrame
    """
    null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
                             for c in df.columns]).collect()[0].asDict()
    to_drop = [k for k, v in null_counts.items() if v > 0]
    df = df.drop(*to_drop)
    return df

# Drops column x2, because it contains null values
drop_null_columns(df).show()
Before:
+---+----+---+
| x1| x2| x3|
+---+----+---+
| a| b| c|
| 1|null| 0|
| 2| 2| 3|
+---+----+---+
After:
+---+---+
| x1| x3|
+---+---+
| a| c|
| 1| 0|
| 2| 3|
+---+---+
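Once the per-column null counts have been collected, the drop decision itself is plain Python. As a minimal sketch of that filtering step alone, with a hard-coded dict standing in for the `collect()[0].asDict()` result from the function above:

```python
# Simulated result of df.select(...).collect()[0].asDict():
# a mapping from column name to its count of null values.
null_counts = {"x1": 0, "x2": 1, "x3": 0}

# Any column with at least one null value is marked for dropping.
to_drop = [col for col, n in null_counts.items() if n > 0]
print(to_drop)  # ['x2']
```

The list is then unpacked into `df.drop(*to_drop)`, which accepts any number of column-name arguments.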
Hope this helps!