在 DataFrame 中用 None/null 值替换空字符串 [英] Replace empty strings with None/null values in DataFrame

查看:174
本文介绍了在 DataFrame 中用 None/null 值替换空字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个 Spark1.5.0 DataFrame 在同一列中混合了 null 和空字符串.我想将所有列中的所有空字符串转换为 null (None,在 Python 中).DataFrame 可能有数百列,所以我试图避免对每一列进行硬编码操作.

I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I'm trying to avoid hard-coded manipulations of each column.

请参阅下面我的尝试,这会导致错误.

See my attempt below, which results in an error.

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

## Create a test DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2='')])
testDF.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |    |   2|
## |null|null|
## +----+----+

## Try to replace an empty string with None/null
testDF.replace('', None).show()
## ValueError: value should be a float, int, long, string, list, or tuple

## A string value of null (obviously) doesn't work...
testDF.replace('', 'null').na.drop(subset='col1').show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |null|   2|
## +----+----+

推荐答案

就是这么简单:

from pyspark.sql.functions import col, when

def blank_as_null(x):
    return when(col(x) != "", col(x)).otherwise(None)

dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))

dfWithEmptyReplaced.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |null|   2|
## |null|null|
## +----+----+

dfWithEmptyReplaced.na.drop().show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## +----+----+

如果你想填充多列,你可以例如减少:

If you want to fill multiple columns you can for example reduce:

to_convert = set([...]) # Some set of columns

reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)

或使用理解:

exprs = [
    blank_as_null(x).alias(x) if x in to_convert else x for x in testDF.columns]

testDF.select(*exprs)

如果您想专门对字符串字段进行操作,请通过 答案stackoverflow.com/users/1902732/robin-loxley">robin-loxley.

If you want to specifically operate on string fields please check the answer by robin-loxley.

这篇关于在 DataFrame 中用 None/null 值替换空字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆