Pyspark replace NaN with NULL


Problem Description

I use Spark to perform data transformations that I load into Redshift. Redshift does not support NaN values, so I need to replace all occurrences of NaN with NULL.

I tried something like this:

some_table = sql('SELECT * FROM some_table')
some_table = some_table.na.fill(None)

But I got the following error:

ValueError: value should be a float, int, long, string, bool or dict

So it seems like na.fill() doesn't support None. I specifically need to replace with NULL, not some other value, like 0.

Recommended Answer

I finally found the answer after Googling around a bit.

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

import pyspark.sql.functions as F

# For each column, emit NULL where the value is NaN, otherwise keep it unchanged.
for column in df.columns:
    df = df.withColumn(column, F.when(F.isnan(F.col(column)), None).otherwise(F.col(column)))

sqlContext.registerDataFrameAsTable(df, "df2")
sql('select * from df2').show()

+----+----+
|   a|   b|
+----+----+
|   1|null|
|null| 1.0|
+----+----+

It doesn't use na.fill(), but it accomplished the same result, so I'm happy.
