Differences between null and NaN in Spark? How to deal with them?


Problem description

In my DataFrame, there are columns containing null and NaN values respectively, for example:

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

Are there any differences between the two? How should each be handled?

Recommended answer

A null value represents "no value" or "nothing"; it is not even an empty string or zero. It can be used to indicate that nothing useful exists.

NaN stands for "Not a Number"; it is usually the result of a mathematical operation that is undefined, e.g. 0.0/0.0.
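As a quick illustration (a minimal sketch, reusing the spark session from the question), dividing the double literal 0.0 by 0.0 yields NaN rather than null, since Spark follows IEEE 754 semantics for DoubleType:

from pyspark.sql.functions import lit

spark.range(1).select((lit(0.0) / lit(0.0)).alias("q")).show()

+---+
|  q|
+---+
|NaN|
+---+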

One possible way to handle null values is to remove them:

df.na.drop()
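Applied to the example DataFrame this leaves no rows at all, because the underlying Spark drop removes rows containing null or NaN values, and each row here contains one or the other. To drop rows based on specific columns only, drop also accepts a subset, e.g. df.na.drop(subset=["a"]).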

Or you can replace them with an actual value (here I used 0):

df.na.fill(0)
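Here, too, both kinds of missing value are replaced: the null in the long column a becomes 0 and the NaN in the double column b becomes 0.0. As with drop, fill accepts a subset to restrict the replacement to certain columns, e.g. df.na.fill(0, subset=["b"]).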

Another approach is to select the rows where a specific column is null (or not null) for further processing:

df.where(col("a").isNull())
df.where(col("a").isNotNull())

Rows with NaN can be selected with the equivalent isnan function. Note that in the example DataFrame the NaN sits in column b (column a holds the null), so that is the column to test:

from pyspark.sql.functions import isnan

df.where(isnan(col("b")))

