Differences between null and NaN in Spark? How to deal with them?

Views: 288 | Published: 2020/5/16 20:50:23 | Tags: python apache-spark null pyspark nan


In my DataFrame, some columns contain null values and others contain NaN, for example:

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

Is there any difference between the two? How should each be handled?

null represents "no value" or "nothing"; it is not the same as an empty string or zero. It signals that no usable value is present.

NaN stands for "Not a Number". It is a floating-point value, usually the result of a mathematical operation that has no defined result, e.g. 0.0/0.0.
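The distinction can be illustrated outside Spark as well. The following is a minimal plain-Python sketch, assuming that Python's None stands in for Spark's null; the variable names are made up for illustration.

```python
import math

# An arithmetic result with no defined value yields NaN.
# (Plain Python raises ZeroDivisionError for 0.0 / 0.0, so inf - inf is used here.)
nan = float("inf") - float("inf")

# None stands in for Spark's null here: there is no value at all.
missing = None

nan_is_float = isinstance(nan, float)   # NaN is still a floating-point value
is_nan = math.isnan(nan)                # the reliable way to detect NaN
nan_equals_itself = (nan == nan)        # NaN compares unequal even to itself
none_is_float = isinstance(missing, float)  # None is not a number of any kind

print(nan_is_float, is_nan, nan_equals_itself, none_is_float)
```

Because NaN is a float value while null is the absence of a value, the two need different checks, which is why Spark offers separate isNull and isnan tests.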

One way to handle null values is to drop the rows that contain them:

df.na.drop()

Or you can replace them with an actual value (0 here):

df.na.fill(0)

Another option is to select the rows where a specific column is null (or not null) for further processing:

from pyspark.sql.functions import col
df.where(col("a").isNull())     # rows where a is null
df.where(col("a").isNotNull())  # rows where a has a value

Rows containing NaN can be selected in a similar way with isnan:

from pyspark.sql.functions import col, isnan
df.where(isnan(col("b")))  # column b holds the NaN in the example above
