Difference between na().drop() and filter(col.isNotNull) (Apache Spark)
Is there any difference in semantics between df.na().drop() and df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull() && !df.col("onlyColumnInOneColumnDataFrame").isNaN()), where df is an Apache Spark Dataframe?

Or shall I consider it a bug if the first one does NOT afterwards return null (not a String "null", but simply a null value) in the column onlyColumnInOneColumnDataFrame while the second one does?

EDIT: added !isNaN() as well. onlyColumnInOneColumnDataFrame is the only column in the given Dataframe. Let's say its type is Integer.
With df.na.drop() you drop the rows containing any null or NaN values.

With df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull()) you drop only those rows which have null in the column onlyColumnInOneColumnDataFrame; rows whose value is NaN still pass that filter.

If you want to achieve the same thing restricted to that one column, that would be df.na.drop(["onlyColumnInOneColumnDataFrame"]).