How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?
Question
import numpy as np
# assumes an active SparkSession bound to `spark`; None becomes SQL NULL,
# while np.nan and float('nan') become NaN in the double column id2
df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', 'timestamp1', 'id2'))
Expected output: a dataframe with the count of NaN/null values for each column.
Note: The previous questions I found on Stack Overflow only check for null, not NaN. That's why I have created a new question.
I know I can use the isnull() function in Spark to find the number of null values in a Spark column, but how do I find NaN values in a Spark dataframe?
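For reference, counting nulls alone could look like this minimal sketch (using the df defined above; isNull() does not match NaN values stored in float columns):

from pyspark.sql.functions import col, count, when
# counts only NULLs per column; the NaN rows in id2 are not included
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

On the sample data this would report 2 for id2 (the two None rows), leaving the three NaN rows uncounted.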
Answer
You can use the method shown here and replace isNull with isnan:
from pyspark.sql.functions import isnan, when, count, col
# when() without otherwise() yields null for non-matches, and count() skips nulls
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 3|
+-------+----------+---+
Or, to count NaN and null together:
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 5|
+-------+----------+---+
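One caveat the original answer does not cover: isnan() expects a numeric column, so on schemas that include non-numeric types such as dates or timestamps the combined expression can raise an analysis error. A hedged sketch that applies the NaN check only to float/double columns, based on df.dtypes:

from pyspark.sql.functions import col, count, isnan, when
# apply isnan() only where it is defined (float/double);
# for every other column type count NULLs alone
df.select([
    count(when(isnan(c) | col(c).isNull(), c)).alias(c)
    if t in ('float', 'double')
    else count(when(col(c).isNull(), c)).alias(c)
    for c, t in df.dtypes
]).show()

On the sample dataframe this produces the same result as above, since id2 is the only double column.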