Count the number of missing values in a dataframe Spark
Question
I have a dataset with missing values, and I would like to get the number of missing values for each column. Here is what I did; it gives me the count of non-missing values. How can I use it to get the number of missing values?
df.describe().filter($"summary" === "count").show
+-------+---+---+---+
|summary| x| y| z|
+-------+---+---+---+
| count| 1| 2| 3|
+-------+---+---+---+
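For reference, a count row like the one above could come from a small dataset such as the following (a hypothetical example assumed here for illustration: 3 rows, where x, y and z contain 2, 1 and 0 nulls respectively):

import spark.implicits._

// Hypothetical sample data: Option values encode nullable columns,
// None becomes null in the resulting DataFrame
val df = Seq[(Option[Int], Option[Int], Option[Int])](
  (Some(1), Some(1), Some(1)),
  (None,    Some(2), Some(2)),
  (None,    None,    Some(3))
).toDF("x", "y", "z")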
I would appreciate any help to get a dataframe listing the columns and the number of missing values in each one.
Many thanks.
Answer
You could count the missing values by summing the boolean output of the isNull() method, after converting it to type integer:
In Scala:
import org.apache.spark.sql.functions.{sum, col}

// For each column, cast the isNull flag to 0/1 and sum it to count the nulls
df.select(df.columns.map(c => sum(col(c).isNull.cast("int")).alias(c)): _*).show
In Python:
from pyspark.sql.functions import col, sum

# For each column, cast the isNull() flag to 0/1 and sum it to count the nulls
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()
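On the hypothetical sample above, both versions would print something like:

+---+---+---+
|  x|  y|  z|
+---+---+---+
|  2|  1|  0|
+---+---+---+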
Alternatively, you could use the output of df.describe().filter($"summary" === "count") and subtract the number in each cell from the total number of rows in the data:
In Scala:
import org.apache.spark.sql.functions.{lit, col}

val rows = df.count()
val summary = df.describe().filter($"summary" === "count")

// Subtract each per-column non-missing count from the total row count
summary.select(df.columns.map(c => (lit(rows) - col(c)).alias(c)): _*).show
In Python:
from pyspark.sql.functions import col, lit

rows = df.count()
summary = df.describe().filter(col("summary") == "count")

# Subtract each per-column non-missing count from the total row count
summary.select(*((lit(rows) - col(c)).alias(c) for c in df.columns)).show()
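One caveat, based on how describe() behaves rather than anything stated in the original answer: describe() returns all of its statistics as strings, so Spark implicitly casts the counts to doubles during the subtraction and the missing-value counts come back as doubles. If you prefer integer output, you could cast explicitly, e.g. in Scala:

// Cast the string counts from describe() before subtracting
summary.select(df.columns.map(c => (lit(rows) - col(c).cast("long")).alias(c)): _*).show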