Count the number of missing values in a Spark dataframe


Problem description

I have a dataset with missing values and I would like to get the number of missing values for each column. The following is what I did, but it gives me the number of non-missing values. How can I use it to get the number of missing values?

df.describe().filter($"summary" === "count").show

+-------+---+---+---+
|summary|  x|  y|  z|
+-------+---+---+---+
|  count|  1|  2|  3|
+-------+---+---+---+

Any help to get a dataframe listing each column and the number of missing values in it would be appreciated.

Thanks a lot.

Recommended answer

You can count the missing values by casting the boolean output of the isNull() method to integer type and summing it:

In Scala:

import org.apache.spark.sql.functions.{sum, col}
df.select(df.columns.map(c => sum(col(c).isNull.cast("int")).alias(c)): _*).show

In Python:

from pyspark.sql.functions import col, sum  # note: this shadows Python's built-in sum
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()
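Away from a Spark cluster, the per-column logic of this expression can be sketched in plain Python. The toy dataset below is hypothetical, with `None` playing the role of Spark's null; the dict comprehension mirrors what `sum(col(c).isNull().cast("int"))` computes for each column.

```python
# Toy dataset standing in for the Spark DataFrame (hypothetical values).
data = [
    {"x": 1,    "y": None, "z": 3},
    {"x": None, "y": 2,    "z": 4},
    {"x": None, "y": 5,    "z": 6},
]
columns = ["x", "y", "z"]

# Mirror of sum(col(c).isNull().cast("int")): each null contributes 1, each value 0.
missing = {c: sum(1 if row[c] is None else 0 for row in data) for c in columns}
print(missing)  # {'x': 2, 'y': 1, 'z': 0}
```

The trick is the same in both places: turn the null indicator into 0/1 and sum, which Spark evaluates as a single aggregation over all columns at once.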

Alternatively, you can use the output of df.describe().filter($"summary" === "count") and subtract each cell's count from the number of rows in the data:

In Scala:

import org.apache.spark.sql.functions.{lit, col}

val rows = df.count()
val summary = df.describe().filter($"summary" === "count")
summary.select(df.columns.map(c =>(lit(rows) - col(c)).alias(c)): _*).show

In Python:

from pyspark.sql.functions import col, lit

rows = df.count()
summary = df.describe().filter(col("summary") == "count")
summary.select(*((lit(rows)-col(c)).alias(c) for c in df.columns)).show()
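The subtraction in this second approach can be checked with a small plain-Python sketch (the toy data is hypothetical, matching the earlier example): describe()'s "count" row holds non-null counts per column, and missing = rows − count.

```python
# Toy dataset standing in for the Spark DataFrame (hypothetical values).
data = [
    {"x": 1,    "y": None, "z": 3},
    {"x": None, "y": 2,    "z": 4},
    {"x": None, "y": 5,    "z": 6},
]
columns = ["x", "y", "z"]
rows = len(data)  # mirrors df.count()

# Mirrors the "count" row of df.describe(): non-null entries per column.
non_null = {c: sum(1 for row in data if row[c] is not None) for c in columns}

# Mirrors (lit(rows) - col(c)) applied to that summary row.
missing = {c: rows - non_null[c] for c in columns}
print(missing)  # {'x': 2, 'y': 1, 'z': 0}
```

Both approaches give the same answer; the first does it in one aggregation pass, while this one needs a separate df.count() plus the describe() summary.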
