PySpark: Get first non-null value of each column in a DataFrame
Problem description
I'm dealing with different Spark DataFrames that have many null values across many columns. I want to get any one non-null value from each column, to see whether that value can be converted to a datetime.

I tried df.na.drop().first(), hoping it would drop all rows containing any null value, so that from the remaining DataFrame I'd simply get the first row, with all values non-null. But many of the DataFrames have so many columns with so many null values that df.na.drop() returns an empty DataFrame.

I also tried checking whether any column holds only null values, so that I could simply drop that column before trying the above approach, but that still didn't solve the problem. Any idea how I can accomplish this efficiently, given that this code will run many times on huge DataFrames?
You can use the first function with ignorenulls. Let's say the data looks like this:
from pyspark.sql.types import StringType, StructType, StructField

schema = StructType([
    StructField("x{}".format(i), StringType(), True) for i in range(3)
])

df = spark.createDataFrame(
    [(None, "foo", "bar"), ("foo", None, "bar"), ("foo", "bar", None)],
    schema
)
Then you can do:
from pyspark.sql.functions import first
df.select([first(x, ignorenulls=True).alias(x) for x in df.columns]).first()
Row(x0='foo', x1='foo', x2='bar')