PySpark: Get first Non-null value of each column in dataframe
Question
I'm dealing with different Spark DataFrames, which have lots of null values in many columns. I want to get any one non-null value from each column, to see whether that value can be converted to a datetime.
I tried doing df.na.drop().first() in the hope that it would drop all rows containing any null value, and that from the remaining DataFrame I'd get the first row, with all non-null values. But many of the DataFrames have so many columns with lots of null values that df.na.drop() returns an empty DataFrame.
I also tried finding whether any column has all null values, so that I could simply drop it before trying the above approach, but that still didn't solve the problem. Any idea how I can accomplish this efficiently, given that this code will be run many times on huge DataFrames?
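In plain Python terms (a sketch over a list of row tuples, not PySpark), the behaviour I'm after would look like this:

```python
# Sample rows mirroring the DataFrame above: some values are None.
rows = [(None, "foo", "bar"), ("foo", None, "bar"), ("foo", "bar", None)]

def first_non_null_per_column(rows):
    """For each column index, return the first value that is not None
    (or None if the whole column is null)."""
    n_cols = len(rows[0])
    return [
        next((row[i] for row in rows if row[i] is not None), None)
        for i in range(n_cols)
    ]

print(first_non_null_per_column(rows))  # ['foo', 'foo', 'bar']
```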
Answer
You can use the first function with ignorenulls. Let's say the data looks like this:
from pyspark.sql.types import StringType, StructType, StructField

schema = StructType([
    StructField("x{}".format(i), StringType(), True) for i in range(3)
])

df = spark.createDataFrame(
    [(None, "foo", "bar"), ("foo", None, "bar"), ("foo", "bar", None)],
    schema
)
Then you can:
from pyspark.sql.functions import first
df.select([first(x, ignorenulls=True).alias(x) for x in df.columns]).first()
Row(x0='foo', x1='foo', x2='bar')
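Once you have one candidate value per column, the datetime check can be done in plain Python on the driver. A minimal sketch (the helper name and the date format are assumptions; adjust the format to your data):

```python
from datetime import datetime

def is_datetime_like(value, fmt="%Y-%m-%d"):
    """Return True if `value` is a string that parses under `fmt`."""
    if value is None:
        return False
    try:
        datetime.strptime(value, fmt)
        return True
    except (ValueError, TypeError):
        return False

print(is_datetime_like("2021-03-15"))  # True
print(is_datetime_like("foo"))         # False
```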