pySpark/Python: iterate through dataframe columns, check for a condition and populate another column

Question

I am working with python/pySpark in Jupyter Notebook and I am trying to figure out the following:

I have a dataframe like this:

MainDate      Date1        Date2        Date3         Date4
2015-10-25    2015-09-25   2015-10-25   2015-11-25    2015-12-25
2012-07-16    2012-04-16   2012-05-16   2012-06-16    2012-07-16
2005-03-14    2005-07-14   2005-08-14   2005-09-14    2005-10-14

I need to compare MainDate with each of Date1 through Date4. If MainDate == Date#, a new column REAL should hold the name of that column (REAL = Date#); if there is no match, REAL = "None". All the columns are of date type. The real dataframe has Date1 through Date72, and there can be at most one match per row.

Expected result:

MainDate      Date1        Date2        Date3         Date4        REAL
2015-10-25    2015-09-25   2015-10-25   2015-11-25    2015-12-25   Date2
2012-07-16    2012-04-16   2012-05-16   2012-06-16    2012-07-16   Date4
2005-03-14    2005-07-14   2005-08-14   2005-09-14    2005-10-14   None

Thanks in advance.

Recommended answer

I would use coalesce:

from pyspark.sql.functions import col, when, coalesce, lit

# Sample data from the question (dates kept as strings for brevity)
df = spark.createDataFrame(
    [
        ("2015-10-25", "2015-09-25", "2015-10-25", "2015-11-25", "2015-12-25"),
        ("2012-07-16", "2012-04-16", "2012-05-16", "2012-06-16", "2012-07-16"),
        ("2005-03-14", "2005-07-14", "2005-08-14", "2005-09-14", "2005-10-14"),
    ],
    ("MainDate", "Date1", "Date2", "Date3", "Date4"),
)

# One when(...) per Date column (df.columns[1:] skips MainDate);
# coalesce picks the first non-NULL result for each row
df.withColumn("REAL",
    coalesce(*[when(col(c) == col("MainDate"), lit(c)) for c in df.columns[1:]])
).show()


+----------+----------+----------+----------+----------+-----+
|  MainDate|     Date1|     Date2|     Date3|     Date4| REAL|
+----------+----------+----------+----------+----------+-----+
|2015-10-25|2015-09-25|2015-10-25|2015-11-25|2015-12-25|Date2|
|2012-07-16|2012-04-16|2012-05-16|2012-06-16|2012-07-16|Date4|
|2005-03-14|2005-07-14|2005-08-14|2005-09-14|2005-10-14| null|
+----------+----------+----------+----------+----------+-----+

where

when(col(c) == col("MainDate"), lit(c))

returns column name (lit(c)) if there is a match, or NULL otherwise.
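To make the row-wise logic concrete, here is a minimal plain-Python sketch of what the when/coalesce combination computes for each row. It is only a model of the semantics, not Spark code: the helper name `first_match` and its `default` parameter are hypothetical and exist only for this illustration.

```python
from typing import Optional, Sequence

def first_match(row: dict, main_col: str, candidates: Sequence[str],
                default: Optional[str] = None) -> Optional[str]:
    """Return the name of the first candidate column equal to main_col.

    Hypothetical helper that models the per-row result only.
    Note: unlike Spark SQL, Python treats None == None as True;
    the sample data below has no missing values, so this does not matter here.
    """
    # Models when(col(c) == col(main_col), lit(c)): column name on a match, else None
    flags = [c if row[c] == row[main_col] else None for c in candidates]
    # Models coalesce(...): first non-None entry, or the default if there is none
    return next((f for f in flags if f is not None), default)

columns = ["MainDate", "Date1", "Date2", "Date3", "Date4"]
data = [
    ("2015-10-25", "2015-09-25", "2015-10-25", "2015-11-25", "2015-12-25"),
    ("2012-07-16", "2012-04-16", "2012-05-16", "2012-06-16", "2012-07-16"),
    ("2005-03-14", "2005-07-14", "2005-08-14", "2005-09-14", "2005-10-14"),
]
rows = [dict(zip(columns, values)) for values in data]
candidates = columns[1:]  # every column except MainDate

real = [first_match(r, "MainDate", candidates) for r in rows]
print(real)  # ['Date2', 'Date4', None]
```

Passing `default="None"` to the sketch yields the literal string the question asks for instead of a missing value. In Spark, the analogous change would presumably be to append lit("None") as the last argument to coalesce.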

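This should be much faster than a udf or a conversion to RDD, because the whole expression is built from native Spark SQL functions and no row data has to be passed through Python.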
