Find the date when the value of a column changed


Problem description

I have a DataFrame A, like this:

+---+---+---+---+----------+
|key| c1| c2| c3|      date|
+---+---+---+---+----------+
| k1| -1|  0| -1|2015-04-28|
| k1|  1| -1|  1|2015-07-28|
| k1|  1|  1|  1|2015-10-28|
| k1|  1|  1| -1|2015-12-28|
| k2| -1|  0|  1|2015-04-28|
| k2| -1|  1| -1|2015-07-28|
| k2|  1| -1|  0|2015-10-28|
| k2|  1| -1|  1|2015-11-28|
+---+---+---+---+----------+

Code to create A:

data = [('k1', '-1', '0', '-1', '2015-04-28'),
    ('k1', '1', '-1', '1', '2015-07-28'),
    ('k1', '1', '1', '1', '2015-10-28'),
    ('k1', '1', '1', '-1', '2015-12-28'),
    ('k2', '-1', '0', '1', '2015-04-28'),
    ('k2', '-1', '1', '-1', '2015-07-28'),
    ('k2', '1', '-1', '0', '2015-10-28'),
    ('k2', '1', '-1', '1', '2015-11-28')]
A = spark.createDataFrame(data, ['key', 'c1', 'c2','c3','date'])
A = A.withColumn('date',A.date.cast('date'))

For each key, I want to get the date on which the value of column c3 changes for the first time (with rows ordered by date, ascending). The expected result is:

+---+---+----------+
|key| c3|      date|
+---+---+----------+
| k1|  1|2015-07-28|
| k2| -1|2015-07-28|
+---+---+----------+
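To make the expected semantics concrete, here is a minimal plain-Python sketch of the same rule, without Spark: within each key, order the rows by date and report the first row whose c3 differs from the row before it. The function name first_changes and the reduced (key, c3, date) tuples are assumptions made for illustration, not part of the original question.

```python
from itertools import groupby

# Sample rows from the question, reduced to (key, c3, date).
rows = [
    ('k1', '-1', '2015-04-28'),
    ('k1', '1', '2015-07-28'),
    ('k1', '1', '2015-10-28'),
    ('k1', '-1', '2015-12-28'),
    ('k2', '1', '2015-04-28'),
    ('k2', '-1', '2015-07-28'),
    ('k2', '0', '2015-10-28'),
    ('k2', '1', '2015-11-28'),
]


def first_changes(rows):
    """Map each key to (c3, date) of the first row whose c3 differs from the previous row."""
    result = {}
    # Sort by key, then date; ISO-formatted date strings sort chronologically.
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for key, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        # Pair every row with its predecessor within the same key.
        for prev, curr in zip(group, group[1:]):
            if curr[1] != prev[1]:
                result[key] = (curr[1], curr[2])
                break
    return result


print(first_changes(rows))
# {'k1': ('1', '2015-07-28'), 'k2': ('-1', '2015-07-28')}
```

A key whose c3 never changes simply does not appear in the result.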

Recommended answer

Here is my solution using a UDF.

import pyspark.sql.functions as func
from pyspark.sql.types import ArrayType, StringType

data = [('k1', '-1', '0', '-1','2015-04-28'),
        ('k1', '1', '-1', '1', '2015-07-28'),
        ('k1', '1', '1', '1', '2015-10-28'),
        ('k2', '-1', '0', '1', '2015-04-28'),
        ('k2', '-1', '1', '-1', '2015-07-28'),
        ('k2', '1', '-1', '0', '2015-10-28')]

# note that I didn't cast date type here
A = spark.createDataFrame(data, ['key', 'c1', 'c2','c3','date'])
A_group = A.select('key', 'c3', 'date').groupby('key')
A_agg = A_group.agg(func.collect_list(func.col('c3')).alias('c3'), 
                    func.collect_list(func.col('date')).alias('date_list'))

# UDF to return first change for given list
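def find_change(c3_list, date_list):
    """Return [c3, date] at the first change in c3_list, or None if it never changes."""
    # assumes both lists are already in date order
    for i in range(1, len(c3_list)):
        if c3_list[i] != c3_list[i-1]:
            return [c3_list[i], date_list[i]]
    return None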

udf_find_change = func.udf(find_change, returnType=ArrayType(StringType()))

# apply the UDF to the collected lists of each key
A_first_change = A_agg.select('key', udf_find_change(func.col('c3'), func.col('date_list')).alias('first_change'))

# split the returned array into columns; cast before alias so the column is named 'date'
A_first_change.select('key',
                      func.col('first_change').getItem(0).alias('c3'),
                      func.col('first_change').getItem(1).cast('date').alias('date')).show()

Output

+---+---+----------+
|key| c3|      date|
+---+---+----------+
| k2| -1|2015-07-28|
| k1|  1|2015-07-28|
+---+---+----------+
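One caveat with the approach above: Spark does not guarantee the order of the elements returned by collect_list after a shuffle, so the lists may not arrive sorted by date. Below is a minimal sketch of a more defensive helper that sorts before scanning. The name find_change_sorted is hypothetical, and the sketch assumes that the dates are ISO-formatted strings (as in the answer, where date is not cast) and that both lists were collected in the same row order, so that positions still correspond.

```python
def find_change_sorted(c3_list, date_list):
    """Return [c3, date] at the first change after sorting by date, or None."""
    # Pair each value with its date, then sort chronologically.
    pairs = sorted(zip(date_list, c3_list))
    # Compare every pair with its predecessor.
    for (_, prev_c3), (date, c3) in zip(pairs, pairs[1:]):
        if c3 != prev_c3:
            return [c3, date]
    return None


# Lists deliberately out of date order, as collect_list might return them.
print(find_change_sorted(['1', '-1', '1'],
                         ['2015-10-28', '2015-04-28', '2015-07-28']))
# ['1', '2015-07-28']
```

It could be registered the same way as the original helper, with func.udf(find_change_sorted, returnType=ArrayType(StringType())).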
