How to subtract a row value from the same column for the previous year in PySpark?


Problem description

I have input data like this, with year as a column. I want to subtract the previous year's rating from the current year's rating and store the result in a new column, rating_diff.

year,movie_name,language,rating  
2019,akash,english,10   
2019,antony,kannada,9   
2020,akash,english,10   
2020,antony,kannada,8

The result DataFrame I want:

year,movie_name,language,rating,rating_diff  
2019,akash,english,10,-  
2019,antony,kannada,9,-  
2020,akash,english,10,0        
2020,antony,kannada,8,-1

Any help would be appreciated, thank you.

Recommended answer

Since you want to compare against the previous year, year should be the order-by column. To keep each movie's rows together, movie_name and language should be the partition-by columns.

To compare each row's rating with the rating from the previous year, use the lag function over a window ordered by year in ascending order.

import pyspark.sql.functions as f
from pyspark.sql import Window

# One window per (movie_name, language), with rows ordered by year
w = Window.partitionBy('movie_name', 'language').orderBy('year')

# lag() returns the previous row's rating in the window (null for the first year);
# subtracting null yields null, so no explicit null check is needed
df.withColumn('prev_rating', f.lag('rating', 1).over(w)) \
  .withColumn('rating_diff', f.col('rating') - f.col('prev_rating')) \
  .show(10, False)

+----+----------+--------+------+-----------+-----------+
|year|movie_name|language|rating|prev_rating|rating_diff|
+----+----------+--------+------+-----------+-----------+
|2019|antony    |kannada |9     |null       |null       |
|2020|antony    |kannada |8     |9          |-1         |
|2019|akash     |english |10    |null       |null       |
|2020|akash     |english |10    |10         |0          |
+----+----------+--------+------+-----------+-----------+
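
To see what the window is doing without a Spark session, the same partition / order / lag logic can be sketched in plain Python. This is only an illustration of the semantics, not Spark code: the helper name add_rating_diff and the tuple layout are assumptions made for this example, and None stands in for Spark's null.

```python
from collections import defaultdict

# Sample rows from the question: (year, movie_name, language, rating)
rows = [
    (2019, "akash", "english", 10),
    (2019, "antony", "kannada", 9),
    (2020, "akash", "english", 10),
    (2020, "antony", "kannada", 8),
]


def add_rating_diff(rows):
    """Hypothetical helper: emulate lag('rating', 1) over a window
    partitioned by (movie_name, language) and ordered by year."""
    # "partitionBy": group rows by (movie_name, language)
    partitions = defaultdict(list)
    for row in rows:
        partitions[(row[1], row[2])].append(row)

    result = []
    for part in partitions.values():
        part.sort(key=lambda r: r[0])  # "orderBy('year')"
        prev_rating = None  # the first row of a partition has no previous row
        for year, movie, language, rating in part:
            diff = None if prev_rating is None else rating - prev_rating
            result.append((year, movie, language, rating, diff))
            prev_rating = rating
    return result


for row in add_rating_diff(rows):
    print(row)
```

Like lag, this sketch compares each row with the previous row in its partition. That is the previous calendar year only when no year is missing for a movie; if years can be skipped, the difference would span the gap.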
