pyspark dynamic column computation


Problem description

Below is my Spark DataFrame:

a b c
1 3 4
2 0 0
4 1 0
2 2 0

My output should be as follows:

a b c
1 3 4
2 0 2
4 1 -1
2 2 3

The formula is prev(c) - b + a, i.e., 4 - 2 + 0 = 2 and 2 - 4 + 1 = -1.

Can anyone please help me get past this hurdle?
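
The two worked values in the question subtract `a` and add `b` (4 - 2 + 0 and 2 - 4 + 1), so one way to read the requirement is that each new `c` is the previously computed `c`, minus the current `a`, plus the current `b`. The plain-Python sketch below follows that reading only; the function name `running_c` is hypothetical, and the row order is assumed to be the order shown in the question.

```python
def running_c(rows):
    """Each row is (a, b, c). The first row keeps its own c; every later
    row is assumed to use prev_c - a + b, where prev_c is the value
    computed for the row before it."""
    out = []
    prev_c = None
    for a, b, c in rows:
        new_c = c if prev_c is None else prev_c - a + b
        out.append(new_c)
        prev_c = new_c
    return out


rows = [(1, 3, 4), (2, 0, 0), (4, 1, 0), (2, 2, 0)]
print(running_c(rows))  # [4, 2, -1, -1]
```

Under this reading the second and third rows match the worked values (2 and -1). The fourth row comes out as -1 rather than the 3 listed in the expected output, so the question's formula, its arithmetic, and its last expected value do not fully agree with one another.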

Solution

from pyspark.sql import SparkSession
from pyspark.sql.functions import lag
from pyspark.sql.window import Window

# Reuse the active session if there is one (e.g. in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

# Sample data with three integer columns
numbers = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [5, 6, 7]]
df = spark.createDataFrame(numbers, ['a', 'b', 'c'])
df.show()

# One unpartitioned window ordered by 'a'; lag('a') is the previous row's 'a'
w = Window.orderBy('a')
df = df.withColumn('result', lag('a').over(w) - df.b + df.c)
df.show()



+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  2|  3|  4|
|  3|  4|  5|
|  5|  6|  7|
+---+---+---+

+---+---+---+------+
|  a|  b|  c|result|
+---+---+---+------+
|  1|  2|  3|  null|
|  2|  3|  4|     2|
|  3|  4|  5|     3|
|  5|  6|  7|     4|
+---+---+---+------+
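
To make the `result` column above easier to verify by hand, here is a minimal plain-Python sketch of what `lag('a') - b + c` computes over rows ordered by `a`. It does not use Spark, and the function name `lag_a_minus_b_plus_c` is hypothetical, introduced only for illustration.

```python
def lag_a_minus_b_plus_c(rows):
    """Mimic lag('a') - b + c over rows sorted by 'a' (no Spark involved).
    The first row has no previous 'a', so its result is None, which
    corresponds to the null in the Spark output."""
    ordered = sorted(rows, key=lambda r: r[0])
    results = []
    prev_a = None
    for a, b, c in ordered:
        results.append(None if prev_a is None else prev_a - b + c)
        prev_a = a
    return results


numbers = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [5, 6, 7]]
print(lag_a_minus_b_plus_c(numbers))  # [None, 2, 3, 4]
```

Note that `lag` only reads the previous row's input value. A column whose value depends on its own previously computed value, as in the question, is not produced by `lag` alone. One possible direction, stated here as an assumption rather than as part of the answer, is a cumulative `sum` over a window bounded by `Window.unboundedPreceding` and `Window.currentRow`, ordered by an explicit row-order column.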
