How to select columns using a dynamic select query with a window function


Problem description

I have a sample input DataFrame like the one below, but the number of value columns (the ones whose names start with m) can be arbitrary, from m1 up to m_n.

customer_id|month_id|m1  |m2 |m3 .......m_n
1001       |  01    |10  |20    
1002       |  01    |20  |30    
1003       |  01    |30  |40
1001       |  02    |40  |50    
1002       |  02    |50  |60    
1003       |  02    |60  |70
1001       |  03    |70  |80    
1002       |  03    |80  |90    
1003       |  03    |90  |100

Now I have to create new columns that hold the cumulative sum of each value column over the months, so I used a window function. Because there will be n such columns, I don't want to call withColumn in a for loop. Instead, I need to build a query or a list of expressions dynamically and pass it to selectExpr to calculate the new columns.

For example:

rownum_window = (Window.partitionBy("customer_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0))
df = df.select("*", F.sum(F.col("m1")).over(rownum_window).alias("n1"))

However, I want to prepare the expressions dynamically and then pass them to the DataFrame select. How can I do that?

Something like:

expr = ['F.sum(col("m1")).over(rownum_window).alias("n1")', 'F.sum(col("m2")).over(rownum_window).alias("n2")', 'F.sum(col("m3")).over(rownum_window).alias("n3")', .......]
df = df.select("*", expr)

Or is there any other way to build the select expression for the DataFrame?

Expected output:

customer_id|month_id|m1     |m2    |n1   |n2  
1001       |  01    |10     |20    |10   |20  
1002       |  01    |20     |30    |20   |30  
1003       |  01    |30     |40    |30   |40  
1001       |  02    |40     |50    |50   |70  
1002       |  02    |50     |60    |70   |90
1003       |  02    |60     |70    |90   |110  
1001       |  03    |70     |80    |120  |150
1002       |  03    |80     |90    |150  |180
1003       |  03    |90     |100   |180  |210

Recommended answer

Updated:

import pyspark.sql.functions as F
from pyspark.sql import Window

rownum_window = Window.partitionBy("customer_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0)

expr = [F.sum(F.col("m1")).over(rownum_window).alias("n1"), F.sum(F.col("m2")).over(rownum_window).alias("n2")]
df.select('*', *expr) \
  .orderBy('month_id', 'customer_id') \
  .show(10, False)

+-----------+--------+---+---+---+---+
|customer_id|month_id|m1 |m2 |n1 |n2 |
+-----------+--------+---+---+---+---+
|1001       |1       |10 |20 |10 |20 |
|1002       |1       |20 |30 |20 |30 |
|1003       |1       |30 |40 |30 |40 |
|1001       |2       |40 |50 |50 |70 |
|1002       |2       |50 |60 |70 |90 |
|1003       |2       |60 |70 |90 |110|
|1001       |3       |70 |80 |120|150|
|1002       |3       |80 |90 |150|180|
|1003       |3       |90 |100|180|210|
+-----------+--------+---+---+---+---+


Try this.

expr = [F.sum(F.col("m1")).over(rownum_window).alias("n1"), F.sum(F.col("m2")).over(rownum_window).alias("n2"), ...]
df = df.select('*', *expr)
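
Since the question mentions selectExpr, here is a hedged sketch of the string-based variant: the same window is spelled out as a Spark SQL OVER clause, and one expression string is generated per column. The function name cumulative_sum_sql and its parameters are hypothetical, and the m<k> to n<k> naming rule is an assumption taken from the expected output.

```python
def cumulative_sum_sql(value_cols, partition_col="customer_id", order_col="month_id"):
    """Build one Spark SQL window expression string per value column."""
    # SQL spelling of the window used above: per customer, ordered by month,
    # from the first row of the partition up to the current row.
    over = (
        f"OVER (PARTITION BY {partition_col} ORDER BY {order_col} "
        "RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)"
    )
    # Assumption: output names swap the leading letter (m1 -> n1).
    return [f"SUM({c}) {over} AS n{c[1:]}" for c in value_cols]


# Example column names; in practice they would be derived from df.columns.
exprs = cumulative_sum_sql(["m1", "m2", "m3"])
for e in exprs:
    print(e)
```

The resulting strings can then be passed as df.selectExpr("*", *exprs). The Column-based list is usually easier to compose and check, while the string form is convenient when the expressions come from configuration.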
