How to select columns using a dynamic select query with a window function
Problem description
I have a sample input dataframe as shown below. The number of value columns (the columns whose names start with m) is not fixed; there can be n of them.
customer_id|month_id|m1 |m2 |m3 .......m_n
1001 | 01 |10 |20
1002 | 01 |20 |30
1003 | 01 |30 |40
1001 | 02 |40 |50
1002 | 02 |50 |60
1003 | 02 |60 |70
1001 | 03 |70 |80
1002 | 03 |80 |90
1003 | 03 |90 |100
Now I have to create new columns holding the cumulative sum of each value column, accumulated month by month for each customer, so I used a window function. Because there will be n value columns, I do not want to call withColumn in a for loop. Instead, I need to build a query or a list of expressions dynamically and pass it to selectExpr (or select) to compute the new columns.
For example:
rownum_window = Window.partitionBy("customer_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0)
df = df.select("*", F.sum(F.col("m1")).over(rownum_window).alias("n1"))
However, I want to prepare the expressions dynamically and then pass them to the dataframe select. How can I do that?
Something like:
expr = ['F.sum(col("m1")).over(rownum_window).alias("n1")', 'F.sum(col("m2")).over(rownum_window).alias("n2")', 'F.sum(col("m3")).over(rownum_window).alias("n3")', .......]
df = df.select("*", expr)
Or is there any other way to build the select expression for the dataframe?
Expected output:
customer_id|month_id|m1 |m2 |n1 |n2
1001 | 01 |10 |20 |10 |20
1002 | 01 |20 |30 |20 |30
1003 | 01 |30 |40 |30 |40
1001 | 02 |40 |50 |50 |70
1002 | 02 |50 |60 |70 |90
1003 | 02 |60 |70 |90 |110
1001 | 03 |70 |80 |120 |150
1002 | 03 |80 |90 |150 |180
1003 | 03 |90 |100 |180 |210
Recommended answer
Updated:
import pyspark.sql.functions as F
from pyspark.sql import Window
rownum_window = Window.partitionBy("customer_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0)
expr = [F.sum(F.col("m1")).over(rownum_window).alias("n1"), F.sum(F.col("m2")).over(rownum_window).alias("n2")]
df.select('*', *expr) \
.orderBy('month_id', 'customer_id') \
.show(10, False)
+-----------+--------+---+---+---+---+
|customer_id|month_id|m1 |m2 |n1 |n2 |
+-----------+--------+---+---+---+---+
|1001 |1 |10 |20 |10 |20 |
|1002 |1 |20 |30 |20 |30 |
|1003 |1 |30 |40 |30 |40 |
|1001 |2 |40 |50 |50 |70 |
|1002 |2 |50 |60 |70 |90 |
|1003 |2 |60 |70 |90 |110|
|1001 |3 |70 |80 |120|150|
|1002 |3 |80 |90 |150|180|
|1003 |3 |90 |100|180|210|
+-----------+--------+---+---+---+---+
Try this:
expr = [F.sum(F.col("m1")).over(rownum_window).alias("n1"), F.sum(F.col("m2")).over(rownum_window).alias("n2"), ...]
df = df.select('*', *expr)
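The answer above still lists m1 and m2 by hand. Since the question asks for n columns and mentions selectExpr, here is a minimal sketch of generating the expressions from the column names. It is only an illustration under stated assumptions: the helper name running_sum_exprs is hypothetical, the value columns are assumed to be named m followed by digits (m1, m2, ...), and each output column is assumed to be n plus the same number. A plain "starts with m" test would not be enough, because month_id also starts with m.

```python
import re


def running_sum_exprs(columns, partition_col="customer_id", order_col="month_id"):
    """Build one SQL window expression string per value column.

    Hypothetical helper: assumes value columns are named m<digits>
    and that the matching output column should be n<digits>.
    """
    # Same frame as rangeBetween(Window.unboundedPreceding, 0)
    window = (
        f"PARTITION BY {partition_col} ORDER BY {order_col} "
        "RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW"
    )
    exprs = []
    for name in columns:
        match = re.fullmatch(r"m(\d+)", name)
        if match:  # skips customer_id and month_id (which also starts with "m")
            exprs.append(f"SUM({name}) OVER ({window}) AS n{match.group(1)}")
    return exprs


# Column names taken from the sample dataframe in the question
exprs = running_sum_exprs(["customer_id", "month_id", "m1", "m2"])
for e in exprs:
    print(e)
```

With a real dataframe, the strings could be passed as df.selectExpr("*", *running_sum_exprs(df.columns)). The same column filter should also work with the Column API used in the answer, by building F.sum(F.col(name)).over(rownum_window).alias(...) for each matching name in a list comprehension and unpacking the list into select.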