Pyspark: Concat function generated columns into new dataframe


Problem Description


I have a pyspark dataframe (df) with n columns, and I would like to generate another df of n columns, where each column records the percentage difference between consecutive rows in the corresponding column of the original df. The column headers in the new df should be the corresponding column header in the old dataframe + "_diff". With the following code I can generate the new columns of percentage changes for each column in the original df, but I am not able to stick them in a new df with suitable column headers:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as func

spark = (SparkSession
            .builder
            .appName('pct_change')
            .enableHiveSupport()
            .getOrCreate())

df = spark.createDataFrame([(1, 10, 11, 12), (2, 20, 22, 24), (3, 30, 33, 36)], 
                       ["index", "col1", "col2", "col3"])
w = Window.orderBy("index")

for i in range(1, len(df.columns)):
    # builds the log-difference expression, but it is overwritten on each
    # iteration and never attached to a dataframe
    col_pctChange = func.log(df[df.columns[i]]) - func.log(func.lag(df[df.columns[i]]).over(w))

Thanks

Answer



In this case, you can do a list comprehension inside of a call to select.


To make the code a little more compact, we can first get the columns we want to diff in a list:

diff_columns = [c for c in df.columns if c != 'index']


Next select the index and iterate over diff_columns to compute the new column. Use .alias() to rename the resulting column:

df_diff = df.select(
    'index',
    *[(func.log(func.col(c)) - func.log(func.lag(func.col(c)).over(w))).alias(c + "_diff")
      for c in diff_columns]
)
df_diff.show()
#+-----+------------------+-------------------+-------------------+
#|index|         col1_diff|          col2_diff|          col3_diff|
#+-----+------------------+-------------------+-------------------+
#|    1|              null|               null|               null|
#|    2| 0.693147180559945| 0.6931471805599454| 0.6931471805599454|
#|    3|0.4054651081081646|0.40546510810816416|0.40546510810816416|
#+-----+------------------+-------------------+-------------------+
