How to merge several dataframes column-wise in pyspark?


Question

I have around 25 tables, each with three columns (id, date, value). I need to take the value column from each of them, joining on the id and date columns, to build one merged table. So far I join them pairwise:

# current approach: pairwise join on explicit equality conditions,
# keeping df_1's columns plus the value column from df_2
df_1 = df_1.join(
    df_2,
    on=(df_1.id == df_2.id) & (df_1.date == df_2.date),
    how="inner"
).select([df_1["*"], df_2["value1"]]).dropDuplicates()
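Joining on a list of column names instead of an explicit equality condition keeps a single copy of id and date in the result, which makes the select and dropDuplicates steps unnecessary. A minimal sketch of the same pairwise join, assuming df_2's value column is already named value1:

# join keyed by column names: Spark keeps one copy of id and date
df_1 = df_1.join(df_2, on=["id", "date"], how="inner")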

Is there an optimised way in PySpark to generate this merged table with these 25 value columns plus the id and date columns?

Thanks in advance.

Answer

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# three toy tables sharing the (id, date, value) schema
df_1 = spark.createDataFrame([[1, '2018-10-10', 3]], ['id', 'date', 'value'])
df_2 = spark.createDataFrame([[1, '2018-10-10', 3], [2, '2018-10-10', 4]], ['id', 'date', 'value'])
df_3 = spark.createDataFrame([[1, '2018-10-10', 3], [2, '2018-10-10', 4]], ['id', 'date', 'value'])

from functools import reduce

# list of data frames / tables
dfs = [df_1, df_2, df_3]

# rename value column
dfs_renamed = [df.selectExpr('id', 'date', f'value as value_{i}') for i, df in enumerate(dfs)]

# reduce the list of data frames with inner join
reduce(lambda x, y: x.join(y, ['id', 'date'], how='inner'), dfs_renamed).show()
+---+----------+-------+-------+-------+
| id|      date|value_0|value_1|value_2|
+---+----------+-------+-------+-------+
|  1|2018-10-10|      3|      3|      3|
+---+----------+-------+-------+-------+
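
Note that the inner join keeps only the (id, date) pairs present in every table; df_1 above has a single row, so the merged result has a single row. If some ids may be missing from individual tables, the same fold with a full outer join keeps them, leaving nulls in the value columns of the tables they are absent from (a sketch, assuming that is the behaviour you want):

# same reduce, but a full outer join preserves (id, date) pairs
# that appear in only some of the tables
reduce(lambda x, y: x.join(y, ['id', 'date'], how='outer'), dfs_renamed).show()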
