Convert list to dataframe and then join with different dataframe in pyspark


Problem description

I am working with pyspark dataframes.

I have a list of date type values:

date_list = ['2018-01-19', '2018-01-20', '2018-01-17']

Also I have a dataframe (mean_df) that has only one column (mean).

+----+
|mean|
+----+
|67  |
|78  |
|98  |
+----+

Now I want to convert date_list into a column and join with mean_df:

Expected output:

+------------+----+
|dates       |mean|
+------------+----+
|2018-01-19  |  67|
|2018-01-20  |  78|
|2018-01-17  |  98|
+------------+----+

I tried converting list to dataframe (date_df) :

date_df = spark.createDataFrame([(l,) for l in date_list], ['dates'])
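Assuming a typical Spark session, calling show() on the resulting date_df would then print something like:

date_df.show()
# +----------+
# |     dates|
# +----------+
# |2018-01-19|
# |2018-01-20|
# |2018-01-17|
# +----------+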

and then used monotonically_increasing_id() with new column name "idx" for both date_df and mean_df and used join :

date_df = mean_df.join(date_df, mean_df.idx == date_df.idx).drop("idx")
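Put together, that attempt presumably looked something like the sketch below (reconstructed from the description above; only the idx column name and the join line come from the question). Note that monotonically_increasing_id() only guarantees ids that increase within a single dataframe; it does not guarantee matching ids across two dataframes, which makes this kind of pairing fragile:

import pyspark.sql.functions as F

# Attach a synthetic row id to both dataframes
date_df = date_df.withColumn("idx", F.monotonically_increasing_id())
mean_df = mean_df.withColumn("idx", F.monotonically_increasing_id())

# Pair rows by id; the ids are not guaranteed to line up across dataframes
date_df = mean_df.join(date_df, mean_df.idx == date_df.idx).drop("idx")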

I get error of timeout exceeded so I changed default broadcastTimeout 300s to 6000s

spark.conf.set("spark.sql.broadcastTimeout", 6000)

But it did not work at all. Also I am working with a really small sample of data right now. The actual data is large enough.

Code snippet:

from functools import reduce

from pyspark.sql import DataFrame
from pyspark.sql.functions import col, mean as _mean

date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
mean_list = []

for d in date_list:
    # hypo_2 and h2_df are defined elsewhere in the asker's code
    h2_df1, h2_df2 = hypo_2(h2_df, d, 2)

    mean1 = h2_df1.select(_mean(col('count_before')).alias('mean_before'))

    mean_list.append(mean1)

mean_df = reduce(DataFrame.unionAll, mean_list)

Recommended answer

You can use withColumn and lit to add the date to the dataframe:

from functools import reduce

from pyspark.sql import DataFrame
import pyspark.sql.functions as F

date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
mean_list = []

for d in date_list:
    h2_df1, h2_df2 = hypo_2(h2_df, d, 2)

    # Stamp each per-date aggregate with its date so no join key is needed
    mean1 = h2_df1.select(F.mean(F.col('count_before')).alias('mean_before')).withColumn('date', F.lit(d))

    mean_list.append(mean1)

mean_df = reduce(DataFrame.unionAll, mean_list)
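Since hypo_2 and h2_df are not shown in the question, here is a self-contained toy version of the same lit-and-union pattern, reusing the sample dates and means from the question to show the shape of the result:

from functools import reduce

from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# One single-row aggregate per date, stamped with its date via lit()
frames = [
    spark.createDataFrame([(m,)], ['mean']).withColumn('dates', F.lit(d))
    for d, m in [('2018-01-19', 67), ('2018-01-20', 78), ('2018-01-17', 98)]
]

reduce(DataFrame.unionAll, frames).show()
# +----+----------+
# |mean|     dates|
# +----+----------+
# |  67|2018-01-19|
# |  78|2018-01-20|
# |  98|2018-01-17|
# +----+----------+

Because each per-date dataframe carries its own date column before the union, no join (and hence no broadcast, and no broadcastTimeout tuning) is needed.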
