Creating multiple PySpark dataframes from a single dataframe


Question

I need to dynamically create multiple dataframes in PySpark, one for each value in a Python list.

My dataframe (df) contains this data:

date        gender balance
2018-01-01   M     100
2018-02-01   F     100
2018-03-01   M     100

my_list = [2018-01-01, 2018-02-01, 2018-03-01]
for i in my_list:
  df_i = df.select("*").filter("date=i").limit(1000)

Could you please help?

Recommended answer

I am not sure whether dataframe names can be created dynamically in PySpark. In plain Python, variable names can only be created dynamically through workarounds such as writing into globals(), which is generally discouraged, so the same holds for dataframes.

One way is to create a dictionary of dataframes, where each key is a date and the corresponding value is the dataframe filtered to that date.

For plain Python, a similar question about creating variable names dynamically has been asked before, and the usual advice there is likewise to use a dictionary.
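To illustrate the dictionary idea without Spark, here is a minimal plain-Python sketch. The names `rows` and `rows_by_date` are made up for this example, and ordinary tuples stand in for dataframe rows.

```python
# Plain-Python sketch: keep one object per key in a dict instead of
# inventing variable names such as df_2018_01_01 at runtime.
rows = [
    ("2018-01-01", "M", 100),
    ("2018-02-01", "F", 100),
    ("2018-03-01", "M", 100),
]
my_list = ["2018-01-01", "2018-02-01", "2018-03-01"]

# Hypothetical container name; each key maps to the rows for that date.
rows_by_date = {d: [r for r in rows if r[0] == d] for d in my_list}

print(rows_by_date["2018-02-01"])  # prints [('2018-02-01', 'F', 100)]
```

Looking up `rows_by_date[some_date]` then plays the role that a dynamically named variable would have played.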

Here is a small PySpark implementation:

from pyspark.sql.functions import col
values = [('2018-01-01','M',100),('2018-02-01','F',100),('2018-03-01','M',100)]
# Assumes a `sqlContext` is already available, as in the PySpark shell.
df = sqlContext.createDataFrame(values,['date','gender','balance'])
df.show()
+----------+------+-------+
|      date|gender|balance|
+----------+------+-------+
|2018-01-01|     M|    100|
|2018-02-01|     F|    100|
|2018-03-01|     M|    100|
+----------+------+-------+

# Creating a dictionary to store the dataframes.
# Key: It contains the date from my_list.
# Value: Contains the corresponding dataframe.
dictionary_df = {}  

my_list = ['2018-01-01', '2018-02-01', '2018-03-01']
for i in my_list:
    dictionary_df[i] = df.filter(col('date')==i)

for i in my_list:
    print('DF: '+i)
    dictionary_df[i].show() 

DF: 2018-01-01
+----------+------+-------+
|      date|gender|balance|
+----------+------+-------+
|2018-01-01|     M|    100|
+----------+------+-------+

DF: 2018-02-01
+----------+------+-------+
|      date|gender|balance|
+----------+------+-------+
|2018-02-01|     F|    100|
+----------+------+-------+

DF: 2018-03-01
+----------+------+-------+
|      date|gender|balance|
+----------+------+-------+
|2018-03-01|     M|    100|
+----------+------+-------+

print(dictionary_df)
    {'2018-01-01': DataFrame[date: string, gender: string, balance: bigint], '2018-02-01': DataFrame[date: string, gender: string, balance: bigint], '2018-03-01': DataFrame[date: string, gender: string, balance: bigint]}
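As a side note on the loop in the question: the string "date=i" is handed to Spark verbatim, so `i` is read as part of the SQL expression rather than as the Python loop variable. If the string form of `filter` is preferred over `col('date')==i`, the value has to be interpolated into the expression first. A minimal sketch of building those expression strings (the name `filter_exprs` is made up for this example):

```python
my_list = ["2018-01-01", "2018-02-01", "2018-03-01"]

# Interpolate each date into the SQL expression as a quoted string literal.
filter_exprs = ["date = '{}'".format(d) for d in my_list]

for expr in filter_exprs:
    print(expr)
```

Each of these strings could then be passed to `df.filter(...)` in place of the column comparison used above. For values that come from untrusted input, the `col('date')==i` form is the safer choice, since it avoids assembling SQL text by hand.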
