Convert an RDD of Tuples of Varying Sizes to a DataFrame in Spark


Problem description

I am having difficulty converting an RDD of the following structure to a DataFrame in Spark using Python.

df1 = [['usr1', ('itm1', 2), ('itm3', 3)], ['usr2', ('itm2', 3), ('itm3', 5), ('itm22', 6)]]

After converting, my DataFrame should look like the following:

       usr1  usr2
itm1    2.0   NaN
itm2    NaN   3.0
itm22   NaN   6.0
itm3    3.0   5.0

I was initially thinking of converting the above RDD structure to the following:

df1={'usr1': {'itm1': 2, 'itm3': 3}, 'usr2': {'itm2': 3, 'itm3': 5, 'itm22':6}}

I would then use Python's pandas module, pand = pd.DataFrame(dat2), and convert the pandas DataFrame back to a Spark DataFrame using spark_df = context.createDataFrame(pand). However, I believe that by doing this I am converting an RDD to a non-RDD object and then converting back to an RDD, which is not correct. Can someone please help me out with this problem?
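
For reference, a minimal sketch of the detour described above, assuming an active SparkSession named spark and that dat2 holds the nested dict from the next snippet (both names are placeholders, not part of the original question):

import pandas as pd

# Hypothetical nested dict, keyed by user and then by item
dat2 = {'usr1': {'itm1': 2, 'itm3': 3}, 'usr2': {'itm2': 3, 'itm3': 5, 'itm22': 6}}

# pandas builds the wide item-by-user table, filling the gaps with NaN
pand = pd.DataFrame(dat2)

# Move the item index into an ordinary column before handing the frame to Spark
spark_df = spark.createDataFrame(pand.reset_index().rename(columns={'index': 'item'}))

As the question suspects, this route collects everything onto the driver, so it only works while the data fits in local memory.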

Recommended answer

With data like this:

rdd = sc.parallelize([
    ['usr1',('itm1',2),('itm3',3)], ['usr2',('itm2',3), ('itm3',5),('itm22',6)]
])

Flatten the records:

def to_record(kvs):
    user, *vs = kvs  # For Python 2.x use standard indexing / slicing
    for item, value in vs:
        yield user, item, value

records = rdd.flatMap(to_record)

Convert to a DataFrame:

df = records.toDF(["user", "item", "value"])

Pivot:

result = df.groupBy("item").pivot("user").sum()

result.show()
## +-----+----+----+
## | item|usr1|usr2|
## +-----+----+----+
## | itm1|   2|null|
## | itm2|null|   3|
## | itm3|   3|   5|
## |itm22|null|   6|
## +-----+----+----+
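
If the set of users is known ahead of time, listing the pivot values explicitly spares Spark an extra pass over the data to discover them; a sketch under that assumption (usr1 and usr2 are the only users):

# Listing the pivot values up front avoids an extra job to compute the distinct users
result = df.groupBy("item").pivot("user", ["usr1", "usr2"]).sum("value")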

Note: Spark DataFrames are designed to handle long and relatively thin data. If you want to generate a wide contingency table, DataFrames won't be useful, especially if the data is dense and you want to keep a separate column per feature.
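
That said, if the pivoted result is small enough to collect to the driver, the wide layout from the question can be recovered with pandas; a minimal sketch:

# Collect the small pivoted table and index it by item; missing user/item pairs should appear as NaN
wide = result.toPandas().set_index("item").sort_index()
print(wide)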
