Convert Nested dictionary to Pyspark Dataframe


Question

Greetings fellow programmers.

I have recently started with pyspark, coming from a pandas background. I need to compute the similarity of each user in my data against every other user. As I couldn't find a way to do this in pyspark, I resorted to using a python dictionary to create a similarity dataframe.

However, I have run out of ideas for converting a nested dictionary into a pyspark DataFrame. Could you please point me in a direction to achieve the desired result?

from pyspark.sql import SparkSession
from scipy.spatial import distance

spark = SparkSession.builder.getOrCreate()

traindf = spark.createDataFrame([
    ('u11',[1, 2, 3]),
    ('u12',[4, 5, 6]),
    ('u13',[7, 8, 9])
]).toDF("user","rating")

traindf.show()

Output

+----+---------+
|user|   rating|
+----+---------+
| u11|[1, 2, 3]|
| u12|[4, 5, 6]|
| u13|[7, 8, 9]|
+----+---------+

I want to generate the similarity between users and put it in a pyspark dataframe.

parent_dict = {}
# Compare every user's rating vector against every other user's
# (note: this collects the whole dataframe to the driver on each pass).
for parent_row in traindf.collect():
    child_dict = {}
    for child_row in traindf.collect():
        similarity = distance.cosine(parent_row['rating'], child_row['rating'])
        child_dict[child_row['user']] = similarity
    parent_dict[parent_row['user']] = child_dict

print(parent_dict)

Output:

{'u11': {'u11': 0.0, 'u12': 0.0253681538029239, 'u13': 0.0405880544333298},
 'u12': {'u11': 0.0253681538029239, 'u12': 0.0, 'u13': 0.001809107314273195},
 'u13': {'u11': 0.0405880544333298, 'u12': 0.001809107314273195, 'u13': 0.0}}

From this dictionary I want to construct a pyspark DataFrame.

+-----+-----+--------------------+
|user1|user2|          similarity|
+-----+-----+--------------------+
|  u11|  u11|                 0.0|
|  u11|  u12|  0.0253681538029239|
|  u11|  u13|  0.0405880544333298|
|  u12|  u11|  0.0253681538029239|
|  u12|  u12|                 0.0|
|  u12|  u13|0.001809107314273195|
|  u13|  u11|  0.0405880544333298|
|  u13|  u12|0.001809107314273195|
|  u13|  u13|                 0.0|
+-----+-----+--------------------+
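(One direct route, for reference: a nested dict in this shape can be flattened into tuples and handed straight to spark.createDataFrame, skipping pandas entirely; a minimal sketch, assuming the parent_dict built above and the spark session from the first snippet:)

# Flatten {user1: {user2: similarity}} into (user1, user2, similarity) rows.
rows = [
    (u1, u2, sim)
    for u1, inner in parent_dict.items()
    for u2, sim in inner.items()
]
sim_df = spark.createDataFrame(rows, ["user1", "user2", "similarity"])
sim_df.show()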

What I have tried so far is converting the dict to a pandas dataframe and then converting that to a pyspark dataframe. However, I need to do this at a huge scale, and I am looking for a more spark-ish way of doing it.

import pandas as pd

parent_user = []
child_user = []
child_similarity = []

# Build three parallel lists: one entry per (user1, user2) pair.
for parent_row in traindf.collect():
    for child_row in traindf.collect():
        similarity = distance.cosine(parent_row['rating'], child_row['rating'])
        child_user.append(child_row['user'])
        child_similarity.append(similarity)
        parent_user.append(parent_row['user'])

my_dict = {}
my_dict['user1'] = parent_user
my_dict['user2'] = child_user
my_dict['similarity'] = child_similarity

# Go through pandas, then hand the result to Spark.
df = spark.createDataFrame(pd.DataFrame(my_dict))
df.show()

Output:

+-----+-----+--------------------+
|user1|user2|          similarity|
+-----+-----+--------------------+
|  u11|  u11|                 0.0|
|  u11|  u12|  0.0253681538029239|
|  u11|  u13|  0.0405880544333298|
|  u12|  u11|  0.0253681538029239|
|  u12|  u12|                 0.0|
|  u12|  u13|0.001809107314273195|
|  u13|  u11|  0.0405880544333298|
|  u13|  u12|0.001809107314273195|
|  u13|  u13|                 0.0|
+-----+-----+--------------------+

Answer

Maybe you can do it like this:

import pandas as pd
from pyspark.sql import SparkSession

# Reuse (or create) the session rather than the deprecated SQLContext.
spark = SparkSession.builder.getOrCreate()

my_dic = {'u11': {'u11': 0.0, 'u12': 0.0253681538029239, 'u13': 0.0405880544333298},
          'u12': {'u11': 0.0253681538029239, 'u12': 0.0, 'u13': 0.001809107314273195},
          'u13': {'u11': 0.0405880544333298, 'u12': 0.001809107314273195, 'u13': 0.0}}

# unstack() turns the nested dict into a MultiIndex Series keyed by
# (user1, user2); reset_index() flattens it into three columns.
df = pd.DataFrame.from_dict(my_dic).unstack().to_frame().reset_index()
df.columns = ['user1', 'user2', 'similarity']
spark.createDataFrame(df).show()
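For the huge-scale case mentioned in the question, a more Spark-native sketch avoids collecting to the driver at all: cross-join traindf with itself and score each pair with a UDF. The cosine_udf and pairs names here are illustrative, and traindf and the scipy import are assumed from the question:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from scipy.spatial import distance

# Score a pair of rating vectors with scipy's cosine distance.
cosine_udf = F.udf(lambda a, b: float(distance.cosine(a, b)), DoubleType())

# Pair every user with every other user, then compute the score per pair.
pairs = (
    traindf.select(F.col("user").alias("user1"), F.col("rating").alias("r1"))
    .crossJoin(
        traindf.select(F.col("user").alias("user2"), F.col("rating").alias("r2"))
    )
    .withColumn("similarity", cosine_udf("r1", "r2"))
    .select("user1", "user2", "similarity")
)

pairs.show()

Note that a cross join grows quadratically with the number of users, so in practice you would typically filter the pairs (e.g. keep only user1 < user2) before scoring them.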

