Convert Pyspark dataframe to dictionary

Question

I'm trying to convert a Pyspark dataframe into a dictionary.

Here is a sample CSV file -

Col0, Col1
-----------
A153534,BDBM40705
R440060,BDBM31728
P440245,BDBM50445050

I came up with this code -

from rdkit import Chem
from pyspark import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# Read the CSV with its header row so the columns are named Col0 and Col1
df = spark.read.csv("gs://my-bucket/my_file.csv", header=True)  # has two columns

# Creating a list of row dictionaries
to_list = map(lambda row: row.asDict(), df.collect())

# Creating the dictionary keyed by Col0
to_dict = {x['Col0']: x for x in to_list}

This creates a dictionary like below -

{'A153534': {'Col0': 'A153534', 'Col1': 'BDBM40705'}, 'R440060': {'Col0': 'R440060', 'Col1': 'BDBM31728'}, 'P440245': {'Col0': 'P440245', 'Col1': 'BDBM50445050'}}

But I want a dictionary like this -

{'A153534': 'BDBM40705'}, {'R440060': 'BDBM31728'}, {'P440245': 'BDBM50445050'}

How can I do that?

I tried the rdd solution by Yolo but I'm getting an error. Can you please tell me what I am doing wrong?

py4j.protocol.Py4JError: An error occurred while calling o80.isBarrier. Trace:
py4j.Py4JException: Method isBarrier([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Recommended answer

Here's a way of doing it using rdd:

df.rdd.map(lambda x: {x.Col0: x.Col1}).collect()

[{'A153534': 'BDBM40705'}, {'R440060': 'BDBM31728'}, {'P440245': 'BDBM50445050'}]
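
Since the rdd route raised a Py4J error in this environment (a "Method isBarrier([]) does not exist" error often points to a mismatch between the PySpark package and the Spark installation), here is a minimal alternative sketch that builds the same list of single-entry dictionaries without going through df.rdd. It assumes the dataframe was read with header=True, so the columns are named Col0 and Col1:

# Collect the rows to the driver and build one {Col0: Col1} dict per row.
# Assumes df was read with header=True (columns Col0 and Col1).
rows = df.select("Col0", "Col1").collect()
result = [{row["Col0"]: row["Col1"]} for row in rows]
print(result)
# [{'A153534': 'BDBM40705'}, {'R440060': 'BDBM31728'}, {'P440245': 'BDBM50445050'}]

Like the rdd version, this pulls all rows to the driver with collect(), so it is only suitable for dataframes small enough to fit in driver memory.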
