PySpark DataFrame from Python Dictionary without Pandas

Views: 40  Published: 2021/11/14 21:45:52  Tags: pyspark, pyspark-sql


Problem description

I am trying to convert the following Python dict into a PySpark DataFrame, but I am not getting the expected output.

dict_lst = {'letters': ['a', 'b', 'c'],
            'numbers': [10, 20, 30]}
df_dict = sc.parallelize([dict_lst]).toDF()  # Result not as expected
df_dict.show()

Is there a way to do this without using Pandas?

Recommended answer

Quoting myself:

I find it useful to think of the argument to createDataFrame() as a list of tuples, where each entry in the list corresponds to a row in the DataFrame and each element of the tuple corresponds to a column.
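As a minimal sketch of that mental model, the target shape for the question's data looks like this in plain Python. The variable names `rows` and `column_names` are illustrative only, and the commented-out Spark call assumes an active SparkSession bound to `spark`.

```python
# Each tuple is one row; each position inside a tuple is one column.
rows = [("a", 10), ("b", 20), ("c", 30)]
column_names = ["letters", "numbers"]

# Assumption: with an active SparkSession named `spark`, this list could be
# passed straight to createDataFrame():
# spark.createDataFrame(rows, column_names).show()

# Pair each row with the column names to see how the values line up.
for row in rows:
    print(dict(zip(column_names, row)))
```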

So the easiest approach is to convert your dictionary into this format, which you can do with zip():

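# Split the dict into a tuple of column names and a tuple of value lists.
column_names, data = zip(*dict_lst.items())
# Transpose the value lists into row tuples, then build the DataFrame.
spark.createDataFrame(zip(*data), column_names).show()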
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#|      a|     10|
#|      b|     20|
#|      c|     30|
#+-------+-------+
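To make the two zip() calls less opaque, the intermediate values can be inspected in plain Python without Spark. This is only an illustrative walk-through; it relies on dicts preserving insertion order (Python 3.7+), and `rows` is a name introduced here for the transposed result.

```python
dict_lst = {"letters": ["a", "b", "c"], "numbers": [10, 20, 30]}

# Step 1: unzip the (key, value) pairs into keys and value lists.
column_names, data = zip(*dict_lst.items())
print(column_names)  # ('letters', 'numbers')
print(data)          # (['a', 'b', 'c'], [10, 20, 30])

# Step 2: transpose the column lists so each tuple holds one row.
rows = list(zip(*data))
print(rows)          # [('a', 10), ('b', 20), ('c', 30)]
```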

The above assumes that all of the lists are the same length. If they are not, use itertools.zip_longest (Python 3) or itertools.izip_longest (Python 2) instead. It pads the shorter lists with None, which appears as null in the DataFrame.

from itertools import zip_longest  # use this for python3
# from itertools import izip_longest as zip_longest  # use this for python2

dict_lst = {'letters': ['a', 'b', 'c'],
            'numbers': [10, 20, 30, 40]}

column_names, data = zip(*dict_lst.items())

spark.createDataFrame(zip_longest(*data), column_names).show()
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#|      a|     10|
#|      b|     20|
#|      c|     30|
#|   null|     40|
#+-------+-------+
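If this conversion is needed in more than one place, the steps above could be wrapped in a small function. The sketch below is Python 3 only, and `dict_to_rows` is a hypothetical helper name, not part of PySpark. It returns the column names and the padded row tuples, which could then be passed to createDataFrame() given an active SparkSession named `spark`.

```python
from itertools import zip_longest


def dict_to_rows(columns):
    """Hypothetical helper: turn {column_name: [values]} into (names, rows).

    Shorter columns are padded with None (the zip_longest default fill value).
    """
    names = list(columns.keys())
    rows = list(zip_longest(*columns.values()))
    return names, rows


names, rows = dict_to_rows({"letters": ["a", "b", "c"],
                            "numbers": [10, 20, 30, 40]})
print(names)     # ['letters', 'numbers']
print(rows[-1])  # (None, 40)

# Assumption: with an active SparkSession named `spark`:
# spark.createDataFrame(rows, names).show()
```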
