Create single-row DataFrame from a list of lists in PySpark
Question
I have data like this: data = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]
I want to create a PySpark DataFrame from it.
I have used
dataframe = SQLContext.createDataFrame(data, ['features'])
but I always get
+--------+---+
|features| _2|
+--------+---+
| 1.1|1.2|
| 1.3|1.4|
| 1.5|1.6|
+--------+---+
How can I get a result like the one below?
+----------+
|features |
+----------+
|[1.1, 1.2]|
|[1.3, 1.4]|
|[1.5, 1.6]|
+----------+
Answer
I find it's useful to think of the argument to createDataFrame()
as a list of tuples where each entry in the list corresponds to a row in the DataFrame and each element of the tuple corresponds to a column.
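This also explains the output in the question: each inner list like [1.1, 1.2] is interpreted as one row, so its two elements are split into two columns. A minimal plain-Python sketch of that interpretation:

```python
data = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]

# Spark treats each inner list as a row; each of its elements
# becomes a separate column, which is why two columns appeared.
first_row = data[0]
print(len(first_row))  # 2 -> two columns in the resulting DataFrame
```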
You can get your desired output by making each element in the list a tuple:
data = [([1.1, 1.2],), ([1.3, 1.4],), ([1.5, 1.6],)]
dataframe = sqlCtx.createDataFrame(data, ['features'])
dataframe.show()
#+----------+
#| features|
#+----------+
#|[1.1, 1.2]|
#|[1.3, 1.4]|
#|[1.5, 1.6]|
#+----------+
Or if changing the source is cumbersome, you can equivalently do:
data = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]
dataframe = sqlCtx.createDataFrame(map(lambda x: (x, ), data), ['features'])
dataframe.show()
#+----------+
#| features|
#+----------+
#|[1.1, 1.2]|
#|[1.3, 1.4]|
#|[1.5, 1.6]|
#+----------+
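The tuple-wrapping step itself is plain Python and can be checked without a Spark session; a minimal sketch:

```python
data = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]

# Wrap each inner list in a one-element tuple so Spark treats the
# whole list as a single column value rather than splitting it.
rows = [(x,) for x in data]  # same result as map(lambda x: (x,), data)

print(rows)
# [([1.1, 1.2],), ([1.3, 1.4],), ([1.5, 1.6],)]
```

Either form gives createDataFrame() one tuple per row, with the full list as the sole column value.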