Convert spark DataFrame column to python list
Question
I work on a dataframe with two columns, mvv and count.
+---+-----+
|mvv|count|
+---+-----+
| 1 |  5  |
| 2 |  9  |
| 3 |  3  |
| 4 |  1  |
+---+-----+
I would like to obtain two lists containing the mvv values and the count values, something like:
mvv = [1,2,3,4]
count = [5,9,3,1]
So I tried the following code: the first line should return a Python list of Rows. I wanted to see the first value:
mvv_list = mvv_count_df.select('mvv').collect()
firstvalue = mvv_list[0].getInt(0)
But I get an error message on the second line:

AttributeError: getInt
Answer
First, let's see why what you are doing does not work: you are trying to get an integer from a Row type, and the output of your collect looks like this:
>>> mvv_list = mvv_count_df.select('mvv').collect()
>>> mvv_list[0]
Out: Row(mvv=1)
If you take something like this:
>>> firstvalue = mvv_list[0].mvv
Out: 1
you will get the mvv value. If you want all the values of the column as a list, you can take something like this:
>>> mvv_array = [int(row.mvv) for row in mvv_list]
>>> mvv_array
Out: [1, 2, 3, 4]
But if you try the same for the other column, you get:
>>> mvv_count = [int(row.count) for row in mvv_list]
Out: TypeError: int() argument must be a string or a number, not 'builtin_function_or_method'
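This TypeError can be reproduced without Spark at all: a pyspark Row subclasses tuple, and count() is a built-in tuple method, so attribute lookup finds the method before any column value. A minimal stdlib sketch (no pyspark needed):

```python
# A plain tuple shows the same name collision that bites Row here:
# count is a built-in tuple method, so attribute lookup returns the
# method object instead of a column value.
t = (5, 9, 3, 1)

print(callable(t.count))  # True: t.count is a bound method, not a number

try:
    int(t.count)          # same failure mode as int(row.count)
except TypeError as err:
    print("TypeError:", err)
```

Calling the method, e.g. `t.count(9)`, is how tuple.count is meant to be used, which is why Python cannot simply hand back a column value for that name.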
This happens because count is a built-in method and the column has the same name. A workaround is to rename the count column to _count:
>>> mvv_list = mvv_count_df.selectExpr("mvv as mvv", "count as _count").collect()
>>> mvv_count = [int(row._count) for row in mvv_list]
But this workaround is not needed, because you can access the column using the dictionary syntax instead:
>>> mvv_list = mvv_count_df.collect()
>>> mvv_array = [int(row['mvv']) for row in mvv_list]
>>> mvv_count = [int(row['count']) for row in mvv_list]
And it finally works!
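The difference between attribute lookup and dictionary-style lookup can be sketched with a tiny Row-like class. FakeRow below is a hypothetical stand-in for pyspark.sql.Row (which also subclasses tuple), written only to illustrate the lookup order; it is a simulation, not Spark's actual implementation:

```python
class FakeRow(tuple):
    """Simplified imitation of pyspark.sql.Row's lookup behaviour (hypothetical)."""

    def __new__(cls, **kwargs):
        obj = super().__new__(cls, kwargs.values())
        obj._fields = list(kwargs)
        return obj

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails -- so it is
        # never reached for tuple methods like count(): that is the pitfall.
        try:
            return self[self._fields.index(name)]
        except ValueError:
            raise AttributeError(name)

    def __getitem__(self, item):
        # Dictionary-style access always goes through here, so a column
        # named "count" is reachable even though tuple.count exists.
        if isinstance(item, str):
            return tuple.__getitem__(self, self._fields.index(item))
        return tuple.__getitem__(self, item)


rows = [FakeRow(mvv=m, count=c) for m, c in [(1, 5), (2, 9), (3, 3), (4, 1)]]

print([r.mvv for r in rows])       # attribute access works: [1, 2, 3, 4]
print(callable(rows[0].count))     # True: tuple.count shadows the column
print([r['count'] for r in rows])  # dict-style access works: [5, 9, 3, 1]
```

The design point is that `__getattr__` is a fallback hook, while `__getitem__` is always consulted for subscripting, which is why `row['count']` sidesteps the name collision.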