Extracting numpy array from Pyspark Dataframe
Question
I have a dataframe gi_man_df where group can be n:
+------------------+-----------------+--------+--------------+
| group | number|rand_int| rand_double|
+------------------+-----------------+--------+--------------+
| 'GI_MAN'| 7| 3| 124.2|
| 'GI_MAN'| 7| 10| 121.15|
| 'GI_MAN'| 7| 11| 129.0|
| 'GI_MAN'| 7| 12| 125.0|
| 'GI_MAN'| 7| 13| 125.0|
| 'GI_MAN'| 7| 21| 127.0|
| 'GI_MAN'| 7| 22| 126.0|
+------------------+-----------------+--------+--------------+
and I am expecting a numpy nd_array, i.e. gi_man_array:
[[[124.2],[121.15],[129.0],[125.0],[125.0],[127.0],[126.0]]]
where the rand_double values appear after applying a pivot.
I tried the following 2 approaches:
FIRST: I pivot the gi_man_df as follows:
gi_man_pivot = gi_man_df.groupBy("number").pivot('rand_int').sum("rand_double")
The output I get is:
Row(number=7, group=u'GI_MAN', 3=124.2, 10=121.15, 11=129.0, 12=125.0, 13=125.0, 21=127.0, 22=126.0)
but the problem here is that, to get the desired output, I can't convert it to a matrix and then convert again to a numpy array.
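One way the collected pivot row could be turned into the desired shape is sketched below in plain Python/numpy (this assumes the pivoted result has already been collected into a dict keyed by the rand_int values; the Spark steps themselves are not reproduced here):

```python
import numpy as np

# Simulated result of collecting the pivoted row: the pivot columns are
# the rand_int values, each holding the summed rand_double.
pivoted = {3: 124.2, 10: 121.15, 11: 129.0, 12: 125.0,
           13: 125.0, 21: 127.0, 22: 126.0}

# Order the values by their rand_int key, wrap each one in its own list,
# then add a leading axis so the result matches [[[...], [...], ...]].
gi_man_array = np.array([[pivoted[k]] for k in sorted(pivoted)])[np.newaxis, :]

print(gi_man_array.shape)  # (1, 7, 1)
```

Sorting by the key restores the original rand_int order, which the pivoted Row by itself does not guarantee.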
SECOND: I created the vector in the dataframe itself using:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["rand_double"], outputCol="rand_dbl_Vect")
gi_man_vector = assembler.transform(gi_man_df)
gi_man_vector.show(7)
I get the following output:
+----------------+-----------------+--------+--------------+--------------+
| group| number|rand_int| rand_double| rand_dbl_Vect|
+----------------+-----------------+--------+--------------+--------------+
| GI_MAN| 7| 3| 124.2| [124.2]|
| GI_MAN| 7| 10| 121.15| [121.15]|
| GI_MAN| 7| 11| 129.0| [129.0]|
| GI_MAN| 7| 12| 125.0| [125.0]|
| GI_MAN| 7| 13| 125.0| [125.0]|
| GI_MAN| 7| 21| 127.0| [127.0]|
| GI_MAN| 7| 22| 126.0| [126.0]|
+----------------+-----------------+--------+--------------+--------------+
but the problem here is that I can't pivot it on rand_dbl_Vect.
So my questions are:
1. Is either of the 2 approaches the correct way of achieving the desired output, and if so, how can I proceed further to get the desired result?
2. What other way can I proceed with so that the code is optimal and the performance is good?
Answer
This:
import numpy as np
np.array(gi_man_df.select('rand_double').collect())
produces
array([[ 124.2 ],
       [ 121.15],
       .........])
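Note that this gives a 2-D array of shape (n, 1); to get the 3-D gi_man_array asked for, one extra axis is needed. A minimal sketch of that last step, simulating the rows returned by collect() with plain tuples (no live Spark session is assumed here):

```python
import numpy as np

# Simulated output of gi_man_df.select('rand_double').collect():
# a list of one-element rows.
collected = [(124.2,), (121.15,), (129.0,), (125.0,),
             (125.0,), (127.0,), (126.0,)]

two_d = np.array(collected)          # shape (7, 1), as in the answer above
gi_man_array = two_d[np.newaxis, :]  # shape (1, 7, 1), the desired nd_array

print(gi_man_array.shape)  # (1, 7, 1)
```

Keep in mind that collect() pulls all rows to the driver, so this is only practical when the grouped data fits in driver memory.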