How to convert a pyspark dataframe column to numpy array
Question
I am trying to convert a pyspark dataframe column having approximately 90 million rows into a numpy array.

I need the array as an input for the scipy.optimize.minimize function.
I have tried both converting to Pandas and using collect(), but these methods are very time-consuming.

I am new to PySpark. If there is a faster and better approach to do this, please help.
Thanks.

This is what my dataframe looks like:
+----------+
|Adolescent|
+----------+
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
+----------+
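Since the question asks for the array as input to scipy.optimize.minimize, here is a minimal sketch of that final step. The objective function (mean squared deviation from a scalar) and the zero-filled stand-in for the collected "Adolescent" column are made up for illustration; they are not part of the original question.

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in for the 1-D numpy array obtained from the Spark column
data = np.zeros(10)

def objective(x, obs):
    # Hypothetical objective: mean squared deviation of obs from scalar x
    return np.mean((obs - x) ** 2)

result = minimize(objective, x0=np.array([1.0]), args=(data,))
print(result.x)  # minimizer should be close to the mean of data (0.0)
```

The point is simply that once the column is a plain 1-D numpy array, it plugs into scipy like any other array.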
Answers
#1

You will have to call .collect() one way or another. To create a numpy array from the pyspark dataframe, you can use:
adoles = np.array(df.select("Adolescent").collect()) #.reshape(-1) for 1-D array
#2

You can convert it to a pandas dataframe using toPandas(), and then convert that to a numpy array using .values.
pdf = df.toPandas()
adoles = pdf["Adolescent"].values
Or simply:
adoles = df.select("Adolescent").toPandas().values #.reshape(-1) for 1-D array
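Since toPandas() was reported as slow, one commonly used speedup worth trying is enabling Arrow-based conversion before calling it. This is a configuration sketch, not something tested on the asker's data; the config key below is the standard setting for PySpark 3.x (older versions used spark.sql.execution.arrow.enabled), and `spark`/`df` are assumed to be the existing session and dataframe.

```python
# Sketch: Arrow can substantially speed up Spark -> pandas conversion
# for large frames (PySpark >= 3.0 config key shown).
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pdf = df.select("Adolescent").toPandas()
adoles = pdf["Adolescent"].values  # .reshape(-1) is unnecessary; already 1-D
```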
For distributed arrays, you can try Dask Arrays.

I haven't tested this, but assuming it works the same as numpy (there might be inconsistencies):
import dask.array as da
adoles = da.array(df.select("Adolescent").collect()) #.reshape(-1) for 1-D array