How to convert a pyspark dataframe column to numpy array
Question
I am trying to convert a pyspark dataframe column having approximately 90 million rows into a numpy array.

I need the array as an input for the scipy.optimize.minimize function.
I have tried both converting to Pandas and using collect(), but these methods are very time-consuming.

I am new to PySpark. If there is a faster and better approach to do this, please help.
Thanks.

This is what my dataframe looks like:
+----------+
|Adolescent|
+----------+
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
+----------+
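Since the question asks for the array as input to scipy.optimize.minimize, here is a minimal sketch of that final step. The objective function (mean squared deviation from a scalar) and the zero-filled stand-in for the collected "Adolescent" column are made up for illustration; they are not part of the original question.

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in for the 1-D numpy array obtained from the Spark column
data = np.zeros(10)

def objective(x, obs):
    # Hypothetical objective: mean squared deviation of obs from scalar x
    return np.mean((obs - x) ** 2)

result = minimize(objective, x0=np.array([1.0]), args=(data,))
print(result.x)  # minimizer should be close to the mean of data (0.0)
```

The point is simply that once the column is a plain 1-D numpy array, it plugs into scipy like any other array.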
Answers
#1

You will have to call .collect() one way or another. To create a numpy array from the pyspark dataframe, you can use:
adoles = np.array(df.select("Adolescent").collect()) #.reshape(-1) for 1-D array
#2

You can convert it to a pandas dataframe using toPandas(), and then convert that to a numpy array using .values.
pdf = df.toPandas()
adoles = pdf["Adolescent"].values
Or simply:
adoles = df.select("Adolescent").toPandas().values #.reshape(-1) for 1-D array
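Since toPandas() was reported as slow, one commonly used speedup worth trying is enabling Arrow-based conversion before calling it. This is a configuration sketch, not something tested on the asker's data; the config key below is the standard setting for PySpark 3.x (older versions used spark.sql.execution.arrow.enabled), and `spark`/`df` are assumed to be the existing session and dataframe.

```python
# Sketch: Arrow can substantially speed up Spark -> pandas conversion
# for large frames (PySpark >= 3.0 config key shown).
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pdf = df.select("Adolescent").toPandas()
adoles = pdf["Adolescent"].values  # .reshape(-1) is unnecessary; already 1-D
```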
For distributed arrays, you can try Dask Arrays.

I haven't tested this, but assuming it works the same as numpy (there might be inconsistencies):
import dask.array as da
adoles = da.array(df.select("Adolescent").collect()) #.reshape(-1) for 1-D array