Spark dataframe to numpy array via udf or without collecting to driver

Views: 70 | Published: 2020/10/16 19:59:14 | Tags: apache-spark pyspark user-defined-functions databricks horovod

Question

The real-life df is a massive dataframe that cannot be loaded into driver memory. Can this be done using a regular or pandas udf?

# Code to generate a sample dataframe

from pyspark.sql import functions as F
from pyspark.sql.types import *
import pandas as pd
import numpy as np

sample = [['123',[[0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1], [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
      ['345',[[1,0,0,0,0,1,1,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1], [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
      ['425',[[1,1,0,0,0,1,0,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1],[0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
      ]

df = spark.createDataFrame(sample,["id", "data"])

Here is the logic that needs to be parallelized without relying on driver memory.

Input: Spark dataframe
Output: numpy array to be fed into horovod (something like this: https://docs.databricks.com/applications/deep-learning/distributed-training/mnist-tensorflow-keras.html)

pandas_df = df.toPandas() # Not possible in real life
data_array = np.asarray(list(pandas_df.data.values))
data_array = data_array.reshape(data_array.shape[0], data_array.shape[1], -1, 1, order='F')
data_array = data_array.reshape(data_array.shape[0],data_array.shape[1],-1,1,1,order="F").transpose(0,1,3,2,-1)
# Some more numpy specific transformations ..
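To make the transformation above easier to reason about, here is a minimal NumPy-only sketch on a tiny stand-in for the `data` column. The helper names `transform_batch` and `transform_row` are hypothetical and do not appear in the question. The sketch traces the resulting shape and checks that applying the same reshape and transpose one row at a time gives the same result as applying it to the whole collected array, which is the property a per-row approach (for example a udf) would rely on. It does not involve Spark.

```python
import numpy as np

# Tiny stand-in for the `data` column: 2 rows, each with 2 sequences of 4 values.
rows = [
    [[0, 1, 0, 0], [1, 1, 0, 1]],
    [[1, 0, 0, 1], [0, 1, 1, 1]],
]

def transform_batch(batch):
    """Apply the question's reshape/transpose to an (n, 2, k) batch."""
    arr = np.asarray(batch)
    arr = arr.reshape(arr.shape[0], arr.shape[1], -1, 1, order="F")
    arr = arr.reshape(arr.shape[0], arr.shape[1], -1, 1, 1, order="F")
    return arr.transpose(0, 1, 3, 2, 4)

def transform_row(row):
    """Hypothetical per-row variant: treat one row as a batch of size 1."""
    return transform_batch([row])[0]

batch_result = transform_batch(rows)
row_results = np.stack([transform_row(r) for r in rows])

print(batch_result.shape)
print(np.array_equal(batch_result, row_results))
```

Because only size-1 axes are added after the first two dimensions, no values move between rows, so the work can in principle be split by row or by partition. Whether later, unshown transformations share this property is not known from the question.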

Here is an approach that does not work:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(ArrayType(IntegerType()), PandasUDFType.SCALAR)
def generate_feature(x):
    data_array = np.asarray(x)
    data_array = data_array.reshape(data_array.shape[0], ..
    ...
    return pd.Series(data_array)

df = df.withColumn("data_array", generate_feature(df.data))
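One likely reason the attempt above fails is that a scalar pandas udf declared as `ArrayType(IntegerType())` is expected to return one flat list per input row, not a single multi-dimensional array. The sketch below is an assumption rather than an answer from the source: `generate_feature_batch` is a hypothetical function written against plain pandas, so it can be tried without a cluster. It flattens each row after reshaping, and the consumer would need to restore the `(2, 1, k, 1)` shape later. Whether a nested array column can be passed into a pandas udf at all depends on the Spark and Arrow versions in use.

```python
import numpy as np
import pandas as pd

def generate_feature_batch(data: pd.Series) -> pd.Series:
    """Hypothetical udf body: map each row to one flat list of ints."""
    def transform_row(row):
        # Each row holds a few equal-length sequences; stack them to (s, k).
        arr = np.stack([np.asarray(seq) for seq in row])
        # Same per-row layout as the question's final array: (s, 1, k, 1).
        arr = arr.reshape(arr.shape[0], 1, -1, 1)
        # Flatten so the value fits an ArrayType(IntegerType()) column.
        return arr.ravel().tolist()
    return data.apply(transform_row)

# Local check with a pandas Series standing in for one udf batch.
batch = pd.Series([
    [[0, 1, 0], [1, 1, 0]],
    [[1, 0, 0], [0, 1, 1]],
])
features = generate_feature_batch(batch)
print(features.tolist())
```

This keeps every row independent, so nothing has to be collected to the driver for the transformation itself. Feeding the result into horovod is a separate step that this sketch does not cover.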
