Stack Overflow while processing several columns with a UDF

Views: 257 | Posted: 2016/5/22 15:39:12 | Tags: python apache-spark pyspark spark-dataframe


Question

I have a DataFrame with many columns of str type, and I want to apply a function to all of those columns without renaming them or adding more columns. I tried a for-in loop that calls withColumn (see the example below), but when I run the code it usually fails with a stack overflow (it rarely works). The DataFrame is not big at all; it has only ~15000 records.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# df is a DataFrame
def lowerCase(string):
    return string.strip().lower()

lowerCaseUDF = udf(lowerCase, StringType())

for (columnName, kind) in df.dtypes:
    if kind == "string":
        df = df.withColumn(columnName, lowerCaseUDF(df[columnName]))

df.select("Tipo_unidad").distinct().show()

The complete error is very long, so I decided to paste only a few lines of it; the full trace was linked in the original post.

Py4JJavaError: An error occurred while calling o516.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 38, worker2.mcbo.mood.com.ve): java.lang.StackOverflowError
    at java.io.ObjectInputStream$BlockDataInputStream.readByte(ObjectInputStream.java:2774)

I suspect the problem arises because this code launches many jobs (one for each column of type string). Could you show me an alternative, or tell me what I am doing wrong?

Recommended answer

Try something like this:

from pyspark.sql.functions import col, lower, trim

exprs = [
    lower(trim(col(c))).alias(c) if t == "string" else col(c) 
    for (c, t) in df.dtypes
]

df.select(*exprs)

This approach has two main advantages over your current solution:

It requires only a single projection instead of one projection per string column, so there is no growing lineage, which is most likely what causes the stack overflow.
It operates directly on the internal representation without passing data to Python (BatchPythonProcessing).
