Pandas scalar UDF failing, IllegalArgumentException


Question

First off, I apologize if my issue is simple. I did spend a lot of time researching it.

I am trying to follow the manual. Here is my code:

from pyspark import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import SQLContext
sc.install_pypi_package("pandas")   # EMR Notebooks helper: installs the package on the cluster
import pandas as pd
sc.install_pypi_package("pyarrow")

df = spark.createDataFrame(
    [("a", 1, 0), ("a", -1, 42), ("b", 3, -1), ("b", 10, -2)],
    ("key", "value1", "value2")
)

df.show()

@F.pandas_udf("double", F.PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return pd.Series(v + 1)

df.select(pandas_plus_one(df.value1)).show()
# Also fails
#df.select(pandas_plus_one(df["value1"])).show()
#df.select(pandas_plus_one("value1")).show()
#df.select(pandas_plus_one(F.col("value1"))).show()
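As a sanity check, the UDF body itself is plain pandas and can be exercised locally without Spark (a sketch; `plus_one` here is just the UDF logic with the Spark decorator stripped, which confirms the failure is in the Arrow transport rather than the function):

```python
import pandas as pd

# Same logic as the UDF body above: element-wise add one to the input series.
def plus_one(v: pd.Series) -> pd.Series:
    return pd.Series(v + 1)

print(plus_one(pd.Series([1, -1, 3, 10])).tolist())  # [2, 0, 4, 11]
```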

The script fails at the last statement:


An error occurred while calling o209.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 8.0 failed 4 times, most recent failure: Lost task 2.3 in stage 8.0 (TID 30, ip-10-160-2-53.ec2.internal, executor 3): java.lang.IllegalArgumentException at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543) at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132) at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181) at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172) at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65) at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162) at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) ...

What am I missing here? I am just following the manual. Thanks for your help.

Answer

PyArrow released a new version, 0.15, on October 5, 2019, which causes pandas UDFs to throw this error. Spark needs to be upgraded to be compatible with it (which might take some time). You can follow the progress here: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-29367?filter=allissues

Solution:

  1. Install PyArrow 0.14.1 or lower: sc.install_pypi_package("pyarrow==0.14.1"), or
  2. Set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 wherever you are running Python.
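One way to apply the second workaround (a minimal sketch; the executor-side propagation assumes Spark's standard `spark.executorEnv.*` configuration mechanism, which only takes effect if supplied before the SparkContext is created):

```python
import os

# Driver side: the Arrow serializer in the local Python process reads this flag.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

# Executor side: propagate the same variable through Spark's executorEnv config.
# This must be supplied when the session is first created, e.g.:
#
#   from pyspark.sql import SparkSession
#   spark = (SparkSession.builder
#            .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
#            .getOrCreate())
print(os.environ["ARROW_PRE_0_15_IPC_FORMAT"])  # 1
```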

