pandas UDF and pyarrow 0.15.0
Question
I have recently started getting a bunch of errors on a number of pyspark jobs running on EMR clusters. The errors are:
java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:98)
at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:96)
at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:127)...
They all seem to happen in apply functions of a pandas Series. The only change I found is that pyarrow was updated on Saturday (05/10/2019). Tests seem to work with 0.14.1.
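For context, the failing jobs follow the scalar pandas UDF pattern, where the UDF body calls Series.apply. A minimal pandas-only sketch of that pattern (the function name and values here are hypothetical illustrations, not taken from the failing jobs):

```python
import pandas as pd

# Hypothetical element-wise logic of the kind a scalar pandas UDF runs.
# Under pyspark this function would be wrapped with @pandas_udf, and the
# input/output Series would be shipped between the JVM and Python via
# Arrow IPC -- the step that raises the IllegalArgumentException above.
def add_tax(prices: pd.Series) -> pd.Series:
    return prices.apply(lambda p: round(p * 1.08, 2))

print(add_tax(pd.Series([10.0, 20.0])).tolist())  # [10.8, 21.6]
```

The Python-side logic itself is fine; the failure happens in the Arrow serialization layer between this code and the JVM.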
So my question is: does anyone know if this is a bug in the newly updated pyarrow, or is there some significant change that will make pandas UDFs hard to use in the future?
Answer
It's not a bug. We made an important protocol change in 0.15.0 that makes the default behavior of pyarrow incompatible with older versions of Arrow in Java -- your Spark environment seems to be using an older version.
Your options are:
- Set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 wherever Python is being used
- Downgrade to pyarrow < 0.15.0 for now
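For the first option, the variable must be set before the first Arrow IPC call, and it has to reach the executors as well as the driver. A minimal sketch of setting it from the driver script (the spark.executorEnv mechanism shown in the comment is standard Spark configuration, but exactly where the variable needs to live can depend on how your EMR cluster launches Python workers):

```python
import os

# Tell pyarrow >= 0.15.0 to emit the pre-0.15 IPC format that the older
# Arrow Java library bundled with Spark can read. For executors, pass it
# through Spark as well, e.g.:
#   spark-submit --conf spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT=1 ...
# or add "export ARROW_PRE_0_15_IPC_FORMAT=1" to conf/spark-env.sh.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

print(os.environ["ARROW_PRE_0_15_IPC_FORMAT"])  # 1
```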
Hopefully the Spark community will be able to upgrade to 0.15.0 in Java soon so this issue goes away.
This is discussed at http://arrow.apache.org/blog/2019/10/06/0.15.0-release/