AWS EMR - ModuleNotFoundError: No module named 'pyarrow'

Problem Description

I am running into this problem with the Apache Arrow Spark integration.

Using AWS EMR with Spark 2.4.3.

I tested this on both a local single-machine Spark instance and a Cloudera cluster, and everything works fine there.

I set these in spark-env.sh:

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3

and confirmed it in the spark shell:

spark.version
2.4.3
sc.pythonExec
python3
sc.pythonVer
python3

Running a basic pandas_udf with Apache Arrow integration results in an error:

from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").apply(subtract_mean).show()

Error on AWS EMR [no error on Cloudera or the local machine]:

ModuleNotFoundError: No module named 'pyarrow'

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
        at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
        at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:291)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:283)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Anyone have an idea what is going on? Some possible ideas:

Could PYTHONPATH be causing a problem because I am not using Anaconda?

Does it have to do with the Spark version and the Arrow version?

This is the strangest thing because I am using the same versions across all 3 platforms [local desktop, Cloudera, EMR] and only EMR is not working ...

I logged into all 4 EMR EC2 data nodes and tested that I can import pyarrow; it works totally fine on its own, but not when trying to use it with Spark:

# test
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {'one': [20, np.nan, 2.5],
     'two': ['january', 'february', 'march'],
     'three': [True, False, True]},
    index=list('abc'))
table = pa.Table.from_pandas(df)
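
One way to pin down the mismatch is to run the import on the executors themselves rather than in a login shell on a node. A minimal diagnostic sketch, assuming the same sc from the spark shell above (the partition count of 8 is arbitrary):

# Probe which interpreter the executors launch and whether that interpreter
# can import pyarrow; a successful import in a node's login shell says
# nothing about the Python that Spark workers actually use.
import sys

def probe(_):
    try:
        import pyarrow
        version = pyarrow.__version__
    except ImportError:
        version = None
    yield (sys.executable, version)

print(sc.parallelize(range(8), 8).mapPartitions(probe).distinct().collect())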

Solution

On EMR, python3 is not resolved by default; you have to make it explicit. One way to do this is to pass a config.json file as you're creating the cluster. It's available in the Edit software settings section of the AWS EMR UI. A sample JSON file looks like this:

[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  },
  {
    "Classification": "yarn-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]
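
If you create the cluster programmatically instead of through the UI, the same JSON can be passed as the Configurations argument. A minimal boto3 sketch, where the region, release label, instance types, and IAM role names are illustrative assumptions:

import json
import boto3

# Create an EMR cluster with the spark-env / yarn-env exports from config.json.
# All names below are placeholders; adjust them to your account and region.
emr = boto3.client('emr', region_name='us-east-1')

with open('config.json') as f:
    configurations = json.load(f)

response = emr.run_job_flow(
    Name='pyarrow-spark-cluster',
    ReleaseLabel='emr-5.25.0',      # illustrative; use a release that ships Spark 2.4.3
    Applications=[{'Name': 'Spark'}],
    Configurations=configurations,
    Instances={
        'MasterInstanceType': 'm5.xlarge',
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 5,
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])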

You also need the pyarrow module installed on all core nodes, not only on the master. For that you can use a bootstrap script when creating the cluster in AWS. Again, a sample bootstrap script can be as simple as this:

#!/bin/bash
sudo python3 -m pip install pyarrow==0.13.0
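
EMR fetches bootstrap scripts from S3, so the script has to be uploaded first and then referenced as a bootstrap action; bootstrap actions run on every node as it provisions, which is what gets pyarrow onto the core nodes. A short sketch continuing the boto3 example above (bucket and key names are illustrative):

import boto3

# Upload the bootstrap script so EMR can fetch it at provisioning time.
s3 = boto3.client('s3')
s3.upload_file('bootstrap.sh', 'my-emr-bucket', 'bootstrap/install-pyarrow.sh')

# Reference it when creating the cluster, e.g. by passing
# BootstrapActions=bootstrap_actions in the run_job_flow call above.
bootstrap_actions = [{
    'Name': 'install pyarrow',
    'ScriptBootstrapAction': {
        'Path': 's3://my-emr-bucket/bootstrap/install-pyarrow.sh',
    },
}]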
