Running a Python Apache Beam Pipeline on Spark


Problem description

I am giving Apache Beam (with the Python SDK) a try, so I created a simple pipeline and tried to deploy it on a Spark cluster.

from apache_beam.options.pipeline_options import PipelineOptions
import apache_beam as beam

# Start with the local DirectRunner.
op = PipelineOptions([
        "--runner=DirectRunner"
    ]
)


with beam.Pipeline(options=op) as p:
    # Prints 2, 3, 4 (element order is not guaranteed).
    p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x + 1) | beam.Map(print)

This pipeline works well with the DirectRunner. So, to deploy the same code on Spark (since portability is a key concept in Beam)...

First, I edited the PipelineOptions as mentioned here:

op = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",  # Beam job server endpoint
        "--environment_type=LOOPBACK"     # run the SDK harness in this Python process
    ]
)

The job_endpoint is the URL of the Docker container running the Beam Spark job server, which I started with the following command:

docker run --net=host apache/beam_spark_job_server:latest --spark-master-url=spark://SPARK_URL:SPARK_PORT
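
As an aside (my assumption, not something stated in the original post): the job server and the SDK communicate over Beam's portability API, so it is safer to pin the image tag to the exact Beam version installed locally rather than relying on latest, for example:

docker run --net=host apache/beam_spark_job_server:2.25.0 --spark-master-url=spark://SPARK_URL:SPARK_PORT

where 2.25.0 stands in for whatever version pip show apache-beam reports.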

This is supposed to work, but the job fails on Spark with this error:

20/10/31 14:35:58 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.

java.io.InvalidClassException: org.apache.spark.deploy.ApplicationDescription; local class incompatible: stream classdesc serialVersionUID = 6543101073799644159, local class serialVersionUID = 1574364215946805297

Also, I see this WARN in the beam_spark_job_server logs:

WARN org.apache.beam.runners.spark.translation.SparkContextFactory: Creating a new Spark Context.

Any idea where the problem is? Is there any other way to run Python Beam pipelines on Spark without going through a containerized service?

Recommended answer

This could happen due to a version mismatch between the version of the Spark client contained in the job server and the version of Spark to which you are submitting the job.
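
A concrete way to act on this (my sketch, not part of the original answer): check which Spark version the cluster actually runs, then pick the job server image built against the same major version. Recent Beam releases publish a separate image for Spark 3:

# Confirm the cluster's Spark version.
spark-submit --version

# apache/beam_spark_job_server is built against Spark 2; for a Spark 3
# cluster, use the Spark 3 image instead.
docker run --net=host apache/beam_spark3_job_server:latest --spark-master-url=spark://SPARK_URL:SPARK_PORT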

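On the last question (avoiding the containerized service): a minimal sketch, assuming a recent Beam release. When the Python SDK is given --runner=SparkRunner instead of PortableRunner, it downloads and starts the matching Beam job server jar itself, so no Docker container is needed; the master URL below is a placeholder:

from apache_beam.options.pipeline_options import PipelineOptions
import apache_beam as beam

# SparkRunner launches an embedded job server jar locally, unlike
# PortableRunner, which expects one to already be listening at --job_endpoint.
op = PipelineOptions([
    "--runner=SparkRunner",
    "--spark_master_url=spark://SPARK_URL:SPARK_PORT",  # placeholder master URL
    "--environment_type=LOOPBACK",
])

with beam.Pipeline(options=op) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x + 1) | beam.Map(print)

The embedded jar's version is tied to the installed apache-beam package, which avoids an SDK/job-server mismatch, though its bundled Spark client must still match the cluster's Spark version.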