boto3 cannot create client on pyspark worker?


Problem description


I'm trying to send data from the workers of a Pyspark RDD to an SQS queue, using boto3 to talk with AWS. I need to send data directly from the partitions, rather than collecting the RDD and sending data from the driver.


I am able to send messages to SQS via boto3 locally & from the Spark driver; also, I can import boto3 and create a boto3 session on the partitions. However, when I try to create a client or resource from the partitions I receive an error. I believe boto3 is not correctly creating a client, but I'm not entirely sure on that point. My code looks like this:

def get_client(x):  # x is the partition iterator that mapPartitions passes in
    import boto3
    client = boto3.client('sqs', region_name="us-east-1",
                          aws_access_key_id="myaccesskey",
                          aws_secret_access_key="mysecretaccesskey")
    return x

rdd_with_client = rdd.mapPartitions(get_client)

The error:

DataNotFoundError: Unable to load data for: endpoints

A longer traceback:

File "<stdin>", line 4, in get_client
  File "./rebuilt.zip/boto3/session.py", line 250, in client
    aws_session_token=aws_session_token, config=config)
  File "./rebuilt.zip/botocore/session.py", line 810, in create_client
    endpoint_resolver = self.get_component('endpoint_resolver')
  File "./rebuilt.zip/botocore/session.py", line 691, in get_component
    return self._components.get_component(name)
  File "./rebuilt.zip/botocore/session.py", line 872, in get_component
    self._components[name] = factory()
  File "./rebuilt.zip/botocore/session.py", line 184, in create_default_resolver
    endpoints = loader.load_data('endpoints')
  File "./rebuilt.zip/botocore/loaders.py", line 123, in _wrapper
    data = func(self, *args, **kwargs)
  File "./rebuilt.zip/botocore/loaders.py", line 382, in load_data
    raise DataNotFoundError(data_path=name)
DataNotFoundError: Unable to load data for: endpoints


I've also tried modifying my function to create a resource instead of the explicit client, to see if it could find & use the default client setup. In that case, my code is:

def get_resource(x):
    import boto3
    sqs = boto3.resource('sqs', region_name="us-east-1",
                         aws_access_key_id="myaccesskey",
                         aws_secret_access_key="mysecretaccesskey")
    return x

rdd_with_client = rdd.mapPartitions(get_resource)


I receive an error pointing to a has_low_level_client parameter, which is triggered because the client doesn't exist; the traceback says:

File "/usr/lib/spark/python/pyspark/rdd.py", line 2253, in pipeline_func
  File "/usr/lib/spark/python/pyspark/rdd.py", line 270, in func
  File "/usr/lib/spark/python/pyspark/rdd.py", line 689, in func
  File "<stdin>", line 4, in session_resource
  File "./rebuilt.zip/boto3/session.py", line 329, in resource
    has_low_level_client)
ResourceNotExistsError: The 'sqs' resource does not exist.
The available resources are:
   -


No resources available because, I think, there's no client to house them.


I've been banging my head against this one for a few days now. Any help appreciated!

Answer


This is because you have the boto3 bundle as a zip file.

"./rebuilt.zip/boto3"


During initialisation, botocore loads a set of JSON data files (the endpoint definitions, among others) from a data directory inside its installation folder. Because your boto3 lives inside a zip package, the loader cannot read those data files from there.


The solution is, rather than distributing boto3 inside a zip, to have boto3 installed in your Spark environment. Be careful here: you may want to install boto3 on both the master node and the worker nodes, depending on how you implement your app. The safe bet is to install it on both.
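Once boto3 is installed on the workers, creating the client inside mapPartitions works. Below is a minimal sketch, not the poster's code: the queue URL is a placeholder, credentials are assumed to come from the environment or an instance role, and the batching helper is my own addition (SQS send_message_batch accepts at most 10 entries per call):

```python
def chunked(iterable, size):
    """Yield lists of at most `size` items (SQS batches allow up to 10)."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def send_partition(partition):
    import boto3  # import on the worker, where boto3 is now installed
    client = boto3.client('sqs', region_name="us-east-1")
    for batch in chunked(partition, 10):
        client.send_message_batch(
            QueueUrl="my-queue-url",  # placeholder: use your queue's URL
            Entries=[{"Id": str(i), "MessageBody": str(m)}
                     for i, m in enumerate(batch)],
        )
    return iter([])  # mapPartitions expects an iterable back

# rdd.mapPartitions(send_partition).count()  # count() forces evaluation
```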


If you are using EMR, you can use a bootstrap action to do it.
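A bootstrap action for this could be sketched as follows (a hypothetical script; the file name and S3 path are placeholders):

```shell
#!/bin/bash
# install-boto3.sh - hypothetical EMR bootstrap action.
# Bootstrap actions run on every node (master and workers) before
# Spark starts, so boto3 ends up installed cluster-wide.
set -e
sudo pip install boto3
```

Upload the script to S3 and register it when creating the cluster, e.g. `aws emr create-cluster ... --bootstrap-actions Path=s3://my-bucket/install-boto3.sh` (the bucket name is a placeholder).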

