Downloading files from Google Storage using Spark (Python) and Dataproc

Problem description

I have an application that parallelizes the execution of Python objects that process data to be downloaded from Google Storage (my project bucket). The cluster is created with Google Dataproc. The problem is that the data is never downloaded! I wrote a test program to try to understand the problem. I wrote the following functions to copy files from the bucket and to check whether creating files on the workers works:

from subprocess import call
from os.path import join

def copyDataFromBucket(filename, remoteFolder, localFolder):
  # Copy a single file from the bucket folder to a local folder
  call(["gsutil", "-m", "cp", join(remoteFolder, filename), localFolder])

def execTouch(filename, localFolder):
  # Create an empty marker file to check that writing on the worker works
  call(["touch", join(localFolder, "touched_" + filename)])

I've tested this function by calling it from a Python shell and it works. But when I run the following code using spark-submit, the files are not downloaded (and no error is raised):

# ...
filesRDD = sc.parallelize(fileList)
filesRDD.foreach(lambda myFile: copyDataFromBucket(myFile, remoteBucketFolder, '/tmp/output'))
filesRDD.foreach(lambda myFile: execTouch(myFile, '/tmp/output'))
# ...

The execTouch function works (I can see the files on each worker), but the copyDataFromBucket function does nothing.

So what am I doing wrong?

Recommended answer

The problem was clearly the Spark context. Replacing the call to "gsutil" with a call to "hadoop fs" solves it:

from subprocess import call
from os.path import join

def copyDataFromBucket(filename, remoteFolder, localFolder):
  # Use "hadoop fs" instead of gsutil so the copy also works inside Spark tasks
  call(["hadoop", "fs", "-copyToLocal", join(remoteFolder, filename), localFolder])

I also did a test that sends data to the bucket; one only needs to replace "-copyToLocal" with "-copyFromLocal".
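
A sketch of that upload variant, assuming the local file is passed first and the destination bucket folder second (the hadoop fs -copyFromLocal argument order); the function name copyDataToBucket is only an illustrative choice, not part of the original answer:

def copyDataToBucket(filename, localFolder, remoteFolder):
  # Push a locally produced file back to the bucket via the Hadoop connector
  call(["hadoop", "fs", "-copyFromLocal", join(localFolder, filename), remoteFolder])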
