pyspark and HDFS commands


Problem description

I would like to do some cleanup at the start of my Spark program (Pyspark). For example, I would like to delete data left in HDFS by previous runs. In Pig this can be done using commands such as

fs -copyFromLocal ....

rmf /path/to-/hdfs

or locally using the sh command.

I was wondering how to do the same with Pyspark.

Recommended answer

You can execute arbitrary shell commands using, for example, subprocess.call or the sh library, so something like this should work just fine:

import subprocess

some_path = ...
subprocess.call(["hadoop", "fs", "-rm", "-f", some_path])

If you use Python 2.x you can try using spotify/snakebite:

from snakebite.client import Client

host = ...
port = ...
client = Client(host, port)
client.delete(some_path, recurse=True)
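
One caveat, hedged because this is from memory of the snakebite documentation: the client methods return lazy generators, and delete expects a list of paths, so in practice you may need to consume the result for the deletion to actually run, along the lines of:

# force the lazy generator to execute (passing a list of paths is an
# assumption based on snakebite's documented API)
list(client.delete([some_path], recurse=True))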

hdfs3 is yet another library which can be used to do the same thing:

from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host=host, port=port)
hdfs.rm(some_path)
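
If the path may not be there on a fresh cluster, hdfs3 also exposes an existence check; a small sketch, assuming exists behaves as its name suggests:

if hdfs.exists(some_path):
    hdfs.rm(some_path)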

Apache Arrow Python bindings are the latest option (and they are often already available on Spark clusters, since they are required for pandas_udf):

from pyarrow import hdfs

fs = hdfs.connect(host, port)
fs.delete(some_path, recursive=True)
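
Tying this back to the original question, a minimal sketch of running the cleanup just before building the session might look like the following (the existence check and the SparkSession setup are my own illustrative additions, not part of the original answer):

from pyarrow import hdfs
from pyspark.sql import SparkSession

fs = hdfs.connect(host, port)
if fs.exists(some_path):  # skip the delete on a fresh cluster
    fs.delete(some_path, recursive=True)

spark = SparkSession.builder.appName("cleanup-before-job").getOrCreate()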

