pyspark and HDFS commands
Question
I would like to do some cleanup at the start of my Spark program (PySpark). For example, I would like to delete data left over in HDFS from a previous run. In Pig this can be done using commands such as
fs -copyFromLocal ....
rmf /path/to/hdfs
or locally using the sh command.
I was wondering how to do the same with PySpark.
Answer
You can execute arbitrary shell commands using, for example, subprocess.call or the sh library, so something like this should work just fine:
import subprocess
some_path = ...
# equivalent to "hadoop fs -rm -f <path>"; add "-r" to remove directories recursively
subprocess.call(["hadoop", "fs", "-rm", "-f", some_path])
If you use Python 2.x, you can try using spotify/snakebite:
from snakebite.client import Client
host = ...
port = ...
client = Client(host, port)
# delete() expects a list of paths and returns a lazy generator,
# which has to be consumed for the deletion to actually happen
list(client.delete([some_path], recurse=True))
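One design note: snakebite is a pure-Python HDFS client that speaks the NameNode's RPC protocol directly, so it avoids spawning a hadoop process for every command; the Python 2.x caveat above is why it is only suggested for that interpreter.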
hdfs3 is yet another library which can be used to do the same thing:
from hdfs3 import HDFileSystem
hdfs = HDFileSystem(host=host, port=port)
# rm() has to be called on the instance, not on the class
hdfs.rm(some_path)
Apache Arrow Python bindings are the latest option (and they are often already available on a Spark cluster, since they are required for pandas_udf):
from pyarrow import hdfs
fs = hdfs.connect(host, port)
fs.delete(some_path, recursive=True)
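Putting it together with the question's use case, the cleanup can simply run in the driver before the SparkSession is created. A minimal sketch built on the pyarrow approach above (the host, port, and path are hypothetical; also note that recent pyarrow releases deprecate pyarrow.hdfs in favor of pyarrow.fs.HadoopFileSystem):

from pyarrow import hdfs
from pyspark.sql import SparkSession

host, port = "namenode", 8020     # hypothetical NameNode address
some_path = "/data/previous_run"  # hypothetical output path of the previous run

fs = hdfs.connect(host, port)
if fs.exists(some_path):          # skip the delete when there is nothing to clean up
    fs.delete(some_path, recursive=True)

spark = SparkSession.builder.appName("cleanup-example").getOrCreate()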