pyspark and HDFS commands
Question
I would like to do some cleanup at the start of my Spark program (PySpark). For example, I would like to delete data left over in HDFS from a previous run. In Pig this can be done using commands such as
fs -copyFromLocal ....
rmf /path/to/hdfs
or locally using the sh command.
I was wondering how to do the same with PySpark.
Answer
You can execute arbitrary shell commands using, for example, subprocess.call or the sh library, so something like this should work just fine:
import subprocess
some_path = ...
# equivalent to "hadoop fs -rm -f <path>"; add "-r" to remove directories recursively
subprocess.call(["hadoop", "fs", "-rm", "-f", some_path])
If you use Python 2.x, you can try using spotify/snakebite:
from snakebite.client import Client
host = ...
port = ...
client = Client(host, port)
# delete() expects a list of paths and returns a lazy generator,
# which has to be consumed for the deletion to actually happen
list(client.delete([some_path], recurse=True))
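One design note: snakebite is a pure-Python HDFS client that speaks the NameNode's RPC protocol directly, so it avoids spawning a hadoop process for every command; the Python 2.x caveat above is why it is only suggested for that interpreter.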
hdfs3 is yet another library which can be used to do the same thing:
from hdfs3 import HDFileSystem
hdfs = HDFileSystem(host=host, port=port)
# rm() has to be called on the instance, not on the class
hdfs.rm(some_path)
Apache Arrow Python bindings are the latest option (and they are often already available on a Spark cluster, since they are required for pandas_udf):
from pyarrow import hdfs
fs = hdfs.connect(host, port)
fs.delete(some_path, recursive=True)
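Putting it together with the question's use case, the cleanup can simply run in the driver before the SparkSession is created. A minimal sketch built on the pyarrow approach above (the host, port, and path are hypothetical; also note that recent pyarrow releases deprecate pyarrow.hdfs in favor of pyarrow.fs.HadoopFileSystem):

from pyarrow import hdfs
from pyspark.sql import SparkSession

host, port = "namenode", 8020     # hypothetical NameNode address
some_path = "/data/previous_run"  # hypothetical output path of the previous run

fs = hdfs.connect(host, port)
if fs.exists(some_path):          # skip the delete when there is nothing to clean up
    fs.delete(some_path, recursive=True)

spark = SparkSession.builder.appName("cleanup-example").getOrCreate()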