Processing multiple files in HDFS via Python


Question

I have a directory in HDFS that contains roughly 10,000 .xml files. I have a python script "processxml.py" that takes a file and does some processing on it. Is it possible to run the script on all of the files in the hdfs directory, or do I need to copy them to local first in order to do so?

For example, when I run the script on files in a local directory I have:

cd /path/to/files

for file in *.xml
do
    python /path/processxml.py "$file" > /path2/"$file"
done

So basically, how would I go about doing the same, but this time the files are in hdfs?

Solution

You basically have two options:

1) Use hadoop streaming connector to create a MapReduce job (here you will only need the map part). Use this command from the shell or inside a shell script:

hadoop jar <the location of the streamlib> \
        -D mapred.job.name=<name for the job> \
        -input /hdfs/input/dir \
        -output /hdfs/output/dir \
        -file your_script.py \
        -mapper "python your_script.py" \
        -numReduceTasks 0
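
A note on the mapper: Hadoop Streaming feeds each input split to the mapper line by line on stdin and collects whatever the mapper writes to stdout, so your_script.py has to follow that contract rather than taking a file path argument the way processxml.py does. Here is a minimal sketch of such a mapper, with process_record standing in (hypothetically) for your existing XML handling:

#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch: input arrives line by line on
# stdin, and anything written to stdout becomes the job output.
import sys

def process_record(line):
    # Hypothetical placeholder for the XML handling done in processxml.py.
    return line.strip()

if __name__ == "__main__":
    for line in sys.stdin:
        result = process_record(line)
        if result:
            sys.stdout.write(result + "\n")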

2) Create a Pig script and ship your Python code. Here is a basic example of the script:

input_data = LOAD '/hdfs/input/dir';
DEFINE mycommand `python your_script.py` ship('/path/to/your/script.py');
updated_data = STREAM input_data THROUGH mycommand PARALLEL 20;    
STORE updated_data INTO '/hdfs/output/dir';
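
To run it, save the script to a file and hand it to the Pig client. For example, it could be launched from Python with a sketch like this (assuming the pig command is on PATH and the script was saved under the hypothetical name process_xml.pig):

# Sketch: kick off the Pig job from Python. Assumes the `pig` client is on
# PATH and the script above was saved as process_xml.pig (hypothetical name).
import subprocess

subprocess.check_call(["pig", "-f", "process_xml.pig"])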
