Processing multiple files in HDFS via Python


Question

I have a directory in HDFS that contains roughly 10,000 .xml files. I have a python script "processxml.py" that takes a file and does some processing on it. Is it possible to run the script on all of the files in the hdfs directory, or do I need to copy them to local first in order to do so?

For example, when I run the script on files in a local directory I have:

cd /path/to/files

for file in *.xml
do
    python /path/processxml.py "$file" > /path2/"$file"
done

So basically, how would I go about doing the same, but this time the files are in hdfs?

Solution

You basically have two options:

1) Use hadoop streaming connector to create a MapReduce job (here you will only need the map part). Use this command from the shell or inside a shell script:

hadoop jar <the location of the streamlib> \
        -D mapred.job.name=<name for the job> \
        -input /hdfs/input/dir \
        -output /hdfs/output/dir \
        -file your_script.py \
        -mapper "python your_script.py" \
        -numReduceTasks 0
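
A note on the mapper: Hadoop Streaming feeds each input split to the mapper line by line on stdin and collects whatever the mapper writes to stdout, so your_script.py has to follow that contract rather than taking a file path argument the way processxml.py does. Here is a minimal sketch of such a mapper, with process_record standing in (hypothetically) for your existing XML handling:

#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch: input arrives line by line on
# stdin, and anything written to stdout becomes the job output.
import sys

def process_record(line):
    # Hypothetical placeholder for the XML handling done in processxml.py.
    return line.strip()

if __name__ == "__main__":
    for line in sys.stdin:
        result = process_record(line)
        if result:
            sys.stdout.write(result + "\n")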

2) Create a Pig script and ship your Python code. Here is a basic example of the script:

input_data = LOAD '/hdfs/input/dir';
DEFINE mycommand `python your_script.py` ship('/path/to/your/script.py');
updated_data = STREAM input_data THROUGH mycommand PARALLEL 20;    
STORE updated_data INTO '/hdfs/output/dir';
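
To run it, save the script to a file and hand it to the Pig client. For example, it could be launched from Python with a sketch like this (assuming the pig command is on PATH and the script was saved under the hypothetical name process_xml.pig):

# Sketch: kick off the Pig job from Python. Assumes the `pig` client is on
# PATH and the script above was saved as process_xml.pig (hypothetical name).
import subprocess

subprocess.check_call(["pig", "-f", "process_xml.pig"])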
