MapReduce: how to allow the Mapper to read an XML file for lookup


Problem Description


In my MapReduce jobs, I pass a product name to the Mapper as a string argument. The Mapper.py script imports a secondary script called Process.py that does something with the product name and returns some strings for the Mapper to emit. The Mapper then emits those strings to the Hadoop framework so they can be picked up by the Reducer. Everything works fine except for the following:

The Process.py script contains a dictionary of lookup values that I want to move from inside the script to an xml file for easier updating. I have tested this locally and it works fine if I include the Windows path to the xml file in the Process.py script. However, testing this in the Hadoop MapReduce environment doesn't work for some reason.

I have tried specifying the HDFS path to the xml document inside the Process.py script, and I have tried adding the name of the xml document as a -file argument in the MapReduce job command, but neither has worked.

For example, inside the Process.py, I have tried:
xml_file = r'appers@hdfs.network.com:/nfs_home/appers/cnielsen/product_lookups.xml'
and
xml_file = r'/nfs_home/appers/cnielsen/product_lookups.xml'

In the MapReduce command, I have included the name of the xml file as a -file argument. For example:
... -file product_lookups.xml -reducer ...

The question is: in the MapReduce environment, how do I allow the Process.py script to read this xml document that is stored on HDFS?

Solution

Here is an end-to-end example that adapts the techniques mentioned in this previous question to fit your question more closely.

Python read file as stream from HDFS

This is a small Python Hadoop Streaming application that reads key-value pairs, checks the key against an XML configuration file stored in HDFS, and then emits the value only if the key matches the configuration. The matching logic is off-loaded into a separate Process.py module, which reads the XML configuration file from HDFS by using an external call to hdfs dfs -cat.

First, we create a directory named pythonapp, containing the Python source files for our implementation. Later, when we submit the streaming job, we'll pass this directory in the -files argument.
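
On the local filesystem, the layout is just a directory holding the two scripts shown below:

pythonapp/
    Mapper.py
    Process.py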

Why do we put the files into an intermediate directory instead of just listing each file separately in the -files argument? That's because when YARN localizes the files for execution in containers, it introduces a layer of symlink indirection. Python then can't load the module correctly through the symlink. The solution is to package both files into the same directory. Then, when YARN localizes the files, the symlink indirection is done at the directory level instead of the individual files. Since both the main script and the module are physically in the same directory, Python will be able to load the module correctly. This question explains the issue in more detail:

How to import a custom module in a MapReduce job?

Mapper.py

import sys
from Process import match

# Read tab-delimited key-value pairs from standard input and emit the value
# only when the key matches the configured match string (see Process.py).
for line in sys.stdin:
    key, value = line.split()
    if match(key):
        print value

Process.py

import subprocess
import xml.etree.ElementTree as ElementTree

# Stream the XML configuration file out of HDFS once, at module import time,
# so each task pays the lookup cost once rather than once per input record.
hdfsCatProcess = subprocess.Popen(
        ['hdfs', 'dfs', '-cat', '/pythonAppConf.xml'],
        stdout=subprocess.PIPE)
pythonAppConfXmlTree = ElementTree.parse(hdfsCatProcess.stdout)
matchString = pythonAppConfXmlTree.find('./matchString').text.strip()

def match(key):
    # True only when the incoming key equals the configured matchString.
    return key == matchString
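
Note that if the hdfs dfs -cat call fails (a wrong path, HDFS unreachable), ElementTree.parse will choke on an empty stream with a fairly opaque error. A slightly more defensive sketch of the same idea, not part of the original answer, checks the exit code of the subprocess first:

import subprocess
import sys
import xml.etree.ElementTree as ElementTree

# Same config location as above; adjust if your file lives elsewhere.
CONF_PATH = '/pythonAppConf.xml'

# Read the whole file and wait for the process so its exit code can be checked.
catProcess = subprocess.Popen(['hdfs', 'dfs', '-cat', CONF_PATH],
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
confXml, catErrors = catProcess.communicate()
if catProcess.returncode != 0:
    sys.stderr.write('Could not read %s from HDFS:\n%s\n' % (CONF_PATH, catErrors))
    sys.exit(1)

matchString = ElementTree.fromstring(confXml).find('./matchString').text.strip()

def match(key):
    return key == matchString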

Next, we put 2 files into HDFS. /testData is the input file, containing tab-delimited key-value pairs. /pythonAppConf.xml is the XML file, where we can configure a specific key to match.

/testData

foo 1
bar 2
baz 3

/pythonAppConf.xml

<pythonAppConf>
    <matchString>foo</matchString>
</pythonAppConf>

Since we have set matchString to foo, and since our input file contains only a single record with key set to foo, we expect the output of running the job to be a single line containing the value corresponding to key foo, which is 1. Taking it for a test run, we do get the expected results.

> hadoop jar share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -D mapreduce.job.reduces=0 \
      -files pythonapp \
      -input /testData \
      -output /streamingOut \
      -mapper 'python pythonapp/Mapper.py'

> hdfs dfs -cat /streamingOut/part*
1   

An alternative way to do this would be to specify the HDFS file in the -files argument. This way, YARN will pull the XML file as a localized resource to the individual nodes running the containers before the Python script launches. Then, the Python code can open the XML file as if it were a local file in the working directory. For very large jobs running multiple tasks/containers, this technique is likely to outperform calling hdfs dfs -cat from each task.

To test this technique, we can try a different version of the Process.py module.

Process.py

import xml.etree.ElementTree as ElementTree

# When the config file is passed through -files, YARN copies it into the
# task's working directory, so it can be opened like any local file.
pythonAppConfXmlTree = ElementTree.parse('pythonAppConf.xml')
matchString = pythonAppConfXmlTree.find('./matchString').text.strip()

def match(key):
    return key == matchString

The command line invocation changes to specify an HDFS path in -files, and once again, we see the expected results.

> hadoop jar share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -D mapreduce.job.reduces=0 \
      -files pythonapp,hdfs:///pythonAppConf.xml \
      -input /testData \
      -output /streamingOut \
      -mapper 'python pythonapp/Mapper.py'

> hdfs dfs -cat /streamingOut/part*
1   
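
The two variants can also be combined into a single Process.py that prefers a localized copy of the config and falls back to hdfs dfs -cat when it is not present. The following is only a sketch of that idea, not part of the original answer; it assumes the same file name and HDFS path used throughout this example:

import os
import subprocess
import xml.etree.ElementTree as ElementTree

LOCAL_CONF = 'pythonAppConf.xml'   # present when the file was passed via -files
HDFS_CONF = '/pythonAppConf.xml'   # fallback location in HDFS

if os.path.exists(LOCAL_CONF):
    # -files already localized the file into the task's working directory.
    confTree = ElementTree.parse(LOCAL_CONF)
else:
    # Otherwise stream the file out of HDFS, as in the first version.
    catProcess = subprocess.Popen(['hdfs', 'dfs', '-cat', HDFS_CONF],
                                  stdout=subprocess.PIPE)
    confTree = ElementTree.parse(catProcess.stdout)

matchString = confTree.find('./matchString').text.strip()

def match(key):
    return key == matchString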

The Apache Hadoop documentation discusses usage of the -files option to pull HDFS files locally here.

http://hadoop.apache.org/docs/r2.7.1/hadoop-streaming/HadoopStreaming.html#Working_with_Large_Files_and_Archives
