How to get filename when running mapreduce job on EC2?


Problem Description



I am learning Elastic MapReduce and started off with the Word Splitter example provided in the Amazon tutorial section (code shown below). The example produces word counts for all the words across all the input documents provided.

But I want to get word counts broken down by file name, i.e. the count of a word in just one particular document. Since the Python word-count code takes its input from stdin, how do I tell which input line came from which document?

Thanks.

#!/usr/bin/python

import sys
import re

def main(argv):
  # Match a word: a letter followed by letters or digits
  pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
  # Iterating over sys.stdin stops cleanly at end of input, so the
  # original try/except on a string exception (invalid Python) is not needed
  for line in sys.stdin:
    for word in pattern.findall(line):
      # "LongValueSum:" asks the Aggregate reducer to sum the 1s per word
      print "LongValueSum:" + word.lower() + "\t" + "1"

if __name__ == "__main__":
  main(sys.argv)

Solution

In the typical WordCount example, the name of the file a map task is processing is ignored, since the job output contains the consolidated word count for all the input files rather than counts at the file level. To get word counts at the file level, the input file name has to be used. Mappers written in Python can read the file name from the task environment: Hadoop Streaming exports configured parameters as environment variables with dots replaced by underscores, so map.input.file is available as os.environ["map_input_file"]. The list of task execution environment variables is in the Map/Reduce tutorial at http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Configured+Parameters.
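As a rough sketch of the lookup (the exact variable name differs across Hadoop versions, so this checks a couple of likely candidates rather than assuming one):

```python
import os

def get_input_filename():
    """Return the input file name for the current map task, if exported.

    Hadoop Streaming exposes configured parameters as environment
    variables with dots replaced by underscores; the exact name varies
    by Hadoop version, so a few likely candidates are checked here.
    """
    for var in ("map_input_file", "mapreduce_map_input_file"):
        if var in os.environ:
            return os.environ[var]
    return "unknown"
```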

The mapper, instead of just emitting the key/value pair <Hello, 1>, should also include the name of the input file being processed. For example, the map could emit <input.txt, <Hello, 1>>, where input.txt is the key and <Hello, 1> is the value.
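In Streaming, keys and values are just tab-separated text, so one practical way to realize this (the helper name and key encoding here are illustrative, not part of the tutorial code) is to fold the file name into the emitted key:

```python
import re

WORD = re.compile("[a-zA-Z][a-zA-Z0-9]*")

def map_line(line, filename):
    """Yield (key, value) pairs with the file name folded into the key,
    so the shuffle groups counts per (file, word) rather than per word."""
    for word in WORD.findall(line):
        yield (filename + "," + word.lower(), "1")
```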

Now, all the word counts for a particular file will be processed by a single reducer. The reducer must then aggregate the word counts for that particular file.
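A reducer along these lines (a sketch, assuming the framework has already sorted the mapper output by key, as Hadoop guarantees for reducer input) would sum the counts per (file, word) key:

```python
from itertools import groupby

def reduce_pairs(sorted_pairs):
    """Sum the counts for each key in a key-sorted stream of
    (key, count) pairs, returning (key, total) tuples."""
    totals = []
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        totals.append((key, sum(int(count) for _, count in group)))
    return totals
```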

As usual, a Combiner would help decrease the network chatter between the mapper and the reducer, and would also let the job complete faster.

Check Data-Intensive Text Processing with MapReduce for more algorithms on text processing.

