mrjob: how does the example automatically know how to find lines in a text file?

Question

I'm trying to better understand the mrjob example:

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()

I run it with

$ python word_count.py my_file.txt

and it works as expected, but I don't understand how it automatically knows to read a text file and split it line by line. I'm also not sure what the _ does.

From what I understand, mapper() generates the three key/value pairs for each line, correct? What if I want to work with every file in a folder?

And does reducer() automatically know how to add up each key's values?

What if I want to run unit tests via map reduce: what would the mapper and reducer look like? Is it even necessary?

Recommended answer

The mapper method receives a key-value pair that has already been parsed from the input text. mrjob uses Hadoop Streaming: each input is split on the newline character, and each line is then turned into a key-value pair according to the input protocol in use. The framework takes care of this for you, so you don't have to do any heavy lifting; you can just assume you will get a proper key and value.
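
As a rough mental model of what happens before and after your methods run, the flow can be sketched in plain Python. This is not mrjob's implementation: run_locally, mapper, and reducer below are hypothetical stand-ins, and the sketch assumes the default behaviour where each line arrives as the value with no key.

```python
from collections import defaultdict


def mapper(_, line):
    # Same logic as the mapper in the question; the key is unused.
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1


def reducer(key, values):
    # values is an iterator over everything emitted for this key.
    yield key, sum(values)


def run_locally(text, mapper, reducer):
    """Hypothetical stand-in for the framework: split, map, group, reduce."""
    grouped = defaultdict(list)
    # 1. Split the input on newlines; each line becomes a (None, line) pair.
    for line in text.splitlines():
        for key, value in mapper(None, line):
            # 2. Collect every emitted value under its key (the "shuffle").
            grouped[key].append(value)
    # 3. Call the reducer once per key, in sorted key order.
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, iter(grouped[key])):
            results[out_key] = out_value
    return results


print(run_locally("hello world\nfoo bar baz\n", mapper, reducer))
```

The point of the sketch is that neither function contains any file handling or grouping logic; both only see the pairs that are handed to them.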

However, you do need to specify what kind of input your text files contain. For example, if the key and/or value are not plain text (as in the original question) but serialized JSON, you use JSONProtocol, JSONValueProtocol, etc. as the input protocol instead of RawValueProtocol, which is the default.
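
To make the difference between these protocols concrete, here is a simplified sketch of what each one does to a single input line. The three functions are hypothetical stand-ins written for illustration, not mrjob's classes (the real ones live in mrjob.protocol and operate on bytes). The sketch assumes JSONProtocol lines hold a JSON key and a JSON value separated by a tab.

```python
import json


def read_raw_value(line):
    # Like RawValueProtocol: no key, the whole line is the value.
    return None, line


def read_json_value(line):
    # Like JSONValueProtocol: no key, the line is decoded as one JSON value.
    return None, json.loads(line)


def read_json(line):
    # Like JSONProtocol: JSON key and JSON value, separated by a tab.
    raw_key, raw_value = line.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


print(read_raw_value('{"user": "ann", "n": 3}'))
print(read_json_value('{"user": "ann", "n": 3}'))
print(read_json('"ann"\t{"n": 3}'))

# In a real job the protocol is chosen with a class attribute, e.g.
#     INPUT_PROTOCOL = JSONValueProtocol   # imported from mrjob.protocol
```

With a JSON value protocol, the mapper's second argument would already be a decoded object (here a dict) rather than a raw string.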

For the initial mapper, each line is read into the value (by RawValueProtocol), which is why you don't receive a key; the key is always None. Using _ is just a Python convention for an unused dummy variable. (That said, _ is a perfectly valid Python variable name. You can write something like a = 3; _ = 2; b = a + _. Blasphemy, isn't it?)

mrjob can take multiple input files. For example, you can run

$ python wordcount.py text1.txt text2.txt

If you want all the text files in a directory as input to an mrjob job, you can do something like

$ python wordcount.py inputdir/*.txt

or just

$ python wordcount.py inputdir

and all the selected files are used as input.

What the reducer receives is a key and an iterator over all the values associated with that key. So in your example, the variable values in the reducer method is an iterator. If you want to do something with all the values, you need to actually iterate over them. In the specific example in the question, the built-in function sum accepts any iterable as its argument, which is why you can do it in one shot. It is effectively equivalent to sum([value for value in values]).
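
One practical consequence of values being an iterator is that it can be consumed only once. The sketch below uses an ordinary generator as a stand-in for what the reducer receives (an assumption for illustration; fake_values and mean_reducer are hypothetical names) and shows why a reducer that needs two passes, such as a mean, should collect the values first.

```python
def fake_values():
    # Hypothetical stand-in for the values iterator a reducer receives.
    yield from [3, 5, 4]


values = fake_values()
total = sum(values)       # the first pass consumes the iterator
leftover = sum(values)    # nothing is left for a second pass
print(total, leftover)


def mean_reducer(key, values):
    # Materialize once so both the sum and the count are available.
    collected = list(values)
    yield key, sum(collected) / len(collected)


print(list(mean_reducer("words", fake_values())))
```

Collecting into a list holds every value for the key in memory, so for keys with very many values a single loop that tracks a running sum and count may be the better choice.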

I actually don't know how you would unit test mrjob scripts. I have usually just run the job on a small chunk of test data before a production run.
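
One possible approach, offered as a sketch rather than an established mrjob recipe: because the mapper and reducer are ordinary generator methods, their logic can be checked with unittest without running a job at all. The example assumes the logic has been factored into plain functions (count_line and sum_values are hypothetical names) that the MRJob methods would delegate to.

```python
import unittest


def count_line(line):
    # Body of the mapper from the question, as a plain generator function.
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1


def sum_values(key, values):
    # Body of the reducer from the question.
    yield key, sum(values)


class WordFrequencyCountTest(unittest.TestCase):

    def test_mapper_emits_three_pairs_per_line(self):
        self.assertEqual(
            list(count_line("hello big world")),
            [("chars", 15), ("words", 3), ("lines", 1)],
        )

    def test_reducer_sums_values_for_a_key(self):
        self.assertEqual(
            list(sum_values("lines", iter([1, 1, 1]))),
            [("lines", 3)],
        )


# Run the tests explicitly so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordFrequencyCountTest)
result = unittest.TextTestRunner().run(suite)
```

In the job class, mapper(self, _, line) would then reduce to yield from count_line(line), and the small-sample run described above still serves as an end-to-end check.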
