MapReduce (secondary) sorting / filtering - how?


Problem Description

I have a logfile of timestamped values (concurrent users) of different "zones" of a chatroom webapp in the format "Timestamp; Zone; Value". For each zone, there is one value per minute of each day.

For each zone, I want to list the maximum value per day, ordered descending by this maximum value.

So, an input file of

#timestamp; zone; value
2011-01-01 00:00:00; 1; 10
2011-01-01 00:00:00; 2; 22
2011-01-01 00:01:00; 1; 11
2011-01-01 00:01:00; 2; 21

2011-01-02 00:00:00; 1; 12
2011-01-02 00:00:00; 2; 20

should produce for zone 1:

2011-01-02    12
2011-01-01    11

and for zone 2:

2011-01-01    22
2011-01-02    20

How would I approach this? IMHO I will need more than one M/R step.

What I have implemented so far is:

  • A mapper that collects a Text key "YYYY-MM-DD/Zone" and an IntWritable value "value", and
  • A reducer that identifies the maximum value per key (i.e. per zone per day).

This results in a file like

2011-01-01/1    11
2011-01-01/2    22
2011-01-02/1    12
2011-01-02/2    20
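
A minimal sketch of such a mapper and reducer, assuming the ";"-separated input shown above (class names and the parsing details are illustrative, not the original code):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: "2011-01-01 00:00:00; 1; 10"  ->  key "2011-01-01/1", value 10
class DailyZoneMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String s = line.toString().trim();
        if (s.isEmpty() || s.startsWith("#")) return;      // skip blank and header lines
        String[] f = s.split(";");
        String day = f[0].trim().substring(0, 10);          // "YYYY-MM-DD"
        String zone = f[1].trim();
        int value = Integer.parseInt(f[2].trim());
        context.write(new Text(day + "/" + zone), new IntWritable(value));
    }
}

// Reducer: keep the maximum value per "YYYY-MM-DD/Zone" key
class DailyZoneMaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable v : values) {
            max = Math.max(max, v.get());
        }
        context.write(key, new IntWritable(max));
    }
}

Since taking the maximum is associative, the same reducer class could also be registered as a combiner to cut down shuffle traffic.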

Would this file be the input for a second M/R step? If so, what would I take as key and value?

I have studied the "Secondary Sort" example in "Hadoop - The Definitive Guide", but I'm not sure whether and how to apply this here.

Is it possible to M/R into several output-files (one per zone)?

UPDATE: After thinking about it, I will try this:

  • make the key a composite of zone-id and value (using an IntPair? see the sketch below), and
  • write a custom KeyComparator and GroupComparator
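
For reference, a minimal sketch of what such a zone-id/value composite key could look like, assuming both fields fit into an int. The class is only modelled after the book's IntPair; its natural order is ascending on both fields, and the descending value order would come from the custom KeyComparator:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Two-int composite key: zone id plus value, usable as a map output key.
public class IntPair implements WritableComparable<IntPair> {
    private int first;   // zone id
    private int second;  // value

    public IntPair() {}  // no-arg constructor required by Hadoop for deserialization
    public IntPair(int first, int second) { this.first = first; this.second = second; }

    public int getFirst()  { return first; }
    public int getSecond() { return second; }

    @Override public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }
    @Override public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }
    @Override public int compareTo(IntPair o) {          // natural order: both fields ascending
        int cmp = Integer.compare(first, o.first);
        return cmp != 0 ? cmp : Integer.compare(second, o.second);
    }
    @Override public int hashCode() { return first * 163 + second; }
    @Override public boolean equals(Object o) {
        return o instanceof IntPair && first == ((IntPair) o).first && second == ((IntPair) o).second;
    }
}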

Solution

You can do this with just one MR using secondary sorting. Here are the steps (a sketch putting them together follows the list):

  1. Define the key as the concatenation of zone, yyyy-mm-dd and the value: zone:yyyy-mm-dd:value. As explained below, you don't even need to emit any value from the mapper; NullWritable is good enough for the value.

  2. Implement a key comparator such that the zone:yyyy-mm-dd part of the key is ordered ascending and the value part is ordered descending. This ensures that, for all keys with a given zone:yyyy-mm-dd, the first key in the group has the highest value.

  3. Define the partitioner and the grouping comparator of the composite key based only on the zone and day parts of the key, i.e. zone:yyyy-mm-dd.

  4. In your reducer input, you will get the first key of each key group, which contains the zone, the day and the maximum value for that zone/day combination. The value part of the reducer input is a list of NullWritables and can be ignored.
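
Putting the four steps together, a minimal end-to-end sketch (not the answerer's original code): it encodes the composite key as a Text of the form zone:yyyy-mm-dd:value, emits NullWritable values, and assumes zone ids are plain integers:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ZoneDailyMax {

    // Step 1: composite key "zone:yyyy-mm-dd:value", NullWritable as the map output value.
    static class MaxMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String s = line.toString().trim();
            if (s.isEmpty() || s.startsWith("#")) return;            // skip blank/header lines
            String[] f = s.split(";");
            String day = f[0].trim().substring(0, 10);               // "yyyy-mm-dd"
            context.write(new Text(f[1].trim() + ":" + day + ":" + f[2].trim()), NullWritable.get());
        }
    }

    // Step 2: sort comparator -- zone and day ascending, value descending.
    static class SortComparator extends WritableComparator {
        protected SortComparator() { super(Text.class, true); }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String[] ka = a.toString().split(":");                   // [zone, day, value]
            String[] kb = b.toString().split(":");
            int cmp = Integer.compare(Integer.parseInt(ka[0]), Integer.parseInt(kb[0]));
            if (cmp == 0) cmp = ka[1].compareTo(kb[1]);
            if (cmp == 0) cmp = Integer.compare(Integer.parseInt(kb[2]), Integer.parseInt(ka[2]));
            return cmp;
        }
    }

    // Step 3: grouping comparator -- group on zone and day only, ignore the value part.
    static class GroupingComparator extends WritableComparator {
        protected GroupingComparator() { super(Text.class, true); }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String[] ka = a.toString().split(":");
            String[] kb = b.toString().split(":");
            int cmp = Integer.compare(Integer.parseInt(ka[0]), Integer.parseInt(kb[0]));
            return cmp != 0 ? cmp : ka[1].compareTo(kb[1]);
        }
    }

    // Step 3: partitioner -- all keys of one zone:day pair go to the same reducer.
    static class ZoneDayPartitioner extends Partitioner<Text, NullWritable> {
        @Override
        public int getPartition(Text key, NullWritable value, int numPartitions) {
            String[] k = key.toString().split(":");
            return ((k[0] + ":" + k[1]).hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Step 4: the first (and only visible) key of each group already carries the daily maximum.
    static class MaxReducer extends Reducer<Text, NullWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            String[] k = key.toString().split(":");                  // [zone, day, max value]
            context.write(new Text(k[0] + "\t" + k[1]), new IntWritable(Integer.parseInt(k[2])));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "zone daily max");
        job.setJarByClass(ZoneDailyMax.class);
        job.setMapperClass(MaxMapper.class);
        job.setReducerClass(MaxReducer.class);
        job.setSortComparatorClass(SortComparator.class);
        job.setGroupingComparatorClass(GroupingComparator.class);
        job.setPartitionerClass(ZoneDayPartitioner.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The sort comparator parses the value part as an integer so that, for example, 9 does not sort above 10 as it would under plain lexicographic Text comparison.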
