Java MapReduce sort composite value


Problem description

I have a mapper that emits a Text (fruit name) key and a custom composite value city:count. I want to sort the composite values by count before they arrive at the reducer, so that the reducer can quickly determine which city has the highest count.

The composite value class is an extension of WritableComparable and has methods for retrieving the count and city.

What my reducer currently receives:

reducer 1 - oranges:<london:2, chicago:15, charleston:6>
reducer 2 - apples:<charleston:31, london:3, chicago:29>
...

What I want my reducer to receive:

reducer 1 - oranges:<chicago:15, charleston:6, london:2>
reducer 2 - apples:<charleston:31, chicago:29, london:3>

Logically, how do I make this happen? I've read several articles on Secondary Sorting/Ordering, but they tend to focus on composite keys as opposed to composite values. My keys don't need further partitioning, nor do they need further sorting.

Again, sorting by a composite VALUE not a composite key!

Solution

If you are only aiming to quickly determine the city with the highest count for each fruit, I'd like to recommend another approach: sorting in most cases has a complexity of O(n log n), while finding the biggest entry is only O(n), where n is the number of cities in your case.
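To make the O(n) point concrete, here is a minimal plain-Java sketch of a single pass over the values of one fruit. `CityCount` and `MaxScan` are hypothetical names standing in for the composite value class from the question; this is not Hadoop API code, only the scan a reducer could run over its values.

```java
// Hypothetical stand-in for the question's composite value (city:count).
final class CityCount {
    final String city;
    final int count;

    CityCount(String city, int count) {
        this.city = city;
        this.count = count;
    }
}

final class MaxScan {
    // One pass over the values, so O(n) in the number of cities.
    // Returns null when there are no values at all.
    static CityCount max(Iterable<CityCount> values) {
        CityCount best = null;
        for (CityCount v : values) {
            if (best == null || v.count > best.count) {
                best = v;
            }
        }
        return best;
    }
}
```

One caveat if you move this into a real reducer: Hadoop reuses the value object while iterating, so you would copy the city and count out of it instead of keeping a reference to the object itself.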

1. Mapper with Memory

You can use a hash map in each mapper to track the highest count for each fruit within that mapper. Use the fruit as the key and city+count as the value. For each incoming record, look the fruit up in the map and keep whichever entry has the bigger count; if the fruit is not in the map yet, simply insert it. Once all map calls have run, the framework calls the cleanup method of your mapper, and there you can emit the entries of the map. This significantly reduces the number of values you have to send to the reducer and iterate over there.
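The bookkeeping described above could look roughly like this. It is a plain-Java sketch without any Hadoop types; `InMapperMax`, `Best`, `observe` and `drain` are hypothetical names. In a real job, the body of `observe` would live in your mapper's `map` method and the emitting would happen in `cleanup`.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the per-mapper bookkeeping; not Hadoop API code.
final class InMapperMax {
    // Hypothetical holder for the best city seen so far for one fruit.
    static final class Best {
        final String city;
        final int count;

        Best(String city, int count) {
            this.city = city;
            this.count = count;
        }
    }

    private final Map<String, Best> bestByFruit = new HashMap<>();

    // Would be called from map() for every input record.
    void observe(String fruit, String city, int count) {
        Best current = bestByFruit.get(fruit);
        // Insert the fruit if it is new, or replace the entry if this count is bigger.
        if (current == null || count > current.count) {
            bestByFruit.put(fruit, new Best(city, count));
        }
    }

    // Would be read in cleanup(): at most one entry per fruit gets emitted.
    Map<String, Best> drain() {
        return bestByFruit;
    }
}
```

The reducer still has to pick the maximum across mappers, but it then sees at most one value per fruit per mapper.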

2. Combiner

Approach 1 has one significant drawback: it does not scale if you have so many fruits that the map no longer fits into memory. In that case you can use a combiner, which is executed on the mapper side. It works like a reducer over the smaller set of data produced by the corresponding mapper. This also reduces the number of values you send to the reducer.
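A combiner only works here because taking a maximum is associative and commutative: the maximum of the partial maxima is the overall maximum. The sketch below shows that property in plain Java, with values encoded as "city:count" strings purely for illustration; `MaxCombine` is a hypothetical name and none of this is Hadoop API code.

```java
import java.util.List;

final class MaxCombine {
    // Parses the count out of a "city:count" string (illustrative encoding only).
    static int countOf(String cityCount) {
        return Integer.parseInt(cityCount.substring(cityCount.lastIndexOf(':') + 1));
    }

    // The same function can play both roles: combiner on the mapper side,
    // reducer on the reducer side. Returns null for an empty list.
    static String max(List<String> values) {
        String best = null;
        for (String v : values) {
            if (best == null || countOf(v) > countOf(best)) {
                best = v;
            }
        }
        return best;
    }
}
```

In a real job you would register the class with `job.setCombinerClass(...)`. Keep in mind that Hadoop may run the combiner zero, one or several times, so the reducer has to produce the right answer with or without it, and the combiner's output types have to match the mapper's output types.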

3. Secondary Ordering

You can also do it with secondary ordering. I'd really encourage you to read the article provided by Preeti Khurana, and especially the answer by Sudarshan. To give you a brief idea: use a composite key of fruit:count and keep city:count as the value. Be aware that you need a custom partitioner based only on the first part of the key. I think this takes a lot of effort, but in some cases it is useful and necessary.
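To sketch the three pieces secondary ordering needs, here is the logic in plain Java: a sort order (fruit ascending, then count descending), a grouping rule (same fruit means same reduce call) and a partitioning rule (only the fruit picks the reducer). `SecondarySortSketch`, `FruitCountKey`, `SORT`, `GROUP` and `partition` are hypothetical names; in a real job the key would be a `WritableComparable` and the three pieces would be plugged in through `job.setSortComparatorClass`, `job.setGroupingComparatorClass` and `job.setPartitionerClass`.

```java
import java.util.Comparator;

final class SecondarySortSketch {
    // Hypothetical composite key fruit:count; the value stays city:count.
    static final class FruitCountKey {
        final String fruit;
        final int count;

        FruitCountKey(String fruit, int count) {
            this.fruit = fruit;
            this.count = count;
        }
    }

    // Sort order: fruit ascending, then count descending,
    // so the highest count comes first within each fruit.
    static final Comparator<FruitCountKey> SORT = (a, b) -> {
        int byFruit = a.fruit.compareTo(b.fruit);
        return byFruit != 0 ? byFruit : Integer.compare(b.count, a.count);
    };

    // Grouping: keys with the same fruit are treated as equal,
    // which puts all their values into one reduce call.
    static final Comparator<FruitCountKey> GROUP = (a, b) -> a.fruit.compareTo(b.fruit);

    // Partitioning: only the fruit decides which reducer gets the record.
    static int partition(FruitCountKey key, int numReducers) {
        return (key.fruit.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

With these in place the first value the reducer sees for a fruit is the city with the highest count, at the cost of a custom key class, two comparators and a partitioner.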
