Extracting rows containing a specific value using MapReduce and Hadoop


Problem Description


I'm new to developing map-reduce functions. Consider I have a csv file containing four columns of data.

For example:

101,87,65,67  
102,43,45,40  
103,23,56,34  
104,65,55,40  
105,87,96,40  

Now, I want to extract, say,

40 102  
40 104  
40 105  

as those rows contain 40 in the fourth column.

How do I write the map-reduce function?

Solution

Basically, the WordCount example resembles very closely what you are trying to achieve. Instead of initializing a count for each word, you should have a condition that checks whether the tokenized String contains the required value, and only in that case write to the context. This will work, since the Mapper receives each line of the CSV separately.

Now the Reducer will receive the list of values, already organized per key. In the Reducer, instead of using IntWritable as the output value type, you can use NullWritable as the return value type, so your code will only output the keys. You also do not need the loop in the Reducer, since you only want to output the keys.

I do not provide any code in my answer, since you would learn nothing from that. Make your way from the recommendations.

EDIT: since you modified your question with a request for the Reducer, here are some tips on how you can achieve what you want.

One of the possibilities for achieving the desired result is: in the Mapper, after splitting (or tokenizing) the line, you write column 3 to the context as the key and column 0 as the value. Your Reducer, since you do not need any kind of aggregation, can simply write out the keys and values produced by the Mappers (yep, your Reducer code will end up being a single line of code). You can check one of my previous answers; the figure there explains quite well what the Map and Reduce phases are doing.
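
The answer above deliberately leaves the implementation as an exercise. Purely for orientation, here is a minimal, untested sketch of the approach described in the EDIT, written against the newer org.apache.hadoop.mapreduce API. It assumes comma-separated input read line by line by the default TextInputFormat and hard-codes the filter value 40; the class names and the FILTER_VALUE constant are illustrative and not part of the original answer.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExtractRows {

    // Illustrative constant: the value to look for in the fourth column.
    private static final String FILTER_VALUE = "40";

    public static class ColumnFilterMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Each call receives one CSV line, e.g. "102,43,45,40".
            String[] columns = line.toString().split(",");
            if (columns.length == 4 && columns[3].trim().equals(FILTER_VALUE)) {
                // Emit column 3 as the key and column 0 as the value: "40" -> "102".
                context.write(new Text(columns[3].trim()), new Text(columns[0].trim()));
            }
        }
    }

    public static class PassThroughReducer
            extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // No aggregation is needed: write every (key, value) pair straight through.
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "extract rows by column value");
        job.setJarByClass(ExtractRows.class);
        job.setMapperClass(ColumnFilterMapper.class);
        job.setReducerClass(PassThroughReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this would typically be launched with something along the lines of hadoop jar extract-rows.jar ExtractRows /path/to/input /path/to/output, and the part files in the output directory would then contain tab-separated lines such as 40 and 102, matching the sample output in the question.

If you instead followed the first part of the answer and only wanted to emit the keys (for example, column 0 as the key in the Mapper so that only the matching row ids come out), the NullWritable suggestion could look roughly like the sketch below. The driver would then also need job.setMapOutputValueClass(Text.class), job.setOutputValueClass(NullWritable.class) and an import of org.apache.hadoop.io.NullWritable.

public static class KeyOnlyReducer
        extends Reducer<Text, Text, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // The values are ignored, so no loop is needed: emit each distinct key once.
        context.write(key, NullWritable.get());
    }
}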
