Extracting rows containing specific value using mapReduce and hadoop


Problem description

I'm new to developing map-reduce functions. Consider that I have a CSV file containing four columns of data.

For example:

101,87,65,67  
102,43,45,40  
103,23,56,34  
104,65,55,40  
105,87,96,40  

Now, I want to extract, say:

40 102  
40 104  
40 105  

because those rows contain 40 in the fourth column.

How do I write the map and reduce functions for this?

Recommended answer

The basic WordCount example resembles very closely what you are trying to achieve. Instead of initializing a count for each word, you should have a condition that checks whether the tokenized string contains the required value, and only in that case write to the context. This will work, since the Mapper receives each line of the CSV separately.

Now the Reducer will receive the list of values, already organized per key. In the Reducer, instead of having IntWritable as the output value type, you can use NullWritable as the return value type, so your code will only output the keys. You also do not need the loop in the Reducer, since you only want to output the keys.

I am not providing any code in my answer, since you will learn nothing from that. Make your way from the recommendations.

Since you modified your request regarding the Reducer, here are some hints on how to implement what you want.

One possibility for achieving the desired result is: in the Mapper, after splitting (or tokenizing) the line, write column 3 to the context as the key and column 0 as the value. Your Reducer, since you do not need any kind of aggregation, can simply write out the keys and values produced by the Mappers (yes, your Reducer code will end up being a single line of code). You can check one of my previous answers; the figure there explains quite well what the Map and Reduce phases are doing.
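For reference only, here is a minimal sketch of how those hints might look with Hadoop's Java MapReduce API. The class names, the hardcoded target value 40, and the plain comma split are assumptions for illustration, not part of the original answer:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RowFilter {

    // Mapper: split the CSV line and, when the fourth column holds the target
    // value, emit (column 3, column 0) as suggested above.
    public static class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final String TARGET = "40"; // assumption: target value is hardcoded

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] columns = value.toString().split(",");
            if (columns.length == 4 && columns[3].trim().equals(TARGET)) {
                context.write(new Text(columns[3].trim()), new Text(columns[0].trim()));
            }
        }
    }

    // Reducer: no aggregation needed, just pass every incoming pair through,
    // producing lines such as "40  102".
    public static class FilterReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text rowId : values) {
                context.write(key, rowId);
            }
        }
    }
}

A driver that sets FilterMapper and FilterReducer on the Job, with Text as both output key and value class, completes the picture; since the Mapper already does the filtering, the Reducer does nothing beyond writing the pairs out.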
