Deciding key value pair for deduplication using hadoop mapreduce

Question

I want to implement deduplication of files using Hadoop MapReduce. I plan to do it by calculating the MD5 sum of every file present in the input directory in my mapper function. These MD5 hashes would be the keys to the reducer, so files with the same hash would go to the same reducer.
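The MD5 key itself can be computed with plain Java's `MessageDigest`; a minimal sketch of the hashing step (the class and method names here are my own, for illustration):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Key {
    // Compute the MD5 digest of a byte array and return it as a
    // lowercase hex string, suitable for use as the mapper's output key.
    public static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(data)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Identical file contents produce identical keys, so duplicates
        // group at the same reducer.
        System.out.println(md5Hex("hello".getBytes("UTF-8")));
    }
}
```

Because the hex string is deterministic for a given byte sequence, two byte-identical files always land on the same reducer regardless of which node hashed them.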

The default for the mapper in Hadoop is that the key is the byte offset of the line and the value is the line's content.

Also I read that if a file is big, it is split into chunks of 64 MB, which is the default block size in Hadoop.

How can I set the key values to be the names of the files, so that in my mapper I can compute the hash of the file? Also, how can I ensure that no two nodes compute the hash for the same file?

Answer

If you need the entire file as input to one mapper, then you need to keep isSplitable false. In this scenario you can take the whole file as input to the mapper, compute the MD5 of its contents, and emit that as the key.

WholeFileInputFormat (not part of the Hadoop codebase) can be used here. You can find an implementation online, or it is available in the book Hadoop: The Definitive Guide.
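A sketch of what such an input format looks like, along the lines of the one in Hadoop: The Definitive Guide (the class names follow the book, but this is an illustration, not the book's exact code):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split: each mapper sees exactly one whole file,
        // so no two mappers hash different chunks of the same file.
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // WholeFileRecordReader reads the entire file into a single
        // BytesWritable value (see the book / linked sources below).
        return new WholeFileRecordReader();
    }
}
```

Keeping isSplitable false also answers the second part of the question: since a file is never split, only the one mapper assigned its split ever hashes it.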

Value can be the file name. Calling getInputSplit() on the Context instance gives you the input split, which can be cast to a FileSplit. Then fileSplit.getPath().getName() yields the file name, which can be emitted as the value.
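Putting the pieces together, a mapper along these lines would emit (MD5-of-contents, file name). This is a sketch assuming the WholeFileInputFormat above feeds the whole file in as a BytesWritable; the class name DedupMapper is my own:

```java
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class DedupMapper
        extends Mapper<NullWritable, BytesWritable, Text, Text> {

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // The file name comes from the input split.
        FileSplit split = (FileSplit) context.getInputSplit();
        String fileName = split.getPath().getName();

        try {
            // MD5 of the whole file contents becomes the key, so files
            // with identical contents arrive at the same reducer.
            MessageDigest md = MessageDigest.getInstance("MD5");
            // Use getLength(): BytesWritable's backing array may be padded.
            md.update(value.getBytes(), 0, value.getLength());
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            context.write(new Text(hex.toString()), new Text(fileName));
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
    }
}
```

The reducer then receives, for each hash, the list of file names sharing that hash, which is exactly the duplicate groups you want.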

I have not worked with org.apache.hadoop.hdfs.util.MD5FileUtils, but the javadocs suggest it might work well for you.

Textbook source links for WholeFileInputFormat and the associated RecordReader are included for reference:

1) WholeFileInputFormat

2) WholeFileRecordReader

Also including the grepcode link to MD5FileUtils
