Hadoop: sort the key and change the key value


Problem Description

In Hadoop, the mapper receives its key as the record's position in the input file, e.g. "0, 23, 45, 76, 123"; these are byte offsets.

I have two large input files that I need to split in such a way that the same regions of each file (in terms of number of lines, e.g. 400 lines) get the same key. Byte offsets are clearly not the best option for that.

I was wondering if there is a way or an option to change the keys to consecutive integers, so the output keys would be "1, 2, 3, 4, 5" instead of "0, 23, 45, 76, 123"?

Thanks!

Recommended Answer

That is possible. If I understand you correctly, you want to index all records in increasing order.

I have done this. You can take advantage of the framework; it is similar to how we index work items when programming GPUs. In overview: split the file into splits with the same number of records per split. That lets you index any particular record. The formula, once the file is split, is

ActualIndex = splitNumber * Num_Of_Records_Per_Split + record_Offset
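As a quick sanity check of the formula, here is a minimal plain-Java sketch (the class and method names are illustrative, not part of any Hadoop API):

```java
public class ActualIndexDemo {
    // Global record index = the split's number, times the number of
    // records per split, plus the record's offset within that split.
    static long actualIndex(long splitNumber, long numRecordsPerSplit, long recordOffset) {
        return splitNumber * numRecordsPerSplit + recordOffset;
    }

    public static void main(String[] args) {
        // With 400 records per split, record 5 of split 2 is global record 805.
        System.out.println(actualIndex(2, 400, 5)); // prints 805
        // The first record of the first split has global index 0.
        System.out.println(actualIndex(0, 400, 0)); // prints 0
    }
}
```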

Now in detail. First, create the splits with NLineInputFormat, which lets you index records within a particular split. Emit each record with a key of the form splitId + recordIndex-within-split, and the actual record as the value. At the end of the Map phase the records are indexed within their split. Then use a custom sort comparator that orders the intermediate output by the splitId in the key, and a custom grouping comparator that groups all keys with the same splitId. In the reducer you can then apply the formula above to index the records. The remaining problem is identifying the splitNumbers in ascending order. I solved that as follows: Hadoop names each split as file_HDFS_URL/file_name:startOffset+length.

Example: hdfs://server:8020/file.txt:0+400, hdfs://server:8020/file.txt:400+700, and so on.
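The sort and grouping logic can be sketched without the Hadoop classes (in a real job these would be WritableComparator subclasses set via setSortComparatorClass and setGroupingComparatorClass). Assuming a composite key encoded as the string "splitId:recordIndex" (the encoding is illustrative; the answer only specifies splitId plus the record index), the sort comparator orders by the full key while the grouping comparator compares only the splitId part, so one reduce call sees all records of one split in order:

```java
import java.util.Comparator;

public class CompositeKeyComparators {
    // Composite key "splitId:recordIndex", e.g. "2:17".
    static int splitId(String key) {
        return Integer.parseInt(key.substring(0, key.indexOf(':')));
    }

    static int recordIndex(String key) {
        return Integer.parseInt(key.substring(key.indexOf(':') + 1));
    }

    // Sort comparator: order by splitId first, then by record index, so
    // records reach the reducer in global order. Comparing numerically
    // avoids the string pitfall of "10" sorting before "2".
    static final Comparator<String> SORT =
            Comparator.<String>comparingInt(CompositeKeyComparators::splitId)
                      .thenComparingInt(CompositeKeyComparators::recordIndex);

    // Grouping comparator: compare only the splitId, so all records of a
    // split are delivered to a single reduce() call.
    static final Comparator<String> GROUP =
            Comparator.comparingInt(CompositeKeyComparators::splitId);
}
```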

I created a file in HDFS that records the startOffset of every split, and then used it in the reducer. This way you get fully parallel record indexing.
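The mapping from a split name to its ascending splitNumber can be sketched in plain Java (no Hadoop dependency; the split-name format is the one quoted above, and the helper names are illustrative). Parse the startOffset out of "path:start+length", collect the offsets of all splits of the file, e.g. from the offsets file stored in HDFS, sort them, and a split's rank among the sorted offsets is its splitNumber:

```java
import java.util.Arrays;
import java.util.List;

public class SplitNumbering {
    // Extract the startOffset from a split name like
    // "hdfs://server:8020/file.txt:400+700" -> 400.
    static long startOffset(String splitName) {
        int plus = splitName.lastIndexOf('+');
        int colon = splitName.lastIndexOf(':', plus); // colon just before the offset
        return Long.parseLong(splitName.substring(colon + 1, plus));
    }

    // The ascending rank of a split's startOffset among all splits of the
    // same file is its splitNumber.
    static int splitNumber(long offset, List<Long> allOffsets) {
        long[] sorted = allOffsets.stream().mapToLong(Long::longValue).sorted().toArray();
        return Arrays.binarySearch(sorted, offset);
    }
}
```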

