Hadoop - how do map-reduce tasks know which part of a file to handle?


Question

I've started learning hadoop, and currently I'm trying to process log files that are not too well structured - in that the value I normally use for the M/R key is typically found at the top of the file (once). So basically my mapping function takes that value as the key and then scans the rest of the file to aggregate the values that need to be reduced. So a [fake] log might look like this:

## log.1
SOME-KEY
2012-01-01 10:00:01 100
2012-01-02 08:48:56 250
2012-01-03 11:01:56 212
.... many more rows

## log.2
A-DIFFERENT-KEY
2012-01-01 10:05:01 111
2012-01-02 16:46:20 241
2012-01-03 11:01:56 287
.... many more rows

## log.3
SOME-KEY
2012-02-01 09:54:01 16
2012-02-02 05:53:56 333
2012-02-03 16:53:40 208
.... many more rows

I want to accumulate the 3rd column for each key. I have a cluster of several nodes running this job, and so I was bothered by several issues:

1. File Distribution

Given that hadoop's HDFS works in 64Mb blocks (by default), and every file is distributed over the cluster, can I be sure that the correct key will be matched against the proper numbers? That is, if the block containing the key is in one node, and a block containing data for that same key (a different part of the same log) is on a different machine - how does the M/R framework match the two (if at all)?

2. Block Assignment

For text logs such as the ones described, how is each block's cutoff point decided? Is it after a row ends, or exactly at 64Mb (binary)? Does it even matter? This relates to my #1, where my concern is that the proper values are matched with the correct keys over the entire cluster.

3. File structure

What is the optimal file structure (if any) for M/R processing? I'd probably be far less worried if a typical log looked like this:

A-DIFFERENT-KEY 2012-01-01 10:05:01 111
SOME-KEY        2012-01-02 16:46:20 241
SOME-KEY        2012-01-03 11:01:56 287
A-DIFFERENT-KEY 2012-02-01 09:54:01 16
A-DIFFERENT-KEY 2012-02-02 05:53:56 333
A-DIFFERENT-KEY 2012-02-03 16:53:40 208
...

However, the logs are huge and it would be very costly (in time) to convert them to the above format. Should I be concerned?
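For reference, with the per-line layout above the job collapses to a standard sum-per-key pattern. A rough sketch, assuming whitespace-separated fields and the four-field layout of the sample lines (class and field names are illustrative only):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SumPerKey {

    // Map: parse "KEY yyyy-MM-dd HH:mm:ss value" and emit (KEY, value).
    public static class SumMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().trim().split("\\s+");
            if (f.length == 4) {  // skip malformed lines
                ctx.write(new Text(f[0]), new LongWritable(Long.parseLong(f[3])));
            }
        }
    }

    // Reduce: sum all values seen for a key.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            ctx.write(key, new LongWritable(sum));
        }
    }
}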

4. Job Distribution

Are the jobs assigned such that only a single JobClient handles an entire file? Or rather, how are the keys/values coordinated between all the JobClients? Again, I'm trying to guarantee that my shady log structure still yields correct results.

Solution

Given that hadoop's HDFS works in 64Mb blocks (by default), and every file is distributed over the cluster, can I be sure that the correct key will be matched against the proper numbers? That is, if the block containing the key is in one node, and a block containing data for that same key (a different part of the same log) is on a different machine - how does the M/R framework match the two (if at all)?

How the keys and values are mapped depends on the InputFormat class. Hadoop ships with several built-in InputFormat classes, and custom InputFormat classes can also be defined.

If a FileInputFormat such as the default TextInputFormat is used, then the key to the mapper is the file offset of the line and the value is the line itself. In most cases the file offset is ignored and only the value, i.e. a line of the input file, is processed by the mapper. So, by default, each line in the log file becomes a value passed to the mapper.
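A minimal sketch of the mapper's view under that default, assuming the newer org.apache.hadoop.mapreduce API (output types and parsing are placeholders):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With the default line-oriented input, map() is called once per line:
// 'offset' is the byte offset of the line within the file (usually ignored),
// 'line' is the text of the line itself.
public class LogLineMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Parse 'line' and emit (key, value) pairs via context.write(...).
    }
}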

There might be cases, as in the OP, where related data in a log file is split across blocks; each block will then be processed by a different mapper and Hadoop cannot relate them. One way around this is to let a single mapper process the complete file by using the FileInputFormat#isSplitable method, as sketched below. This is not an efficient approach if the files are very large.
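A sketch of that approach, using a custom format name of my own choosing: returning false from isSplitable means each file becomes a single input split, so one mapper sees all of its lines in order and can remember the key found on the first line.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Never split files: each log file is handed to exactly one mapper,
// which therefore sees the key line and all of its data rows together.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

The driver would then register it with job.setInputFormatClass(WholeFileTextInputFormat.class). The trade-off noted above remains: a multi-gigabyte file would be read by a single mapper with no parallelism.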

For text logs such as the ones described, how is each block's cutoff point decided? Is it after a row ends, or exactly at 64Mb (binary)? Does it even matter? This relates to my #1, where my concern is that the proper values are matched with the correct keys over the entire cluster.

Each block in HDFS is by default exactly 64MB in size, unless the file is smaller than 64MB or the default block size has been modified; record boundaries are not considered. Part of a line can end up in one block and the rest in another. Hadoop's record readers understand record boundaries, so even if a record (line) is split across blocks it will still be processed by a single mapper only. For this, some data may need to be transferred from the next block.

Are the jobs assigned such that only a single JobClient handles an entire file? Or rather, how are the keys/values coordinated between all the JobClients? Again, I'm trying to guarantee that my shady log structure still yields correct results.

It's not exactly clear what this question is asking. I would suggest going through some tutorials and coming back with more specific questions.
