Record Reader and Record Boundaries


Problem Description


Suppose I have one input file, and three blocks are created in HDFS for this file. Assume I have three data nodes and that each data node stores one block. If I have 3 input splits, 3 mappers will run in parallel to process the data local to their respective data nodes. Each mapper gets its input as key-value pairs via an InputFormat and a RecordReader. Assume this scenario uses TextInputFormat, where each record is a complete line of text from the file.

The question here is: what happens if a record breaks across the end of the first block?

1) How does Hadoop read the complete record in this scenario?

2) Does data node 1 contact data node 2 to get the complete record?

3) What happens if data node 2 starts processing the data and identifies an incomplete record in its first line?
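
For concreteness, here is a minimal sketch of the setup the question describes (illustrative only; the class name is made up, not from the original post). With TextInputFormat, the framework's LineRecordReader hands each mapper records whose key is the starting byte offset of the line within the file and whose value is the line itself:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // With TextInputFormat, each record arrives as
    // (byte offset of the line's first character, full line of text).
    public class LineMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // The offset key makes it visible that records are delimited by
            // line boundaries, not by HDFS block boundaries.
            context.write(offset, line);
        }
    }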

Recommended Answer


  1. Hadoop will continue to read past the end of the first block until an EOL character or EOF is reached.
  2. Data nodes do not communicate with each other outside of data replication (which happens only when instructed by the name node). Instead, the HDFS client will read the data from node 1 and then from node 2.
  3. Some examples to clarify (a small simulation of this logic follows the list):
    • If you have a single-line record spanning a 300MB file with a 128MB block size, mappers 2 and 3 will start reading from their given split offsets in the file (128MB and 256MB respectively). Both will skip forward trying to find the next EOL character and start their records from that point. Since there is no EOL before EOF, both mappers will actually process 0 records.
    • A 300MB file with two lines, each 150MB in length, and a 128MB block size: mapper 1 will process the first line, finding its EOL character inside block 2. Mapper 2 will start from offset 128MB (block 2) and scan forward to the EOL character at offset 150MB; it will then read on until it finds EOF at the end of block 3, and so it processes the second line. Mapper 3 will start at offset 256MB (block 3) and scan forward to EOF without hitting an EOL character, and hence process 0 records.
    • A 300MB file with 6 lines, each 50MB in length:
      • mapper 1 - offset 0 -> 128MB, lines 1 (0->50), 2 (50->100), 3 (100->150)
      • mapper 2 - offset 128MB -> 256MB, lines 4 (150->200), 5 (200->250), 6 (250->300)
      • mapper 3 - offset 256MB -> 300MB, 0 lines
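
To make the offset arithmetic in the examples concrete, here is a small self-contained simulation (my sketch, not Hadoop source code). It models the rule described above: the split starting at byte 0 owns the line at offset 0; every other line is owned by the split in whose range (exclusive of the split start, inclusive of the split end) it begins, because each reader skips the partial first line of its split and reads past its split end to finish the last line it started:

    public class SplitSimulation {

        // Prints, for each mapper's split, which lines (1-based) it would process.
        // Sizes are given in MB for readability.
        static void simulate(String label, long blockSizeMb, long... lineLengthsMb) {
            long fileSizeMb = 0;
            for (long len : lineLengthsMb) fileSizeMb += len;

            System.out.println(label);
            int mapper = 1;
            for (long start = 0; start < fileSizeMb; start += blockSizeMb, mapper++) {
                long end = Math.min(start + blockSizeMb, fileSizeMb);
                StringBuilder owned = new StringBuilder();
                long pos = 0; // offset at which the current line starts
                for (int i = 0; i < lineLengthsMb.length; i++) {
                    // First split owns the line at offset 0; otherwise a line is
                    // owned by the split whose range (start, end] contains its start.
                    if ((start == 0 && pos == 0) || (pos > start && pos <= end)) {
                        owned.append(" line ").append(i + 1);
                    }
                    pos += lineLengthsMb[i];
                }
                System.out.printf("  mapper %d (offset %dMB -> %dMB):%s%n",
                        mapper, start, end,
                        owned.length() == 0 ? " 0 lines" : owned);
            }
        }

        public static void main(String[] args) {
            simulate("300MB file, one 300MB line:", 128, 300);
            simulate("300MB file, two 150MB lines:", 128, 150, 150);
            simulate("300MB file, six 50MB lines:", 128, 50, 50, 50, 50, 50, 50);
        }
    }

Running it reproduces the three bullets above: in the first case only mapper 1 gets a record, in the second mappers 1 and 2 each get one line and mapper 3 gets none, and in the third the lines split 3 / 3 / 0 across the mappers.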

Hope this helps.
