Hadoop FileSplit reading


Question

Assume a client application that uses a FileSplit object in order to read the actual bytes from the corresponding file.

To do so, an InputStream object has to be created from the FileSplit, via code like:

    FileSplit split = ... // The FileSplit reference
    FileSystem fs   = ... // The HDFS reference

    FSDataInputStream fsin = fs.open(split.getPath());

    long start = split.getStart() - 1; // The byte before the split's first byte

    if (start >= 0)
    {
        fsin.seek(start);
    }

The adjustment of the stream by -1 is present in some scenarios, like the Hadoop MapReduce LineRecordReader class. However, the documentation of the FSDataInputStream seek() method says explicitly that, after seeking to a location, the next read will be from that location, meaning (?) that the code above would be 1 byte off (?).

So, the question is, would that "-1" adjustment be necessary for all InputSplit reading cases?

By the way, if one wants to read a FileSplit correctly, seeking to its start is not enough, because every split also has an end that may not be identical to the end of the actual HDFS file. So, the corresponding InputStream should be "bounded", i.e. have a maximum length, like the following:

    InputStream is = new BoundedInputStream(fsin, split.getLength());

In this case, after the "native" fsin stream has been created above, the org.apache.commons.io.input.BoundedInputStream class is used to implement the "bounding".
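Putting the two parts together, a minimal sketch of such a bounded split reader (assuming the split boundaries are used as-is, with no -1 adjustment) might look like this:

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.commons.io.input.BoundedInputStream;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.mapred.FileSplit;

    public class BoundedSplitReader {
        // Opens the split's file, positions the stream at the split's first
        // byte, and bounds the stream to the split's length.
        public static InputStream openSplit(FileSystem fs, FileSplit split)
                throws IOException {
            FSDataInputStream fsin = fs.open(split.getPath());
            fsin.seek(split.getStart());
            return new BoundedInputStream(fsin, split.getLength());
        }
    }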

UPDATE

Apparently, the adjustment is necessary only for use cases like the one of the LineRecordReader class, which exceeds the boundaries of a split to make sure that it reads the full last line.

A good discussion with more details on this can be found in an earlier question and in the comments for MAPREDUCE-772.

Solution

Seeking to position 0 will mean the next call to InputStream.read() will read byte 0. Seeking to position -1 will most probably throw an exception.
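For instance, a minimal illustration of that contract, assuming fs and path are already available:

    FSDataInputStream fsin = fs.open(path);
    fsin.seek(0);            // position the stream at byte 0
    int first = fsin.read(); // returns byte 0 of the file
    // fsin.seek(-1);        // a negative position would throw an exception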

Where specifically are you referring to when you talk about the standard pattern in examples and source code?

Splits are not necessarily bounded as you note - take TextInputFormat for example and files that can be split. The record reader that processes the split will:

• Seek to the start index, then find the next newline character
• Find the next newline character (or EOF) and return that 'line' as the next record

This repeats until either the next newline found is past the end of the split, or the EOF is found. So you see that in this case the actual bounds of a split might be right-shifted from those given by the InputSplit.
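As a rough sketch (not the actual LineRecordReader source, and ignoring compression and the -1 adjustment discussed above), that loop could look like the following, reusing fsin and split from the question and assuming a Configuration named conf:

    // Simplified sketch of the split-reading loop described above.
    long pos = split.getStart();
    long end = split.getStart() + split.getLength();
    fsin.seek(pos);
    LineReader reader = new LineReader(fsin, conf); // org.apache.hadoop.util.LineReader
    Text line = new Text();
    if (pos != 0) {
        // Discard the (possibly partial) first line; the previous split's
        // reader is responsible for the line crossing the boundary.
        pos += reader.readLine(line);
    }
    while (pos <= end) { // the last line read may extend past 'end'
        int bytesRead = reader.readLine(line);
        if (bytesRead == 0) {
            break; // EOF
        }
        pos += bytesRead;
        // 'line' now holds the next record
    }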

Update

Referencing this code block from LineRecordReader:

    if (codec != null) {
      in = new LineReader(codec.createInputStream(fileIn), job);
      end = Long.MAX_VALUE;
    } else {
      if (start != 0) {
        skipFirstLine = true;
        --start;
        fileIn.seek(start);
      }
      in = new LineReader(fileIn, job);
    }
    if (skipFirstLine) {  // skip first line and re-establish "start".
      start += in.readLine(new Text(), 0,
                           (int)Math.min((long)Integer.MAX_VALUE, end - start));
    }
    

The --start statement is most probably there to avoid the split starting on a newline character and returning an empty line as the first record. You can see that if the seek occurs, the first line is skipped, to ensure that the file splits don't return overlapping records.
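As a constructed illustration (not from the original post): take a file whose bytes are "aa\nbb\ncc\n" and a split boundary at offset 3, i.e. exactly at the first byte of "bb". Without the --start, the second split's reader would seek to offset 3 and skip "bb\n" as its supposedly partial first line, and "bb" could be dropped entirely. With the --start, the reader seeks to offset 2, the skipped "line" is just the single \n terminating "aa", and the second reader correctly begins with "bb".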
