How to specify tab as a record separator for a Hadoop input text file?
Problem description
The input file to my Hadoop M/R job is a text file in which the records are separated by the tab character '\t' instead of the newline '\n'. By default Hadoop splits on newlines, so each line in the text file is taken as a record. How can I instruct Hadoop to split on the tab character instead?
One way to do it is to use a custom input format class that uses a filter stream to convert all tabs in the original stream to newlines. But this does not look elegant.
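For illustration, the filter-stream idea could be sketched in plain Java as below. The class names `TabToNewlineInputStream` and `TabToNewlineDemo` are hypothetical, nothing here is a Hadoop API, and the sketch assumes the data itself contains no newlines (any existing newline would also end a record, which is part of why this approach feels inelegant).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter stream: rewrites every tab byte as a newline byte. */
class TabToNewlineInputStream extends FilterInputStream {
    TabToNewlineInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        return b == '\t' ? '\n' : b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of stream, in which case the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (buf[i] == '\t') {
                buf[i] = '\n';
            }
        }
        return n;
    }
}

class TabToNewlineDemo {
    /** Reads tab-separated records by letting a line-based reader see newlines. */
    static List<String> readRecords(String text) {
        InputStream raw = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        List<String> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new TabToNewlineInputStream(raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(readRecords("alpha\tbeta\tgamma"));
    }
}
```

Because each tab is replaced by exactly one byte, byte offsets in the stream are unchanged, which matters if the wrapped stream is later handed to a reader that tracks positions.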
Another way would be to use java.util.Scanner with tab as the separator. But I cannot figure out how to use the java.util.Scanner class in the input format classes.
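Outside Hadoop, the Scanner idea on its own looks roughly like the sketch below. The class name `ScannerTabRecords` is hypothetical. It shows only the tokenising step: a real Hadoop record reader would also have to respect split boundaries and report byte offsets, which Scanner does not expose, so this is not a drop-in input format.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerTabRecords {
    /** Splits the stream into records on the tab character only. */
    static List<String> readRecords(InputStream in) {
        List<String> records = new ArrayList<>();
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            scanner.useDelimiter("\t"); // the delimiter is a regex; "\t" matches one tab
            while (scanner.hasNext()) {
                records.add(scanner.next());
            }
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "a\tb\nc\td".getBytes(StandardCharsets.UTF_8);
        // Newlines are not delimiters here, so "b\nc" stays a single record.
        for (String record : readRecords(new ByteArrayInputStream(data))) {
            System.out.println("[" + record + "]");
        }
    }
}
```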
What is the best approach, and what are the alternatives?
Answer
The values '\r' and '\n' are hard-coded in the org.apache.hadoop.util.LineReader class, so you can't use TextInputFormat with tab-separated records. But it is not difficult to implement your own InputFormat with a special LineReader class. The simplest solution is to copy the TextInputFormat, LineRecordReader and LineReader classes into your own package and change the LineReader implementation.
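To give a feel for the kind of change involved, here is a standalone sketch of a reader whose record delimiter is a parameter rather than a hard-coded newline. `DelimitedRecordReader` is a hypothetical class, not Hadoop's LineReader: the real class works on byte buffers and Text objects and has to cooperate with input splits, none of which is modelled here. The sketch also assumes a single-byte delimiter.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader: like a line reader, but the delimiter byte is configurable. */
class DelimitedRecordReader {
    private final InputStream in;
    private final int delimiter;

    DelimitedRecordReader(InputStream in, char delimiter) {
        this.in = new BufferedInputStream(in);
        this.delimiter = delimiter;
    }

    /** Returns the next record without its delimiter, or null at end of stream. */
    String readRecord() {
        ByteArrayOutputStream record = new ByteArrayOutputStream();
        boolean sawAnyByte = false;
        try {
            int b;
            while ((b = in.read()) != -1) {
                sawAnyByte = true;
                if (b == delimiter) {
                    break; // end of this record
                }
                record.write(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!sawAnyByte) {
            return null; // nothing left to read
        }
        return new String(record.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Convenience helper for the demo: reads every record from a string. */
    static List<String> readAll(String text, char delimiter) {
        DelimitedRecordReader reader = new DelimitedRecordReader(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)), delimiter);
        List<String> records = new ArrayList<>();
        String record;
        while ((record = reader.readRecord()) != null) {
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(readAll("one\ttwo\tthree", '\t'));
    }
}
```

Separately, later Hadoop releases document a `textinputformat.record.delimiter` configuration property for TextInputFormat. If your version supports it, setting it to a tab may avoid copying the classes at all; check the documentation for your release before relying on it.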