Using FileFormat v Serde to read custom text files
Hadoop/Hive newbie here. I am trying to use data stored in a custom text-based format with Hive. My understanding is that you can write either a custom FileFormat or a custom SerDe class to do that. Is that the case, or am I misunderstanding it? And what are some general guidelines on which option to choose when? Thanks!
I figured it out. I did not have to write a SerDe after all; instead I wrote a custom InputFormat (extending org.apache.hadoop.mapred.TextInputFormat) that returns a custom RecordReader (implementing org.apache.hadoop.mapred.RecordReader<K, V>). The RecordReader implements the logic to read and parse my files and returns tab-delimited rows.
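As an illustration of that parsing step, here is a minimal sketch of the kind of conversion a custom RecordReader's next() might perform. The input format here (semicolon-separated key=value pairs) and all class and field names are assumptions for the example, not the original files' actual format:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the parsing logic a custom RecordReader could apply per record.
// The "key=value;key=value" layout is a made-up example format; the real
// logic would parse whatever the custom files actually contain.
public class CustomRecordParser {

    // Convert one raw record into the tab-delimited row that Hive's
    // default delimited SerDe expects (matching FIELDS TERMINATED BY '\t').
    public static String toTabDelimited(String rawRecord) {
        List<String> values = new ArrayList<>();
        for (String pair : rawRecord.split(";")) {
            int eq = pair.indexOf('=');
            // Keep only the value part; a pair without '=' becomes an empty field.
            values.add(eq >= 0 ? pair.substring(eq + 1) : "");
        }
        return String.join("\t", values);
    }

    public static void main(String[] args) {
        // Example: two fields become one tab-separated row.
        System.out.println(toTabDelimited("field1=abc;field2=1.5"));
    }
}
```

In the real RecordReader, this conversion would happen inside next(), writing the tab-delimited result into the Text value that Hive then splits on '\t'.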
With that, I declared my table as

create table t2 (
  field1 string,
  ..
  fieldNN float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT 'namespace.CustomFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
This uses Hive's native SerDe. Also, an output format must be specified whenever a custom input format is used, so I chose one of the built-in output formats.