Custom InputFormat with Hive

Posted: 2018/5/31 19:22:13 Tags: hadoop, hive


Problem description

Update: It turns out the code below fails because my InputFormat was written against the newer API (org.apache.hadoop.mapreduce), whereas Hive expects the older one (org.apache.hadoop.mapred). The remaining problem is porting my existing code to the old API. Has anyone written a multi-line InputFormat using the old API?



I'm trying to process Omniture's data log files with Hadoop/Hive. The file format is tab-delimited and mostly simple, but a field may contain newlines and tabs, each escaped by a backslash (\\n and \\t). I therefore wrote my own InputFormat that treats escaped newlines as part of the record and converts escaped tabs to spaces, so that Hive's split on tabs still yields the right fields. When I tried loading some sample data into a Hive table, I got the following error:
CREATE TABLE (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
STORED AS INPUTFORMAT 'OmnitureDataFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat';

FAILED: Error in semantic analysis: line 1:14 Input Format must implement InputFormat omniture_hit_data
The odd thing is that my input format does extend org.apache.hadoop.mapreduce.lib.input.TextInputFormat (https://gist.github.com/4a380409cd1497602906).  

Does Hive require that you extend org.apache.hadoop.hive.ql.io.HiveInputFormat instead? If so, do I have to rewrite any of my existing class code for the InputFormat and RecordReader or can I effectively just change the class it's extending?

Solution

Figured this out after looking at the code for LineReader and TextInputFormat.  Created a new InputFormat to deal with this as well as an EscapedLineReader.

https://github.com/msukmanowsky/OmnitureDataFileInputFormat
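
The record-splitting idea behind an escaped line reader can be sketched in plain Java, without any Hadoop classes. This is not the code from the repository above; the class name EscapedRecordSketch and its methods are hypothetical, and it rests on three assumptions: records end at an unescaped "\n" (Unix line endings), a backslash directly before a tab or newline marks that character as field content, and both escaped characters are replaced by a space (the question only states this for tabs).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the author's EscapedLineReader.
final class EscapedRecordSketch {

    // Reads one logical record, or returns null at end of input.
    // An unescaped '\n' ends the record. A backslash followed by a tab or
    // newline is replaced by a single space (assumption, see lead-in).
    // Any other backslash sequence is copied through unchanged.
    public static String readRecord(Reader in) throws IOException {
        StringBuilder record = new StringBuilder();
        boolean escaped = false;   // true when the previous char was a backslash
        boolean sawAny = false;    // distinguishes an empty record from end of input
        int c;
        while ((c = in.read()) != -1) {
            sawAny = true;
            if (escaped) {
                if (c == '\n' || c == '\t') {
                    record.append(' ');
                } else {
                    record.append('\\').append((char) c);
                }
                escaped = false;
            } else if (c == '\\') {
                escaped = true;
            } else if (c == '\n') {
                return record.toString();
            } else {
                record.append((char) c);
            }
        }
        if (escaped) {
            record.append('\\');   // lone trailing backslash at end of input
        }
        return sawAny ? record.toString() : null;
    }

    // Convenience wrapper: splits a whole string into logical records.
    public static List<String> parse(String text) {
        List<String> records = new ArrayList<>();
        try (Reader in = new StringReader(text)) {
            String record;
            while ((record = readRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

Each string returned by parse can then be split on tabs, as Hive does with FIELDS TERMINATED BY '\t'. In a real InputFormat the same loop would sit inside a RecordReader written against the old org.apache.hadoop.mapred API and would also have to deal with split boundaries, which this sketch ignores.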
