Handling newline characters in Hive


Question

I have created a table in Hive as follows:

Create table(id int, Description String)  

My data looks like this:

 
1|This will return corrupt data since there is a ',' in the first string.
     some text
     Change the data  
2|There is prob in reading data 
    sometext

Because Hive's default line terminator is \n, the Description column is not read correctly after the data is loaded, and Hive displays NULL values. Can anyone suggest how to handle the newlines before loading the data into Hive?

Recommended answer

I know this question is old, but you have a couple of options. You can't control this with FIELDS TERMINATED BY, because that only controls what terminates the fields, not the records. Records in Hive are hard-coded to be terminated by the newline character (even though there is a LINES TERMINATED BY clause, it is not implemented).


  1. Write a custom InputFormat that uses a RecordReader that understands non-newline delimited records. Look at the code for LineReader/LineRecordReader and TextInputFormat.
  2. Use a format other than text/ASCII, like Parquet. I would recommend this regardless, as text is probably the worst format you can store data in anyway.
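
To make the first option more concrete, here is a minimal sketch of the record-boundary rule such a reader would need, written in plain Python rather than against the Hadoop API. It assumes (this is not stated in the question) that every record begins with an integer id followed by `|`, and that any other non-empty line continues the previous record. The names `merge_records` and `RECORD_START` are hypothetical. The same logic could also be run as a preprocessing step so that each record sits on one line before the file is loaded.

```python
import re
from typing import Iterable, Iterator

# Assumption: a new record always starts with an integer id followed by '|'.
RECORD_START = re.compile(r"^\d+\|")


def merge_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one logical record per item, folding continuation lines
    into the record that precedes them."""
    current = None
    for raw in lines:
        line = raw.strip()
        if RECORD_START.match(line):
            # A new record begins: emit the one collected so far.
            if current is not None:
                yield current
            current = line
        elif current is not None and line:
            # Continuation line: join with a space instead of a newline.
            current += " " + line
        # Blank lines, and lines seen before the first record, are skipped.
    if current is not None:
        yield current


# Sample input copied from the question.
sample = [
    "1|This will return corrupt data since there is a ',' in the first string.\n",
    "     some text\n",
    "     Change the data  \n",
    "2|There is prob in reading data \n",
    "    sometext\n",
]
records = list(merge_records(sample))
for record in records:
    print(record)
```

With the sample data from the question, this produces two single-line records, one per id. If the description text itself can contain a line that starts with digits and `|`, this rule will split it incorrectly, so a stricter start-of-record pattern would be needed.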
