Handling newline characters in Hive
Question
I have created a table in Hive as follows:
Create table(id int, Description String)
My data looks like this:
1|This will return corrupt data since there is a ',' in the first string.
some text
Change the data
2|There is prob in reading data
sometext
Because the default line terminator is \n, Hive cannot read the Description column correctly after the data is loaded, and it displays NULL values. Can anyone suggest how to handle the newlines before loading the data into Hive?
Recommended answer
I know this question is old, but you have a couple of options. You can't control this with FIELDS TERMINATED BY, because that only controls what terminates the fields, not the records. Records in Hive are hard-coded to be terminated by the newline character (even though there is a LINES TERMINATED BY clause, it is not implemented).
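To make the symptom concrete, here is a small Python simulation of a line-oriented reader applied to the sample data from the question. It is only an illustration under assumptions: the function name parse_like_text_table is hypothetical, and this is not Hive's actual reader code. It simply splits on newlines first and on '|' second, which is enough to show where the NULL values come from.

```python
# Illustrative simulation only; this is NOT Hive's actual reader code.
raw = (
    "1|This will return corrupt data since there is a ',' in the first string.\n"
    "some text\n"
    "Change the data\n"
    "2|There is prob in reading data\n"
    "sometext\n"
)


def parse_like_text_table(text):
    """Split on newlines first, then on '|', as a line-oriented reader would."""
    rows = []
    for line in text.splitlines():
        parts = line.split("|", 1)
        try:
            row_id = int(parts[0])  # the id column is declared as int
        except ValueError:
            row_id = None  # unparseable value, comparable to NULL
        # A line without '|' has no second field, comparable to NULL.
        description = parts[1] if len(parts) > 1 else None
        rows.append((row_id, description))
    return rows


rows = parse_like_text_table(raw)
for row in rows:
    print(row)
```

The two logical records become five rows, and the three continuation lines end up with no usable id or description.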
- Write a custom InputFormat that uses a RecordReader that understands non-newline delimited records. Look at the code for LineReader/LineRecordReader and TextInputFormat.
- Use a format other than text/ASCII, like Parquet. I would recommend this regardless, as text is probably the worst format you can store data in anyway.
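Neither option above is sketched here. Since the question asks how to handle the newlines before loading, a third possibility, which is not part of the original answer, is to preprocess the file so that each record sits on one physical line. The Python sketch below assumes that every real record starts with an integer id followed by '|' and that any other line is a continuation of the previous record. The names RECORD_START and merge_continuation_lines are hypothetical.

```python
import re

# Assumption: every real record starts with an integer id followed by '|'.
RECORD_START = re.compile(r"^\d+\|")


def merge_continuation_lines(lines, replacement=" "):
    """Join physical lines into one logical record per id.

    Embedded newlines are replaced by `replacement` so that each
    record occupies exactly one line in the output.
    """
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if RECORD_START.match(line) or not records:
            records.append(line)  # start of a new record
        else:
            records[-1] += replacement + line  # continuation of the previous one
    return records


sample = [
    "1|This will return corrupt data since there is a ',' in the first string.\n",
    "some text\n",
    "Change the data\n",
    "2|There is prob in reading data\n",
    "sometext\n",
]

records = merge_continuation_lines(sample)
for record in records:
    print(record)
```

The cleaned records can then be written out one per line and loaded as usual. Note the limitation: a continuation line that itself happens to start with digits followed by '|' would be mistaken for a new record, so this heuristic only fits data where that cannot occur.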