Handling newline characters in Hive


Problem description

I have created a table in Hive as follows:

Create table(id int, Description String)  

My data looks something like this:

 
1|This will return corrupt data since there is a ',' in the first string.
     some text
     Change the data  
2|There is prob in reading data 
    sometext

After the data is loaded into Hive, the Description column cannot be read because the default line terminator is \n, so Hive displays a NULL value. Can anyone suggest how to handle the newlines before loading the data into Hive?

Solution

I know this question is old, but you have a couple of options. You can't control this with FIELDS TERMINATED BY, because that only controls what terminates the fields, not the records. Records in Hive text tables are hard-coded to be terminated by the newline character (there is a LINES TERMINATED BY clause, but it is not implemented for anything other than the newline).

  1. Write a custom InputFormat that uses a RecordReader that understands non-newline delimited records. Look at the code for LineReader/LineRecordReader and TextInputFormat.
  2. Use a format other than text/ASCII, such as Parquet. I would recommend this regardless, as text is probably the worst format you can store data in.
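
Whichever option you choose, something has to decide where one record ends and the next begins. The following is a minimal Python sketch of that boundary logic, not Hive or Hadoop code. It assumes (the question does not say so) that every real record starts with an integer id followed by `|`, and that any other line is a continuation of the previous description. The names `RECORD_START`, `naive_rows` and `merge_records` are hypothetical. `naive_rows` roughly mimics one-row-per-physical-line parsing to show where the NULLs come from, and `merge_records` folds continuation lines back into their record, which could also serve as a cleanup step before loading.

```python
import re
from typing import Iterable, Iterator, List, Optional, Tuple

# Assumption: a logical record begins with an integer id followed by '|'.
RECORD_START = re.compile(r"^\d+\|")

SAMPLE = [
    "1|This will return corrupt data since there is a ',' in the first string.\n",
    "     some text\n",
    "     Change the data  \n",
    "2|There is prob in reading data \n",
    "    sometext\n",
]


def naive_rows(lines: Iterable[str]) -> List[Tuple[Optional[int], Optional[str]]]:
    """Treat every physical line as a row of (id int, description string)."""
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split("|", 1)
        try:
            row_id: Optional[int] = int(parts[0])
        except ValueError:
            row_id = None  # a non-numeric id becomes NULL
        # A line without a '|' has no second field, so it becomes NULL too.
        description = parts[1] if len(parts) > 1 else None
        rows.append((row_id, description))
    return rows


def merge_records(lines: Iterable[str], joiner: str = " ") -> Iterator[str]:
    """Yield one line per logical record, folding continuation lines in."""
    current: Optional[str] = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if RECORD_START.match(line):
            if current is not None:
                yield current
            current = line.strip()
        elif current is not None and line.strip():
            # Continuation line: append it to the open record.
            current += joiner + line.strip()
    if current is not None:
        yield current


if __name__ == "__main__":
    for record in merge_records(SAMPLE):
        print(record)
```

With the sample data from the question, `naive_rows` produces five rows, three of which are entirely NULL, while `merge_records` produces two single-line records. The heuristic breaks if a continuation line itself starts with digits and a `|`, so the record-start pattern has to be adapted to the real data.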
