Backslash character is not read by Hive when using OpenCSVSerde


Problem description

I have defined a table on top of files stored in HDFS, and I am using the OpenCSV SerDe to read them. However, backslash ('\') characters in the data are omitted from the final result set.

Is there a Hive SerDe property that I am not using correctly? According to the documentation, escapeChar = '\' should fix this, but the problem persists.

   CREATE EXTERNAL TABLE `tsr`(
    `last_update_user` string COMMENT 'from deserializer',
    `last_update_datetime` string COMMENT 'from deserializer')
    ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
    'escapeChar'='\',
    'quoteChar'='\"',
    'separatorChar'=',',
    'serialization.encoding'='UTF-8')
    STORED AS INPUTFORMAT
    'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
    'hdfs://edl/hive/db/tsr'
    TBLPROPERTIES (
    'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
    'numFiles'='1',
    'numRows'='1869',
    'rawDataSize'='0',
    'serialization.null.format'='',
    'totalSize'='144640',
    'transient_lastDdlTime'='1524479930')

Sample output:

DomainUser1 , 2017-07-04 19:07:27

Expected result:

Domain\User1 , 2017-07-04 19:07:27

EDIT 1: I have tried both '\\' and '\' as the escapeChar, and both show the same problem.

Recommended answer

Unfortunately, the CSV SerDe in Hive does not support multiple characters as the separator, quote, or escape. It looks like you want to use two backslashes as the escapeChar, which is not possible: OpenCSVSerde accepts only a single character as the escape (it uses CSVReader underneath, which supports only one). Because the backslash is itself the escape character, a lone backslash in the data is consumed by the parser, which is why Domain\User1 comes back as DomainUser1.

I am not aware of any other SerDe in Hive that supports multi-character escapes. You can always implement your own UDF with another library, though that is not the most popular option (nobody wants to maintain their own stuff :)). I would recommend using a different character as the escape, ideally one that does not appear in your data. A second option is to modify your data during ingestion, replacing \ with \\.
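As a minimal sketch of the first suggestion, the escape character on the existing table could be changed with ALTER TABLE ... SET SERDEPROPERTIES. The caret ('^') below is only an assumed placeholder. Pick any single character that you have verified does not occur in your files and that differs from the separator and quote characters.

```sql
-- Sketch only: '^' is an assumed escape character, not taken from the question.
-- It must be a single character that never appears in the data.
ALTER TABLE tsr SET SERDEPROPERTIES ('escapeChar' = '^');

-- Re-read the table to check that values such as Domain\User1 keep their backslash.
SELECT last_update_user, last_update_datetime
FROM tsr
LIMIT 5;
```

Since this is an external table, the change affects only how Hive parses the files. The data in HDFS is left untouched, so the property can be set back if the result is not what you expect.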
