How to handle delimiter in column value?

Problem description

I am trying to load CSV file data into my Hive table, but one column's value contains the delimiter (,), so Hive treats it as a field separator and loads the rest of the value into a new column. I tried using the escape character \, but that does not work either: the data after the , is still loaded into a new column.

My CSV file:

        id,name,desc,per1,roll,age
        226,a1,"\"double bars","item1 and item2\"",0.0,10,25
        227,a2,"\"doubles","item2 & item3 item4\"",0.1,20,35
        228,a3,"\"double","item3 & item4 item5\"",0.2,30,45
        229,a4,"\"double","item5 & item6 item7\"",0.3,40,55

I have updated my table:

    create table testing(id int, name string, desc string, uqc double, roll int, age int)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
        "separatorChar" = ",",
        "quoteChar" = '"',
        "escapeChar" = "\\"
    ) STORED AS textfile;

But I am still getting the data after the , in a different column.
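One possible reason the quote and escape settings do not help: in the sample rows the comma sits between two separately quoted pieces (`"\"double bars"` and `"item1 and item2\""`), so it is outside any quoted region. The sketch below uses Python's `csv` module with the same separator, quote, and escape characters to show how a quote-aware parser reads such a line. This is an illustration only; Python's `csv` module is not OpenCSVSerde and the two may differ in details, and `quote_aware_split` is a hypothetical helper name.

```python
import csv
import io

# First data row copied from the sample file, backslashes included.
row = r'226,a1,"\"double bars","item1 and item2\"",0.0,10,25'


def quote_aware_split(line):
    """Parse one line with a quote- and escape-aware CSV reader."""
    reader = csv.reader(
        io.StringIO(line),
        delimiter=",",
        quotechar='"',
        escapechar="\\",
    )
    return next(reader)


fields = quote_aware_split(row)

# Still 7 fields: the first quoted piece closes right before the comma,
# so the comma is read as a real field separator.
print(len(fields), fields)
```

Under these assumptions the line still yields seven fields, which suggests the problem is in how the value is quoted in the file rather than in the SerDe properties.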

I'm loading the file with the LOAD DATA INPATH command.

Recommended answer

This is how to create the table based on RegexSerDe.

Each column should have a corresponding capturing group () in the regex. You can easily debug the regex without creating the table by using regexp_replace:

select regexp_replace('226,a1,"\"double bars","item1 and item2\"",0.0,10,25',
                      '^(\\d+?),(.*?),"(.*)",([0-9.]*),([0-9]*),([0-9]*).*', --6 groups
                     '$1 $2 $3 $4 $5 $6'); --space delimited fields 

Result:

226 a1 "double bars","item1 and item2" 0.0 10 25
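The same check can be run without a Hive session. The sketch below uses Python's `re` module as a stand-in for the Java regex engine; the constructs in this pattern (lazy and greedy quantifiers, character classes, capturing groups) behave the same way in both, but that equivalence is an assumption of the example. `split_fields` is a hypothetical helper name.

```python
import re

# Same pattern as in the query above; the SQL-level \\ becomes a single \ here.
PATTERN = re.compile(r'^(\d+?),(.*?),"(.*)",([0-9.]*),([0-9]*),([0-9]*).*')


def split_fields(line):
    """Return the six captured groups, or None if the line does not match."""
    m = PATTERN.match(line)
    return m.groups() if m else None


# The string as regexp_replace sees it once \" in the SQL literal is unescaped.
unescaped = '226,a1,""double bars","item1 and item2"",0.0,10,25'
# The line as it is stored in the CSV file, backslashes included.
raw = r'226,a1,"\"double bars","item1 and item2\"",0.0,10,25'

print(" ".join(split_fields(unescaped)))
print(" ".join(split_fields(raw)))
```

The first line printed matches the result shown above. On the raw file line the third group keeps the two backslashes, because nothing unescapes them before the regex is applied, so it may be worth checking whether the desc column ends up containing them.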

If the result looks good, create the table:

create external table testing(id int,
                              name string,
                              desc string,
                              uqc double,
                              roll int,
                              age int
                             )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='^(\\d+?),(.*?),"(.*)",([0-9.]*),([0-9]*),([0-9]*).*')
location ....
TBLPROPERTIES("skip.header.line.count"="1")
;
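Before loading real data, it can help to confirm that every line of the file matches the pattern and converts to the declared column types. The sketch below is an offline approximation in Python, not how Hive executes the table: `read_rows` and `SAMPLE` are hypothetical names, the header skip mirrors a `skip.header.line.count` of 1, and the `None` placeholder for a rejected line is a choice made for this example.

```python
import re

# Same pattern as 'input.regex' above, with \\ written as a single \.
PATTERN = re.compile(r'^(\d+?),(.*?),"(.*)",([0-9.]*),([0-9]*),([0-9]*).*')

# Sample file contents copied from the question.
SAMPLE = r'''id,name,desc,per1,roll,age
226,a1,"\"double bars","item1 and item2\"",0.0,10,25
227,a2,"\"doubles","item2 & item3 item4\"",0.1,20,35
228,a3,"\"double","item3 & item4 item5\"",0.2,30,45
229,a4,"\"double","item5 & item6 item7\"",0.3,40,55
'''


def read_rows(text, skip_header_lines=1):
    """Skip the header, then turn each line into (id, name, desc, uqc, roll, age)."""
    rows = []
    for line in text.splitlines()[skip_header_lines:]:
        m = PATTERN.fullmatch(line)
        if m is None:
            rows.append(None)  # placeholder for a line the pattern rejects
            continue
        id_, name, desc, uqc, roll, age = m.groups()
        rows.append((int(id_), name, desc, float(uqc), int(roll), int(age)))
    return rows


for parsed in read_rows(SAMPLE):
    print(parsed)
```

All four sample rows match and convert cleanly here; a `None` in the output would point at a line whose layout the regex does not cover.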

Read this article for more details.
