使用默认的SerDE加载表格数据 [英] Hive table data load with default SerDE

查看:78
本文介绍了使用默认的SerDE加载表格数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

数据的格式如下:

a,b,c,d,e

p,q,e,r,t

a,s,t,g,t

我想创建一个Hive表格



Col1,Col2,Col3,Col4



a,b ,c,d,e

p,q,e,r,t

a,s,t ,g,t

如上所示,如果数据封装在双引号中,则在创建表数据时不应考虑逗号之间的逗号。如果我使用默认的SerDe,则双引号将被忽略,并且b,c被视为两个单独的列。



如何确保双引号将忽略两个元素之间的逗号(如果封装在双引号中)

解决方案

如果可行且可行,我首先建议您探索一下,看看输入数据是否可以通过除逗号之外的其他字段进行消毒处理。使用数据中自然会出现的分隔模式总是很冒险的。



但是,如果这是不可能的,那么有这种基于正则表达式的方式来检测引用的逗号:


  1. 首先将数据作为单列行(每行中的整行)提取到临时表中。
  2. >
  3. 检测引号之间出现的逗号并将其替换为假冒的占位符。

  4. 用逗号分隔结果字符串。

  5. 用人们最初表示的逗号代替人造的占位符。

作为一个具体的例子,我加载了以下内容单列临时表与您的数据(步骤#1):

  hive>描述分段; 
确定
rawline字符串
所用时间:0.238秒,提取:1行
配置单元> SELECT * FROM staging;
OK
a,b,c,d,e
p,q,e,r,t
a,s,t,g,t
所用时间:0.277秒,提取:3行

以下查询生成最终的目标表。

  DROP TABLE IF EXISTS test; 
CREATE TABLE测试(
Col1 STRING,
Col2 STRING,
Col3 STRING,
Col4 STRING
);
INSERT INTO TABLE测试SELECT
regexp_replace(fields [0],\\ [QUOTEDCOMMA\\\\,,), - Step#4
regexp_replace(字段[1],\\ [QUOTEDCOMMA\\\\,,), - 步骤#4
regexp_replace(fields [2],\\ [QUOTEDCOMMA\\\ \\步骤#4
regexp_replace(fields [3],\\ [QUOTEDCOMMA\\\\\,,) - Step#4
FROM(
SELECT split( - Step#2 and#3
regexp_replace(rawline,\([^,] *),([^,] *)\) ,$ 1 [QUOTEDCOMMA] $ 2),
',')AS字段
FROM分段
)t;

这会生成以下最终表 test

  hive> SELECT * FROM test; 
确定
ab,cde
pqe,rt
ast,gt
所用时间:0.196秒,提取:3行

在这个示例实现中,字符串 [QUOTEDCOMMA] 被用作人造占位符在引号之间找到的逗号。选择是完全随意的,实际上如果你走这条路线,你会想确保你的占位符不会自然地出现在你的数据中。


The data is in the following format

a,"b,c",d, e

p,q,"e,r", t

a,s,"t,g", t

I wanted to create a Hive table

Col1, Col2, Col3 , Col4

a , b,c , d , e

p , q , e,r , t

a , s , t,g , t

As you see above, if the data is encapsulated in double quotes, the comma in between should not be considered in creating the table data. If I use the default SerDe, the double quotes are ignored and b,c is considered as two separate columns.

How do I ensure that the double quote will ignore the comma in between two elements if encapsulated in double quotes

解决方案

If it is possible and feasible, I'd first recommend you explore seeing if the input data can be sanitized in such a way to use something besides commas for field delimiters. It is always dicey to use a delimiting pattern that may naturally occur in your data.

But if that is not possible, then there is this regular expression-based way to detect quoted commas:

  1. First ingest your data into a staging table as single-column rows (the entire line into each row).
  2. Detect commas occurring between quotation marks and replace them with an artificial place holder.
  3. Split the resulting string using commas as delimiters.
  4. Replace the artificial placeholders with the commas that they originally represented.

As a contrived concrete example, I loaded up the following single-column staging table with your data (step #1):

hive> DESCRIBE staging;
OK
rawline                     string
Time taken: 0.238 seconds, Fetched: 1 row(s)
hive> SELECT * FROM staging;
OK
a,"b,c",d, e
p,q,"e,r", t
a,s,"t,g", t
Time taken: 0.277 seconds, Fetched: 3 row(s)

The following query then generates the final target table.

DROP TABLE IF EXISTS test;
CREATE TABLE test (
    Col1 STRING,
    Col2 STRING,
    Col3 STRING,
    Col4 STRING
  );
INSERT INTO TABLE test SELECT
  regexp_replace(fields[0], "\\[QUOTEDCOMMA\\]", ","),  -- Step #4
  regexp_replace(fields[1], "\\[QUOTEDCOMMA\\]", ","),  -- Step #4
  regexp_replace(fields[2], "\\[QUOTEDCOMMA\\]", ","),  -- Step #4
  regexp_replace(fields[3], "\\[QUOTEDCOMMA\\]", ",")   -- Step #4
FROM (
  SELECT split(  -- Step #2 and #3
    regexp_replace(rawline, "\"([^,]*),([^,]*)\"", "$1[QUOTEDCOMMA]$2"),
    ',') AS fields
  FROM staging
) t;

This generates the following final table test:

hive> SELECT * FROM test;
OK
a   b,c     d        e
p   q       e,r      t
a   s       t,g      t
Time taken: 0.196 seconds, Fetched: 3 row(s)

In this sample implementation, the string [QUOTEDCOMMA] is being used as the artificial placeholder for a comma that is found between quotation marks. The choice was completely arbitrary, and in practice if you go this route you will want to make sure your placeholder does not naturally occur within your data.

这篇关于使用默认的SerDE加载表格数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆