Hive for complex nested JSON


Problem description

I have an original input json snippet ('/home/user/testsample.json') -

{"key": "somehashvalue","columns": [["Event:2014-03-26 00\\:29\\:13+0200:json","{\"user\":{\"credType\":\"ADDRESS\",\"credValue\":\"01:AA:A4:G1:HH:UU\",\"cAgent\":null,\"cType\":\"ACE\"},\"timestamp\":1395786553,\"sessionId\":1395785353,\"className\":\"Event\",\"subtype\":\"CURRENTLYACTIVE\",\"vType\":\"TEST\",\"vId\":1235080,\"eType\":\"CURRENTLYACTIVE\",\"eData\":\"1\"}",1395786553381001],["Event:2014-03-26 00\\:29\\:13+0200:","",1395786553381001]]}

I tried to use a JSON SerDe to parse the above JSON into my Hive columns. However, the value 1395786553381001 above is not in a format the SerDe can map to a Hive column, i.e. the value appears without a key (Hive only understands JSON values that come after a "key":).

So instead I took the Array type approach and created a table -

CREATE TABLE mytesttable (
  key string,
  columns array<array<string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

LOAD DATA LOCAL INPATH '/home/user/testsample.json'
OVERWRITE INTO TABLE mytesttable;

select columns[0][1] from mytesttable; gives -

{"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"}

The above appears clean, but I also need columns[*][2] in a Hive column for further transformations.
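For example, reading the value positionally and pulling named fields out of the embedded JSON string with get_json_object does work (a sketch against the mytesttable definition above; the aliases event_ts, cred_value and bare_value are just illustrative), but it still leaves the event JSON as an unparsed string rather than proper Hive columns:

SELECT get_json_object(columns[0][1], '$.timestamp')      AS event_ts,
       get_json_object(columns[0][1], '$.user.credValue') AS cred_value,
       columns[0][2]                                       AS bare_value
FROM mytesttable;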

I wrote a regex hive query to cleanse the original Json present in '/home/user/testsample.json' (assume it is present in a table tablewithinputjson)

SELECT
REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(ij.columna, '["][{]', '{'),'[}]["]', '}'), '\\\\', '') AS columna
FROM tablewithinputjson ij;

The above query returns -

{"key": "somehashvalue","columns": [["Event:2014-03-26 00:29:13+0200:json",{"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"},1395786553381001],["Event:2014-03-26 00:29:13+0200:","",1395786553381001]]}

But here again, 1395786553381001 cannot be mapped to a Hive column, since it appears after ',' and not after ':'; more specifically, the value is present without a key. (I could add "test": before 1395786553381001, but I do not want to customize the input data, since a) too much customization is something I am not comfortable with, b) it does not seem like a good solution, and c) it would unnecessarily waste my Hadoop cluster's space and time.)

Not to confuse things any further: I have not been able to come up with a Hive table format that fully parses and maps all the fields in the original JSON snippet. Any suggestions are welcome. Please let me know if it seems too confusing.

Solution

Posting an end-to-end solution. Step-by-step procedure to convert JSON into a Hive table:

step 1) install maven if not there already

>$ sudo apt-get install maven

step 2) install git if not there already, then clone the Hive-JSON-Serde repository

>sudo git clone https://github.com/rcongiu/Hive-JSON-Serde.git

step 3) go into the $HOME/Hive-JSON-Serde folder

step 4) build the serde package

>sudo mvn -Pcdh5 clean package

step 5) The serde file will be in $HOME/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar

step 6) Add serde as dependency jar in hive

 hive> ADD JAR $HOME/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar;

step 7) create json file in $HOME/books.json (Example)

{"value": [{"id": "1","bookname": "A","properties": {"subscription": "1year","unit": "3"}},{"id": "2","bookname":"B","properties":{"subscription": "2years","unit": "5"}}]}

step 8) create tmp1 table in hive

 hive>CREATE TABLE tmp1 (
      value ARRAY<struct<id:string,bookname:string,properties:struct<subscription:string,unit:string>>>   
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 
    'mapping.value' = 'value'   
) 
STORED AS TEXTFILE;

step 9) load the data from json to tmp1 table

>LOAD DATA LOCAL INPATH '$HOME/books.json' INTO TABLE tmp1;
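
A quick sanity check (not part of the original steps; it just reads the struct fields declared for tmp1 above) could be:

 hive> SELECT value[0].id, value[0].bookname FROM tmp1;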

step 10) create a tmp2 table to do the explode operation from tmp1; this intermediate step breaks the multi-level JSON structure into multiple rows. Note: if your JSON structure is simple and single-level, skip this step

hive>create table tmp2 as 
 SELECT *
 FROM tmp1
 LATERAL VIEW explode(value) itemTable AS items;
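
Optional check (not in the original steps): selecting the exploded column shows that each struct from the array now sits in its own row, e.g.

 hive> SELECT items FROM tmp2;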

step 11) create hive table and load the values from tmp2 table

hive>create table books as 
select items.id as id, items.bookname as name, items.properties.subscription as subscription, items.properties.unit as unit from tmp2;

step 12) drop tmp tables

hive>drop table tmp1;
hive>drop table tmp2;

step 13) test the hive table

hive>select * from books;

output:

id name subscription unit
1 A 1year 3
2 B 2years 5
