Load complex json in hive using jsonserde


Problem description

I am trying to build a Hive table for the following JSON:

{
    "business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
    "hours": {
        "Tuesday": {
            "close": "17:00",
            "open": "08:00"
        },
        "Friday": {
            "close": "17:00",
            "open": "08:00"
        }
    },
    "open": true,
    "categories": [
        "Doctors",
        "Health & Medical"
    ],
    "review_count": 9,
    "name": "Eric Goldberg, MD",
    "neighborhoods": [],
    "attributes": {
        "By Appointment Only": true,
        "Accepts Credit Cards": true, 
        "Good For Groups": 1
    },
    "type": "business"
}

I can create a table using the following DDL, but I get an exception when I query it.

CREATE TABLE IF NOT EXISTS business (
 business_id string,
 hours map<string,string>,
 open boolean,
 categories array<string>,
 review_count int,
 name string,
 neighborhoods array<string>,
 attributes map<string,string>,
 type string
 )
 ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';

The exception while retrieving data is "ClassCast:Cant cast jsoanarray to json object". What is the correct schema for this JSON? Is there any tool that can help me generate the correct schema for a given JSON document, to be used with jsonserde?

Solution

It looks to me like the problem is hours, which you defined as hours map<string,string>. Each value under hours is itself a JSON object with open and close keys, so the column should be a map<string,map<string,string>> instead.

There's a tool you can use to generate the hive table definition automatically from your JSON data: https://github.com/quux00/hive-json-schema

but you may want to adjust its output, because when it encounters a JSON object (anything between {}) the tool can't know whether to translate it to a hive map or to a struct. On your data, the tool gives me this:

CREATE TABLE x (
 attributes struct<accepts credit cards:boolean, 
       by appointment only:boolean, good for groups:int>,
 business_id string,
 categories array<string>,
 hours map<string,struct<close:string, open:string>>,
 name string,
 neighborhoods array<string>,
 open boolean,
 review_count int,
 type string
)

but it looks like you want something like this:

CREATE TABLE x (
     attributes map<string,string>,
     business_id string,
     categories array<string>,
     hours map<string,struct<close:string, open:string>>,
     name string,
     neighborhoods array<string>,
     open boolean,
     review_count int,
     type string
    ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;

hive> load data local inpath 'json.data'  overwrite into  table x;
hive> Table default.x stats: [numFiles=1, numRows=0, totalSize=416,rawDataSize=0]
OK
hive> select * from x;
OK
{"accepts credit cards":"true","by appointment only":"true",
  "good for groups":"1"}    
  vcNAWiLM4dR7D2nwwJ7nCA    
  ["Doctors","Health & Medical"]    
  {"tuesday":{"close":"17:00","open":"08:00"},
   "friday":{"close":"17:00","open":"08:00"}}   
    Eric Goldberg, MD   ["HELLO"]   true    9   business
Time taken: 0.335 seconds, Fetched: 1 row(s)
hive>
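
With the table loaded as above, the nested columns can be read with ordinary Hive map, struct and array syntax. The queries below are a minimal sketch against table x as defined here; the lowercase keys ('tuesday', 'accepts credit cards') follow the select output shown above, and the aliases (tuesday_open, day_name, hrs and so on) are only illustrative names.

```sql
-- Look up a map entry by key, then read a struct field with dot notation.
SELECT name,
       hours['tuesday'].open  AS tuesday_open,
       hours['tuesday'].close AS tuesday_close
FROM x;

-- Map values declared as string, array indexing, and array size.
SELECT attributes['accepts credit cards'] AS accepts_cards,
       categories[0]                      AS first_category,
       size(neighborhoods)                AS neighborhood_count
FROM x;

-- One output row per day: explode() turns each map entry into (key, value).
SELECT name, day_name, hrs.open, hrs.close
FROM x
LATERAL VIEW explode(hours) h AS day_name, hrs;
```

If hours were declared as map<string,map<string,string>> instead, the inner access would be hours['tuesday']['open'] rather than hours['tuesday'].open.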

A few notes though:

  • Notice that I used a different JSON SerDe (org.openx.data.jsonserde.JsonSerDe), because the one you used is not installed on my system. I like this one better because, well, I wrote it. The create statement should work just as well with the other SerDe.
  • You may want to convert some of those maps to structs, as they may be more convenient to query. For instance, attributes could be a struct, but you'd need to map the names that contain spaces, like accepts credit cards. My SerDe allows mapping a JSON attribute to a different hive column name. That is also needed when the JSON uses an attribute that is a hive keyword, like 'timestamp' or 'create'.
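
As a sketch of the attribute-to-column mapping mentioned in the last note: the openx SerDe reads it from SerDe properties of the form "mapping.<hive column>" = "<json attribute>". The table events and its columns below are hypothetical and are not part of the data in this question; they only illustrate the keyword case.

```sql
-- Hypothetical table: each JSON record carries an attribute named "timestamp",
-- which is exposed here as the Hive column ts.
CREATE TABLE events (
    ts      string,
    message string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("mapping.ts" = "timestamp")
STORED AS TEXTFILE;
```

The same kind of property would be the thing to try for the space-containing names inside attributes, though whether it applies to fields nested in a struct is an assumption that has not been checked here.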
