Loading JSON file with serde in Cloudera

Problem description

I am trying to work with a JSON file with this bag structure:

{
   "user_id": "kim95",
   "type": "Book",
   "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.",
   "year": "1995",
   "publisher": "ACM Press and Addison-Wesley",
   "authors": [
      {
         "name": "null"
      }
   ],
   "source": "DBLP"
}
{
   "user_id": "marshallo79",
   "type": "Book",
   "title": "Inequalities: Theory of Majorization and Its Application.",
   "year": "1979",
   "publisher": "Academic Press",
   "authors": [
      {
         "name": "Albert W. Marshall" 
      },
      {
         "name": "Ingram Olkin"
      }
   ],
   "source": "DBLP"
}

I tried to use a SerDe to load the JSON data into Hive. I followed both approaches that I saw here: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/

With this code:

CREATE EXTERNAL TABLE IF NOT EXISTS serd (
           user_id:string, 
           type:string, 
           title:string,
           year:string,
           publisher:string,
           authors:array<struct<name:string>>,
           source:string)       
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION '/user/hdfs/data/book-seded_workings-reduced.json';

I got this error:

Error while compiling statement: FAILED: ParseException line 2:17 cannot recognize input near ':' 'string' ',' in column type

I also tried this version: https://github.com/rcongiu/Hive-JSON-Serde

which gave a different error:

Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.openx.data.jsonserde.JsonSerde

Any idea?

I also want to know what alternatives there are for working with a JSON file like this to make queries on the 'name' field in 'authors', whether with Pig or Hive.

I have already converted it into a TSV file. But since my authors column is a tuple, I don't know how to run queries on 'name' with Hive if I build a table from this file. Should I change my script for the TSV conversion or keep it? Or are there any alternatives with Hive or Pig?

Solution

ADD JAR only adds the jar to the current session, so it is not available afterwards and the error eventually comes back. Get the JAR loaded on all the nodes under the Hive and MapReduce paths, like the locations below, so that the Hive and MapReduce components pick it up whenever it is called.

  1. /hadoop/CDH_5.2.0_Linux_parcel/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib/json-serde-1.3.6-jar-with-dependencies.jar

  2. /hadoop/CDH_5.2.0_Linux_parcel/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-mapreduce/lib/json-serde-1.3.6-jar-with-dependencies.jar

Note: this path varies from cluster to cluster.
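With the jar in place as above, the "Cannot validate serde" error should disappear. The earlier ParseException, however, comes from the column list itself: Hive DDL separates a column name from its type with whitespace, and the colon syntax is only used inside STRUCT definitions. Below is a minimal sketch of a statement that parses, keeping the table and column names from the question; the assumptions are that the OpenX SerDe jar is available (the ADD JAR path is a made-up local path) and that the JSON files sit in an HDFS directory, since LOCATION is expected to be a directory rather than the single file used in the question:

-- Optional: make the SerDe visible for the current session only
-- (the local path below is an assumed one).
ADD JAR /tmp/json-serde-1.3.6-jar-with-dependencies.jar;

CREATE EXTERNAL TABLE IF NOT EXISTS serd (
    user_id   STRING,
    `type`    STRING,
    title     STRING,
    `year`    STRING,
    publisher STRING,
    authors   ARRAY<STRUCT<name:STRING>>,
    source    STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/hdfs/data/books/';

Note that text-based JSON SerDes like this one generally expect one complete JSON object per line, so the pretty-printed records shown in the question would need to be flattened to one record per line before loading.

As for querying the 'name' field inside 'authors' directly in Hive, instead of going through a TSV conversion, one option is LATERAL VIEW with explode, which turns the authors array into one row per author (the aliases t, a and author are arbitrary):

SELECT t.title, author.name
FROM serd t
LATERAL VIEW explode(t.authors) a AS author;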
