Loading JSON file with serde in Cloudera

Question

I am trying to work with a JSON file with this bag structure:

{
   "user_id": "kim95",
   "type": "Book",
   "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.",
   "year": "1995",
   "publisher": "ACM Press and Addison-Wesley",
   "authors": [
      {
         "name": "null"
      }
   ],
   "source": "DBLP"
}
{
   "user_id": "marshallo79",
   "type": "Book",
   "title": "Inequalities: Theory of Majorization and Its Application.",
   "year": "1979",
   "publisher": "Academic Press",
   "authors": [
      {
         "name": "Albert W. Marshall" 
      },
      {
         "name": "Ingram Olkin"
      }
   ],
   "source": "DBLP"
}

I tried to use a SerDe to load the JSON data into Hive. I followed both approaches described here: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/

With this code:

CREATE EXTERNAL TABLE IF NOT EXISTS serd (
           user_id:string, 
           type:string, 
           title:string,
           year:string,
           publisher:string,
           authors:array<struct<name:string>>,
           source:string)       
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION '/user/hdfs/data/book-seded_workings-reduced.json';

I get this error:

Error while compiling statement: FAILED: ParseException line 2:17 cannot recognize input near ':' 'string' ',' in column type

I also tried this version: https://github.com/rcongiu/Hive-JSON-Serde

It gave a different error:

Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.openx.data.jsonserde.JsonSerde

Any ideas?

I would also like to know what the alternatives are for working with JSON like this, so that I can query the 'name' field inside 'authors'. Should I use Pig or Hive?

I have already converted the data into a "tsv" file. But since my authors column is a tuple, I don't know how to query 'name' with Hive if I build a table from that file. Should I change my "tsv" conversion script or keep it? Or are there any alternatives with Hive or Pig?

Recommended answer

ADD JAR only adds the JAR to the current session, so it is not available elsewhere and the statement eventually fails with this error. Instead, load the JAR onto all nodes in the Hive and MapReduce library paths, such as the locations below, so that the Hive and MapReduce components pick it up whenever the SerDe is called.

  1. /hadoop/CDH_5.2.0_Linux_parcel/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib/json-serde-1.3.6-jar-with-dependencies.jar

  2. /hadoop/CDH_5.2.0_Linux_parcel/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-mapreduce/lib/json-serde-1.3.6-jar-with-dependencies.jar

Note: These paths vary from cluster to cluster.
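
Once the JAR is visible to Hive, the ParseException from the question still needs a separate fix: Hive column definitions take the form "name type", and the colon is only used between a field name and its type inside a STRUCT. The following is a minimal sketch of a corrected statement plus one way to query the nested 'name' field, not a tested setup. It assumes the data has been rewritten so that each JSON record sits on a single line (the table is read line by line), and that LOCATION points at a directory rather than at the file itself. The directory /user/hdfs/data/books_json/ and the aliases author_view and a are hypothetical names chosen for illustration.

```sql
-- Sketch only: table name, columns and SerDe class are copied from the question.
-- If your Hive version rejects one of the column names as a keyword,
-- quote that name with backticks.
CREATE EXTERNAL TABLE IF NOT EXISTS serd (
    user_id   STRING,
    type      STRING,
    title     STRING,
    year      STRING,
    publisher STRING,
    authors   ARRAY<STRUCT<name:STRING>>,  -- colon is valid only inside STRUCT
    source    STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/hdfs/data/books_json/';    -- hypothetical directory holding the JSON file

-- Flatten the authors array: one output row per (book, author) pair.
SELECT b.user_id, b.title, a.name AS author_name
FROM serd b
LATERAL VIEW explode(b.authors) author_view AS a
WHERE a.name = 'Ingram Olkin';
```

With the array kept as ARRAY<STRUCT<...>> in the table, LATERAL VIEW explode lets Hive filter on each author's name directly, so a separate "tsv" conversion may not be needed for this kind of query.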
