JSON parsing with Elephant Bird in Pig


Problem description


I can't get the following data to parse in Pig. It is what the Twitter API returns when you request all tweets from a particular user.

Source data (I removed some digits to avoid accidentally invading anyone's privacy):

[{"created_at":"Sat Nov 01 23:15:45 +0000 2014","id":5286804225,"id_str":"5286864225","text":"@Beace_ your nan makes me laugh with some of the things she comes out with","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":52812992878592,"in_reply_to_status_id_str":"522","in_reply_to_user_id":398098,"in_reply_to_user_id_str":"3","in_reply_to_screen_name":"Be_","user":{"id":425,"id_str":"42433395","name":"SAINS","screen_name":"sa3","location":"Lincoln","profile_location":null,"description":"","url":null,"entities":{"description":{"urls":[]}},"protected":false,"followers_count":92,"friends_count":526,"listed_count":0,"created_at":"Mon May 25 16:18:05 +0000 2009","favourites_count":6,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":19,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/52016\/DGDCj67z_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/526\/DGDCj67z_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/424395\/13743515","profile_link_color":"088253","profile_sidebar_border_color":"D3D2CF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":1,"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"e_","name":"\u2601\ufe0f effy","id":3998,"id_str":"398","indices":[0,15]}],"urls":[]},"favorited":false,"retweeted":false,"lang":"en"}, {another one goes here ....} ]

I have tried a lot of things, but this is my current code:

REGISTER 'hdfs:///user/cloudera/elephant-bird-pig-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-core-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-hadoop-compat-4.1.jar';

--Load Json
loadJson =  LOAD '/user/cloudera/tweetwall' USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map []);

describe loadJson;

--dump loadJson;

--PARSING JSON
--txt
--a = FOREACH loadJson GENERATE json#'text' AS ParsedInput;

dump loadJson;

c = FOREACH loadJson GENERATE flatten(json#'text') as (m:map[]);

When I don't get errors, I get nothing back (as in 0 bytes returned once the script has finished running).

For instance:

success!

Input(s):
Successfully read 0 records (532459 bytes) from: "/user/cloudera/tweetwall"

Output(s):
Successfully stored 0 records in: "hdfs://quickstart.cloudera:8020/tmp/temp-988640258/tmp-846532109"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Solution

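1. You need to give your input JSON a root name. The file as returned by the API is a top-level array, while the loader is declared AS (json:map []), so there is no key to look the tweets up by.
    I added "tweets" as the root name:
    {"tweets":[<your input>]}

2. This is nested JSON, so you need to load the file with the '-nestedLoad' option of JsonLoader.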
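Step 1 can be done by hand, but for a large download a small script is less error-prone. The following is only a sketch in Python, not part of the original answer: the function name `wrap_tweets` and the sample data are made up for illustration. It wraps a top-level JSON array in an object with a "tweets" root and returns it as a single line.

```python
import json


def wrap_tweets(raw_text, root="tweets"):
    """Wrap a top-level JSON array in an object and return it as one line.

    `root` is the key the Pig script later looks up with json#'tweets'.
    """
    tweets = json.loads(raw_text)
    if not isinstance(tweets, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps without `indent` emits no newlines, and the default
    # ensure_ascii=True keeps non-ASCII characters as \uXXXX escapes.
    return json.dumps({root: tweets})


if __name__ == "__main__":
    # Hypothetical stand-in for the API response.
    sample = '[{"text": "first tweet"}, {"text": "second tweet"}]'
    print(wrap_tweets(sample))
```

To use it on a real file, read the file's contents, pass them to `wrap_tweets`, and write the result out as input.json.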

input.json

{"tweets":[{"created_at":"Sat Nov 01 23:15:45 +0000 2014","id":5286804225,"id_str":"5286864225","text":"@Beace_ your nan makes me laugh with some of the things she comes out with","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":52812992878592,"in_reply_to_status_id_str":"522","in_reply_to_user_id":398098,"in_reply_to_user_id_str":"3","in_reply_to_screen_name":"Be_","user":{"id":425,"id_str":"42433395","name":"SAINS","screen_name":"sa3","location":"Lincoln","profile_location":null,"description":"","url":null,"entities":{"description":{"urls":[]}},"protected":false,"followers_count":92,"friends_count":526,"listed_count":0,"created_at":"Mon May 25 16:18:05 +0000 2009","favourites_count":6,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":19,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/52016\/DGDCj67z_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/526\/DGDCj67z_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/424395\/13743515","profile_link_color":"088253","profile_sidebar_border_color":"D3D2CF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":1,"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"e_","name":"\u2601\ufe0f effy","id":3998,"id_str":"398","indices":[0,15]}],"urls":[]},"favorited":false,"retweeted":false,"lang":"en"}]}

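Pig script:

-- json-simple is the JSON parser that JsonLoader relies on
REGISTER '/tmp/json-simple-1.1.jar';
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';

-- '-nestedLoad' makes the loader parse nested objects and arrays as well
loadJson = LOAD 'input.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map []);
-- one row per element of the "tweets" array
B = FOREACH loadJson GENERATE flatten(json#'tweets') as (m:map[]);
-- pull the "text" field out of each tweet
C = FOREACH B GENERATE FLATTEN(m#'text');
DUMP C;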

Output:
(@Beace_ your nan makes me laugh with some of the things she comes out with)
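
If the script still reports 0 records, it can help to check the input file before running Pig again. The sketch below is in Python and rests on an assumption that the original post does not state: that the loader reads the file one line at a time and only keeps lines holding a JSON object. The function name `count_loadable_lines` is hypothetical.

```python
import json


def count_loadable_lines(lines):
    """Return (objects, skipped) for an iterable of text lines.

    objects: non-empty lines that parse as a JSON object.
    skipped: non-empty lines that fail to parse or hold another JSON type,
             such as a top-level array.
    """
    objects = skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are ignored entirely
        try:
            value = json.loads(line)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            skipped += 1
            continue
        if isinstance(value, dict):
            objects += 1
        else:
            skipped += 1
    return objects, skipped


if __name__ == "__main__":
    with open("input.json", encoding="utf-8") as handle:
        print(count_loadable_lines(handle))
```

Under that assumption, a file holding the raw API response (a top-level array), or a JSON object broken across several lines, would show zero loadable objects, which is consistent with the empty result described in the question.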
