使用OpenRefine(或R)解析复杂的大型Json文件的最佳方法 [英] Best way to parse a big and intricated Json file with OpenRefine (or R)

查看:366
本文介绍了使用OpenRefine(或R)解析复杂的大型Json文件的最佳方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我知道如何在Openfine中解析json单元格,但这对我来说太棘手了.

I know how to parse json cells in Open refine, but this one is too tricky for me.

我已经使用API​​提取了4730个AirBNB房间的日历,这些日历由其ID标识.

I've used an API to extract the calendar of 4730 AirBNB's rooms, identified by their IDs.

以下是一个Json文件的示例:

Here is an example of one Json file : https://fr.airbnb.com/api/v2/calendar_months?key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr&listing_id=4212133&month=11&year=2016&count=12&_format=with_conditions

对于从现在到2017年11月的每个ID和一年中的每一天,我想提取该房间的可用性(真或假)及其当日价格.

For each ID and each day of the year from now until november 2017, i would like to extract the availability of this rooms (true or false) and its price at this day.

我不知道如何解析这些信息.我想这意味着一系列嵌套的forEach,但是我找不到使用Open Refine做到这一点的正确方法.

I can't figure out how to parse out these informations. I guess that it implies a series of nested forEach, but i can't find the right way to do this with Open Refine.

我当然尝试过

forEach(value.parseJson().calendar_months, e, e.days) 

结果是一系列字典的数组打乱了我.

The result is an array of arrays of dictionnaries that disrupts me.

任何帮助将不胜感激.如果在Open Refine中操作太困难,那么使用R(或Python)解决方案对我来说也很好.

Any help would be appreciate. If the operation is too difficult in Open Refine, a solution with R (or Python) would also be fine for me.

推荐答案

我认为您的做法正确.输出:

I think you are on the right track. The output of:

forEach(value.parseJson().calendar_months, e, e.days) 

很难阅读,因为OpenRefine和JSON都使用方括号来表示数组.从该表达式中得到的是一个包含十二个项目的OR数组(一年中的每个月一个). OR数组中的项目是JSON-每个项目都是一个月中的几天的数组.

is hard to read because OpenRefine and JSON both use square brackets to indicate arrays. What you are getting from this expression is an OR array containing twelve items (one for each month of the year). The items in the OR array are JSON - each one an array of days in the month.

为了使步骤易于管理,我建议像这样处理它:

To keep the steps manageable I'd suggest tackling it like this:

首次使用

forEach(value.parseJson().calendar_months,m,m.days).join("|")

您必须使用'join',因为OR不能将OR数组直接存储在单元格中-它必须是字符串.

You have to use 'join' because OR can't store OR arrays directly in a cell - it has to be a string.

然后使用编辑单元格->拆分多值单元格"-每个ID将为您提供12行,每行包含一个JSON表达式.现在,对于每个ID,您在OR中有12行

Then use "Edit Cells->Split multi-valued cells" - this will get you 12 rows per ID, each containing a JSON expression. Now for each ID you have 12 rows in OR

然后使用:

forEach(value.parseJson(),d,d).join("|")

这会将JSON分为不同的日子

This splits the JSON down into the individual days

然后再次使用编辑单元格->拆分多值单元格"将每天的详细信息拆分到其自己的单元格中.

Then use "Edit Cells->Split multi-valued cells" again to split the details for each day into its own cell.

使用上述示例URL中的JSON-这为我提供了441个ID的行-每个行均包含描述可用性&的JSON.一天的价格.此时,您可以使用ID列上的填充"功能为每一行填写ID.

Using the JSON from example URL above - this gives me 441 rows for the single ID - each contains the JSON describing the availability & price for a single day. At this point you can use the 'fill down' function on the ID column to fill in the ID for each of the rows.

您现在在每个单元格中都有一些非常简单的JSON-这样您就可以使用

You've now got some pretty easy JSON in each cell - so you can extract availability using

value.parseJson().available

这篇关于使用OpenRefine(或R)解析复杂的大型Json文件的最佳方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆