Nested data in Parquet with Python


Question


I have a file that has one JSON per line. Here is a sample:

{
    "product": {
        "id": "abcdef",
        "price": 19.99,
        "specs": {
            "voltage": "110v",
            "color": "white"
        }
    },
    "user": "Daniel Severo"
}

I want to create a parquet file with columns such as:

product.id, product.price, product.specs.voltage, product.specs.color, user

I know that Parquet has a nested encoding based on the Dremel record-shredding algorithm, but I haven't been able to use it from Python (not sure why).

I'm a heavy pandas and dask user, so the pipeline I'm trying to construct is json data -> dask -> parquet -> pandas, although if anyone has a simple example of creating and reading these nested encodings in parquet using Python I think that would be good enough :D
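In the meantime, one workaround is to flatten the records before writing: pandas' `json_normalize` produces exactly the dotted `product.id` / `product.specs.voltage` style columns described above. A minimal sketch (the sample record is inlined here for illustration; in practice each line of the file would be parsed the same way):

```python
import json
import pandas as pd

# One JSON document per line, as in the question.
lines = [
    '{"product": {"id": "abcdef", "price": 19.99,'
    ' "specs": {"voltage": "110v", "color": "white"}},'
    ' "user": "Daniel Severo"}',
]
records = [json.loads(line) for line in lines]

# json_normalize flattens nested dicts into dotted column names
# (pandas >= 1.0; older versions expose it as pd.io.json.json_normalize).
df = pd.json_normalize(records)
print(sorted(df.columns))
# The resulting flat DataFrame can then be written with df.to_parquet(...)
# using either the pyarrow or fastparquet engine.
```

This loses the nesting in the Parquet schema itself (everything becomes a top-level column with a dotted name), but it round-trips cleanly through the json -> pandas -> parquet pipeline today.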

EDIT

So, after digging in the PRs I found this: https://github.com/dask/fastparquet/pull/177

which is basically what I want to do. However, I still can't make it work all the way through. How exactly do I tell dask/fastparquet that my product column is nested?

Solution

Implementing the conversions on both the read and write path for arbitrary Parquet nested data is quite complicated to get right -- it means implementing the shredding and reassembly algorithm, plus the associated conversions to some Python data structures. We have this on the roadmap in Arrow / parquet-cpp (see https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow), but it has not been completed yet (only simple structs and lists/arrays are supported right now). It is important to have this functionality because other systems that use Parquet, like Impala, Hive, Presto, Drill, and Spark, have native support for nested types in their SQL dialects, so we need to be able to read and write these structures faithfully from Python.

This can be analogously implemented in fastparquet as well, but it's going to be a lot of work (and test cases to write) no matter how you slice it.

I will likely take on the work (in parquet-cpp) personally later this year if no one beats me to it, but I would love to have some help.
