Are Parquet files created with pyarrow vs pyspark compatible?


Question

I have to convert analytics data from JSON to Parquet in two steps. For the large amount of existing data I am writing a PySpark job and doing:

(df.repartition(*partitionby)
    .write.partitionBy(partitionby)
    .mode("append")
    .parquet(output, compression=codec))

However, for incremental data I plan to use AWS Lambda. PySpark would probably be overkill for it, so I plan to use PyArrow instead (I am aware that it unnecessarily involves Pandas, but I couldn't find a better alternative). So, basically:

import pyarrow.parquet as pq
pq.write_table(table, outputPath, compression='snappy',
    use_deprecated_int96_timestamps=True)
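
For context, the table above would typically be built from a Pandas DataFrame. A minimal sketch, assuming newline-delimited JSON input and hypothetical file names:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical input: one JSON record per line (JSON Lines)
df = pd.read_json("events.json", lines=True)

# Convert the Pandas DataFrame to an Arrow Table
table = pa.Table.from_pandas(df)

pq.write_table(table, "events.parquet", compression='snappy',
    use_deprecated_int96_timestamps=True)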

Will the Parquet files written by PySpark and by PyArrow be compatible (with respect to Athena)?

Answer

Parquet files written by pyarrow (long name: Apache Arrow) are compatible with Apache Spark. But you have to be careful which datatypes you write into the Parquet files, as Apache Arrow supports a wider range of them than Apache Spark does. There is currently a flag flavor=spark in pyarrow that you can use to automatically set some compatibility options so that Spark can read these files in again. Sadly, in the latest release this option is not sufficient (it is expected to change with pyarrow==0.9.0). You should take care to write out timestamps using the deprecated INT96 type (use_deprecated_int96_timestamps=True) and to avoid unsigned integer columns. For unsigned integer columns, simply convert them to signed integers. Sadly, Spark errors out if you have an unsigned type in your schema instead of just loading it as signed (the values are actually always stored as signed, but only marked as unsigned with a flag). Respecting these two things, the files should be readable in Apache Spark and AWS Athena (which is just Presto under the hood).
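
A minimal sketch putting these recommendations together; the DataFrame and its column names are hypothetical:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical data with an unsigned integer column and a timestamp column
df = pd.DataFrame({
    "user_id": np.array([1, 2, 3], dtype=np.uint32),
    "ts": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]),
})

# Cast unsigned integer columns to signed ones, since Spark errors
# out on unsigned types in the schema
for col in df.columns:
    if pd.api.types.is_unsigned_integer_dtype(df[col]):
        df[col] = df[col].astype("int64")

table = pa.Table.from_pandas(df)

# flavor='spark' applies Spark compatibility options automatically;
# use_deprecated_int96_timestamps=True stores timestamps as INT96,
# which Spark and Athena expect
pq.write_table(table, "events.parquet", compression='snappy',
    flavor='spark', use_deprecated_int96_timestamps=True)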
