Pandas cannot read parquet files created in PySpark

Views: 200 | Posted: 2020/5/24 2:46:06 | Tags: python, pandas, apache-spark, pyspark, parquet

Problem description

I am writing a parquet file from a Spark DataFrame as follows:

df.write.parquet("path/myfile.parquet", mode = "overwrite", compression="gzip")

This creates a folder containing multiple files.
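To make the folder layout concrete: Spark typically writes one `part-*.parquet` data file per partition, alongside a `_SUCCESS` marker and, on some file systems, hidden `.crc` checksum files. The sketch below separates the data files from those extras. The function name `list_part_files` is hypothetical, and the underscore/dot naming convention is an assumption about how the folder was produced.

```python
import os


def list_part_files(folder):
    """Return the data files in a Spark-style output folder, sorted by name."""
    names = sorted(os.listdir(folder))
    # Skip marker files such as _SUCCESS and hidden checksum files such as
    # .part-00000.parquet.crc; keep only regular files.
    return [
        name
        for name in names
        if not name.startswith(("_", "."))
        and os.path.isfile(os.path.join(folder, name))
    ]
```

Calling `list_part_files("path/myfile.parquet")` on the folder from the question should then return only the part files that actually hold rows.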

When I try to read this into pandas, I get one of the following errors, depending on which parser (engine) I use:

import pandas as pd
df = pd.read_parquet("path/myfile.parquet", engine="pyarrow")

PyArrow:

File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status

ArrowIOError: Invalid parquet file. Corrupt footer.

fastparquet:

File "C:\Program Files\Anaconda3\lib\site-packages\fastparquet\util.py", line 38, in default_open return open(f, mode)

PermissionError: [Errno 13] Permission denied: 'path/myfile.parquet'
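Both messages are consistent with the reader treating the folder, or a non-data file inside it, as one Parquet file. On Windows, calling `open()` on a directory raises `PermissionError`, and a valid Parquet file must begin and end with the 4-byte magic `PAR1`. The following is only a diagnostic sketch for checking a path by hand, not what either library does internally; the name `looks_like_parquet_file` is hypothetical.

```python
import os

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet_file(path):
    """Return True if path is a regular file framed by the Parquet magic bytes."""
    # A folder written by Spark is not itself a Parquet file.
    if not os.path.isfile(path):
        return False
    # Too small to hold both the leading and the trailing magic.
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as handle:
        head = handle.read(len(PARQUET_MAGIC))
        handle.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = handle.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

If this returns False for `path/myfile.parquet` but True for the part files inside it, the path points at a folder rather than at a single file.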

I am using the following versions:

Spark 2.4.0
Pandas 0.23.4
pyarrow 0.10.0
fastparquet 0.2.1

I tried gzip as well as snappy compression; neither works. I also made sure that the file is in a location where Python has read/write permissions.

It would already help if somebody could reproduce this error.

Recommended answer

Since this still seems to be an issue even with newer pandas versions, I wrote some functions to work around it as part of a larger pyspark helpers library:

import datetime
import os

import pandas as pd


def read_parquet_folder_as_pandas(path, verbosity=1):
    """Read every *.parquet part file in a folder into one DataFrame."""
    # Spark writes one part file per partition; sort for a stable row order.
    files = sorted(f for f in os.listdir(path) if f.endswith(".parquet"))

    if verbosity > 0:
        print("{} parquet files found. Beginning reading...".format(len(files)), end="")
        start = datetime.datetime.now()

    # Read each part file separately, then stack the results.
    df_list = [pd.read_parquet(os.path.join(path, f)) for f in files]
    df = pd.concat(df_list, ignore_index=True)

    if verbosity > 0:
        end = datetime.datetime.now()
        print(" Finished. Took {}".format(end - start))
    return df


def read_parquet_as_pandas(path, verbosity=1):
    """Workaround for pandas not being able to read folder-style parquet files."""
    if os.path.isdir(path):
        if verbosity > 1:
            print("Parquet file is actually folder.")
        return read_parquet_folder_as_pandas(path, verbosity)
    # A plain file can be handed to pandas directly.
    return pd.read_parquet(path)

This assumes that the relevant files in the parquet "file", which is actually a folder, end with ".parquet". It works for parquet files exported by Databricks and might work with others as well (untested; feedback in the comments is welcome).

The function read_parquet_as_pandas() can be used when it is not known beforehand whether the path is a folder or a single file.
