Unable to infer schema when loading Parquet file

Question

response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite") # Success
print(outcome.schema)
# StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

But then:

outcome2 = sqlc.read.parquet(response)  # fail

fails with:

AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

I am using Spark 2.1.1. It also fails in 2.2.0.

I found this bug report, but it was fixed in 2.0.1 and 2.1.0.

UPDATE: This works when connected with master="local", and fails when connected to master="mysparkcluster".

Recommended answer

This error usually occurs when you try to read an empty directory as Parquet. Your outcome DataFrame is probably empty.
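One way to test the "empty directory" theory is to look at what the write actually left on disk. The sketch below is only an illustration: list_data_files is a hypothetical helper, not part of Spark, and it inspects only the local filesystem of the machine it runs on. It assumes that names starting with "_" or "." (such as _SUCCESS or checksum files) are metadata rather than data.

```python
import os


def list_data_files(path):
    """Return the files under `path` that look like data files.

    Assumption: files and directories whose names start with "_" or "."
    (for example _SUCCESS, _temporary, or .crc checksums) are metadata
    and are skipped.
    """
    found = []
    for root, dirs, files in os.walk(path):
        # Prune metadata directories in place so os.walk does not descend into them.
        dirs[:] = [d for d in dirs if not d.startswith(("_", "."))]
        for name in files:
            if name.startswith(("_", ".")):
                continue
            found.append(os.path.join(root, name))
    return sorted(found)
```

If list_data_files("mi_or_chd_5") returns an empty list on the machine that runs the read, the directory holds no data files there, which would be consistent with the error above.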

You can check whether the DataFrame is empty with outcome.rdd.isEmpty() before writing it.
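A minimal sketch of that check, wrapped in a helper. The name write_parquet_if_nonempty is hypothetical; the only Spark calls it relies on are df.rdd.isEmpty() and df.write.parquet(path, mode="overwrite"), both of which appear in the question or the answer above.

```python
def write_parquet_if_nonempty(df, path):
    """Write `df` to `path` as Parquet unless it has no rows.

    Returns True if the data was written, False if the write was skipped.
    """
    if df.rdd.isEmpty():
        # Skip the write: an output directory without data files
        # cannot have its schema inferred when read back.
        return False
    df.write.parquet(path, mode="overwrite")
    return True
```

With the names from the question, this would be called as write_parquet_if_nonempty(outcome, response), and sqlc.read.parquet(response) would only be attempted when it returns True.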
