Unable to infer schema for Parquet. It must be specified manually


Problem description

I am running all of the code in an EMR notebook.

spark.version

'3.0.1-amzn-0'

temp_df.printSchema()

root
 |-- dt: string (nullable = true)
 |-- AverageTemperature: double (nullable = true)
 |-- AverageTemperatureUncertainty: double (nullable = true)
 |-- State: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- weekday: integer (nullable = true)

temp_df.show(2)

+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
|        dt|AverageTemperature|AverageTemperatureUncertainty|State|Country|year|month|day|weekday|
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
|1855-05-01|            25.544|                        1.171| Acre| Brazil|1855|    5|  1|      3|
|1855-06-01|            24.228|                        1.103| Acre| Brazil|1855|    6|  1|      6|
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
only showing top 2 rows

temp_df.write.parquet(path='s3://project7878/clean_data/temperatures.parquet', mode='overwrite', partitionBy=['year'])

spark.read.parquet(path='s3://project7878/clean_data/temperatures.parquet').show(2)

An error was encountered:
Unable to infer schema for Parquet. It must be specified manually.;
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 353, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;

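I have looked at other Stack Overflow posts, but the solution offered there (the error being caused by empty files having been written) does not apply to my case.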

Please help me out. Thank you!

Recommended answer

Don't pass path as a keyword argument in the spark.read.parquet call:

>>> spark.read.parquet(path='a.parquet')
21/01/02 22:53:38 WARN DataSource: All paths were ignored:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home//bin/spark/python/pyspark/sql/readwriter.py", line 353, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/home//bin/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/home//bin/spark/python/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
>>> spark.read.parquet('a.parquet')
DataFrame[_2: string, _1: double]

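This happens because parquet() has no path parameter. It takes its paths as positional arguments, so path='a.parquet' is treated as a reader option and no path is actually passed, which is what the "All paths were ignored" warning is reporting.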

It does work if you use load, which accepts path as a keyword argument:
>>> spark.read.load(path='a', format='parquet')
DataFrame[_1: string, _2: string]
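
The difference between the two calls can be sketched without Spark at all. The block below is only an illustration: parquet_like and load_like are hypothetical stand-ins that mimic the shape of the PySpark 3.0 signatures (positional paths plus keyword options for parquet, a named path parameter for load). They are not the real DataFrameReader methods and they read no data.

```python
# Hypothetical stand-ins: they copy only the *shape* of the PySpark 3.0
# signatures and are NOT the real DataFrameReader methods.

def parquet_like(*paths, **options):
    """Positional arguments become paths; every keyword becomes an option."""
    return {"paths": list(paths), "options": options}


def load_like(path=None, format=None, **options):
    """Here path is a named parameter, so path=... is recognised as a path."""
    paths = [] if path is None else [path]
    return {"paths": paths, "format": format, "options": options}


# Keyword form: the value ends up in options, and the list of paths is empty.
keyword_call = parquet_like(path="a.parquet")

# Positional form: the value is collected as a path.
positional_call = parquet_like("a.parquet")

# load-style signature: the keyword form is picked up as a path.
load_call = load_like(path="a", format="parquet")

print(keyword_call)
print(positional_call)
print(load_call)
```

With an empty list of paths there is nothing to infer a schema from, which matches the error in the question. Passing the location positionally, as in spark.read.parquet('s3://project7878/clean_data/temperatures.parquet'), avoids the problem.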
