DataFrame.write.parquet - Parquet file cannot be read by HIVE or Impala


Question

I wrote a DataFrame with pySpark to HDFS with this command:

from pyspark.sql.functions import col

df.repartition(col("year"))\
    .write.option("maxRecordsPerFile", 1000000)\
    .parquet('/path/tablename', mode='overwrite', partitionBy=["year"], compression='snappy')

Looking into HDFS, I can see that the files are properly placed there. However, when I try to read the table with HIVE or Impala, the table cannot be found.

What's going wrong here, am I missing something?

Interestingly, df.write.format('parquet').saveAsTable("tablename") works fine.

Answer

This is the expected behaviour from Spark:

  • df...etc.parquet("") writes the data to the HDFS location and won't create any table in Hive.

  • but df...saveAsTable("") creates the table in Hive and writes the data to it, as the sketch below illustrates.
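
A minimal sketch of the contrast, assuming a DataFrame df already exists; the path and table name are illustrative (the empty-string arguments above are placeholders from the original answer):

# Writes parquet files to the HDFS path only; nothing is registered in the
# Hive metastore, so Hive and Impala cannot see the data.
df.write.parquet('/path/tablename', mode='overwrite')

# Registers 'tablename' in the Hive metastore and writes the data to it,
# so Hive and Impala can query it afterwards.
df.write.format('parquet').saveAsTable('tablename')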

If the table already exists, the behavior of this function depends on the save mode, specified by the mode function (the default is to throw an exception). When the mode is Overwrite, the schema of the DataFrame does not need to be the same as that of the existing table.
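
For example, a brief sketch of setting that save mode (the table name is illustrative):

# Default mode ('errorifexists') raises an AnalysisException if the table exists.
df.write.format('parquet').saveAsTable('tablename')

# 'overwrite' drops and recreates the table; the DataFrame's schema does not
# have to match the schema of the previous table.
df.write.format('parquet').mode('overwrite').saveAsTable('tablename')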

That's the reason why you are not able to find the table in Hive after performing df...parquet("").
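
If you still want Hive or Impala to query the files written by df.write.parquet(), one common workaround is to declare an external table over the directory and have the metastore discover the partitions. This is a sketch, not part of the original answer: the column list is hypothetical, while the path and partition column follow the question.

# Assumes a SparkSession named `spark` with Hive support enabled; the columns
# (id, value) are made up for illustration and must match the actual schema
# of the parquet files.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS tablename (id BIGINT, value STRING)
    PARTITIONED BY (year INT)
    STORED AS PARQUET
    LOCATION '/path/tablename'
""")

# Let the metastore pick up the existing year=... partition directories.
spark.sql("MSCK REPAIR TABLE tablename")

On the Impala side, running INVALIDATE METADATA tablename in impala-shell is typically needed before the table shows up there.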
