Spark SQL on ORC files doesn't return correct Schema (Column names)

Published: 2021/12/28 23:31:16 | Tags: apache-spark, apache-spark-sql, apache-hive

I have a directory containing ORC files. I am creating a DataFrame using the code below:

var data = sqlContext.sql("SELECT * FROM orc.`/directory/containing/orc/files`");

It returns a DataFrame with this schema:

[_col0: int, _col1: bigint]

whereas the expected schema is:

[scan_nbr: int, visit_nbr: bigint]

When I query files in Parquet format, I get the correct schema.

Am I missing any configuration?

Adding more details:

This is Hortonworks Distribution HDP 2.4.2 (Spark 1.6.1, Hadoop 2.7.1, Hive 1.2.1)

We haven't changed HDP's default configuration, but this is definitely not the same as the plain vanilla version of Hadoop.

The data is written by upstream Hive jobs using a simple CTAS (CREATE TABLE sample STORED AS ORC AS SELECT ...).

I tested this on files generated by a CTAS with the latest Hive 2.0.0, and it preserves the column names in the ORC files.
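
As a stopgap while the cause is unclear, the positional `_colN` placeholders could be mapped back to the expected names by hand. The sketch below is plain Scala and is not part of the original setup: `OrcColumnNames`, `allPlaceholders`, and `restore` are hypothetical helper names, and it assumes the column order in the files matches the expected schema. It only renames columns and does nothing about whatever writes the placeholder names in the first place.

```scala
// Hypothetical helper: map placeholder names (_col0, _col1, ...) back to expected names.
object OrcColumnNames {
  // Matches the positional placeholder names seen in the returned schema.
  private val Placeholder = """_col(\d+)""".r

  /** True if the list is non-empty and every name is a placeholder like `_col0`. */
  def allPlaceholders(names: Seq[String]): Boolean =
    names.nonEmpty && names.forall {
      case Placeholder(_) => true
      case _              => false
    }

  /** Replaces `_colN` with `expected(N)`; any other name is left unchanged. */
  def restore(actual: Seq[String], expected: Seq[String]): Seq[String] =
    actual.map {
      // The length guard avoids an Int overflow on absurdly long indices.
      case Placeholder(idx) if idx.length <= 9 && idx.toInt < expected.length =>
        expected(idx.toInt)
      case other => other
    }
}

// Possible use with the DataFrame from the query above (requires Spark, not run here):
//   val expected = Seq("scan_nbr", "visit_nbr")
//   val fixed = data.toDF(OrcColumnNames.restore(data.columns, expected): _*)
```

Applying the names through `toDF` keeps the data untouched and changes only the column labels, so a wrong assumption about column order would silently mislabel columns. Checking `allPlaceholders(data.columns)` first limits the rename to files that actually show the problem.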
