Unable to configure ORC properties in Spark
Question
I am using Spark 1.6 (Cloudera 5.8.2) and tried the methods below to configure ORC properties, but they have no effect on the output.
Below is the code snippet I tried.
DataFrame dataframe = hiveContext.createDataFrame(rowData, schema);
dataframe.write().format("orc").options(new HashMap<String, String>() {
    {
        put("orc.compress", "SNAPPY");
        put("hive.exec.orc.default.compress", "SNAPPY");
        put("orc.compress.size", "524288");
        put("hive.exec.orc.default.buffer.size", "524288");
        put("hive.exec.orc.compression.strategy", "COMPRESSION");
    }
}).save("spark_orc_output");
Apart from this, I also tried setting these properties in hive-site.xml and on the hiveContext object.
Running hive --orcfiledump on the output confirms that the configurations were not applied. An orcfiledump snippet is below.
Compression: ZLIB
Compression size: 262144
Accepted answer
You are making two different errors here. I don't blame you; I've been there...
Issue #1
orc.compress and the rest are not Spark DataFrameWriter options. They are Hive configuration properties, that must be defined before creating the hiveContext object...
- either in the hive-site.xml available to Spark at launch time
- or in your code, by re-creating the SparkContext...
sc.getConf.get("orc.compress","
sc.stop
val scAlt = new org.apache.spark.SparkContext((new org.apache.spark.SparkConf).set("orc.compress","snappy"))
scAlt.getConf.get("orc.compress","<undefined>")//现在是 Snappy
val hiveContextAlt = new org.apache.spark.sql.SQLContext(scAlt)
With Spark 2.x the script would become...
spark.sparkContext.getConf.get("orc.compress","<undefined>") // check the current value
spark.close
val sparkAlt = org.apache.spark.sql.SparkSession.builder().config("orc.compress","snappy").getOrCreate()
sparkAlt.sparkContext.getConf.get("orc.compress","<undefined>") // will now be Snappy
Issue #2
Spark uses its own SerDe libraries for ORC (and Parquet, JSON, CSV, etc.), so it does not have to honor the standard Hadoop/Hive properties.
There are some Spark-specific properties for Parquet, and they are well documented. But again, these properties must be set before creating (or re-creating) the hiveContext.
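As a minimal sketch of that (assuming the Spark 1.6 Java API and using the documented spark.sql.parquet.compression.codec property as the example; the application name is hypothetical), setting such a property before the contexts are created could look like this:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

// Define the Parquet-specific property on the SparkConf before any context exists,
// as recommended above; the master URL is expected to come from spark-submit.
SparkConf conf = new SparkConf()
        .setAppName("parquet-config-demo")
        .set("spark.sql.parquet.compression.codec", "snappy");
JavaSparkContext jsc = new JavaSparkContext(conf);
HiveContext hiveContext = new HiveContext(jsc.sc());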
For ORC and the other formats, you have to resort to format-specific DataFrameWriter options; quoting the latest JavaDoc...
You can set the following ORC-specific option(s) for writing ORC files:
• compression (default snappy): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, and lzo). This will override orc.compress
Note that the default compression codec has changed with Spark 2; before that it was zlib.
So the only thing you can set is the compression codec, using
dataframe.write().format("orc").option("compression","snappy").save("wtf")