Unable to configure ORC properties in Spark
Problem description
I am using Spark 1.6 (Cloudera 5.8.2) and tried the methods below to configure ORC properties, but they do not affect the output.
Below is the code snippet I tried.
DataFrame dataframe = hiveContext.createDataFrame(rowData, schema);
dataframe.write().format("orc").options(new HashMap<String, String>() {
    {
        put("orc.compress", "SNAPPY");
        put("hive.exec.orc.default.compress", "SNAPPY");
        put("orc.compress.size", "524288");
        put("hive.exec.orc.default.buffer.size", "524288");
        put("hive.exec.orc.compression.strategy", "COMPRESSION");
    }
}).save("spark_orc_output");
Apart from this, I also tried setting these properties in hive-site.xml and on the hiveContext object.
hive --orcfiledump on the output confirms that the configurations were not applied. The orcfiledump snippet is below:
Compression: ZLIB
Compression size: 262144
You are making two different errors here. I don't blame you; I've been there...
Issue #1
orc.compress and the rest are not Spark DataFrameWriter options. They are Hive configuration properties that must be defined before creating the hiveContext object...
- either in the hive-site.xml available to Spark at launch time
- or in your code, by re-creating the SparkContext...
sc.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
sc.stop
val scAlt = new org.apache.spark.SparkContext((new org.apache.spark.SparkConf).set("orc.compress","snappy"))
scAlt.getConf.get("orc.compress","<undefined>") // will now be Snappy
val hiveContextAlt = new org.apache.spark.sql.SQLContext(scAlt)
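Since the question uses the Java API, here is a rough Java equivalent of the Scala snippet above (a sketch, assuming Spark 1.6 and that sc is the pre-existing SparkContext):
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.SQLContext;

// Stop the running context, then rebuild it with orc.compress defined up front.
sc.stop();
SparkContext scAlt = new SparkContext(new SparkConf().set("orc.compress", "snappy"));
scAlt.getConf().get("orc.compress", "<undefined>"); // will now be snappy
SQLContext sqlContextAlt = new SQLContext(scAlt);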
[Edit] with Spark 2.x the script would become...
spark.sparkContext.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
spark.close
val sparkAlt = org.apache.spark.sql.SparkSession.builder().config("orc.compress","snappy").getOrCreate()
sparkAlt.sparkContext.getConf.get("orc.compress","<undefined>") // will now be Snappy
Issue #2
Spark uses its own SerDe libraries for ORC (and Parquet, JSON, CSV, etc.), so it does not have to honor the standard Hadoop/Hive properties.
There are some Spark-specific properties for Parquet, and they are well documented. But again, these properties must be set before creating (or re-creating) the hiveContext.
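To illustrate that ordering, here is a minimal Java sketch for the Parquet case, assuming Spark 1.6 and the documented spark.sql.parquet.compression.codec property (the application name is illustrative):
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.hive.HiveContext;

// Define the Parquet codec on the SparkConf before any context exists,
// so the property is already in place when the hiveContext is created.
SparkConf conf = new SparkConf()
    .setAppName("parquet-codec-example")
    .set("spark.sql.parquet.compression.codec", "snappy");
SparkContext sc = new SparkContext(conf);
HiveContext hiveContext = new HiveContext(sc);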
For ORC and the other formats, you have to resort to format-specific DataFrameWriter options; quoting the latest JavaDoc...
You can set the following ORC-specific option(s) for writing ORC files:
• compression (default snappy): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, and lzo). This will override orc.compress.
Note that the default compression codec has changed with Spark 2; before that it was zlib.
So the only thing you can set is the compression codec, using
dataframe.write().format("orc").option("compression","snappy").save("wtf")
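Applied to the snippet from the question (reusing its dataframe and output path), the fix looks like this; re-running hive --orcfiledump afterwards should report Compression: SNAPPY:
// Spark's own "compression" option overrides orc.compress, per the JavaDoc quoted above.
dataframe.write()
    .format("orc")
    .option("compression", "snappy")
    .save("spark_orc_output");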