Unable to configure ORC properties in Spark


Problem Description



I am using Spark 1.6 (Cloudera 5.8.2) and tried the methods below to configure ORC properties, but they do not affect the output.

Below is the code snippet I tried.

 // Attempt: pass the Hive/ORC settings as DataFrameWriter options
 DataFrame dataframe = hiveContext.createDataFrame(rowData, schema);
 dataframe.write().format("orc").options(new HashMap<String, String>() {
     {
         put("orc.compress", "SNAPPY");
         put("hive.exec.orc.default.compress", "SNAPPY");

         put("orc.compress.size", "524288");
         put("hive.exec.orc.default.buffer.size", "524288");

         put("hive.exec.orc.compression.strategy", "COMPRESSION");
     }
 }).save("spark_orc_output");

Apart from this, I also tried setting these properties in hive-site.xml and on the hiveContext object.
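
The hiveContext attempt was essentially the following (a minimal sketch, not the exact code; HiveContext inherits setConf(String, String) from SQLContext, and the keys mirror the writer options above):

 // Sketch only: set the same Hive/ORC properties directly on the HiveContext
 hiveContext.setConf("orc.compress", "SNAPPY");
 hiveContext.setConf("hive.exec.orc.default.compress", "SNAPPY");
 hiveContext.setConf("orc.compress.size", "524288");
 hiveContext.setConf("hive.exec.orc.default.buffer.size", "524288");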

hive --orcfiledump on the output confirms that the configurations were not applied. The orcfiledump snippet is below.

Compression: ZLIB
Compression size: 262144

Solution

You are making two different errors here. I don't blame you; I've been there...

Issue #1
orc.compress and the rest are not Spark DataFrameWriter options. They are Hive configuration properties that must be defined before creating the hiveContext object...

  • either in the hive-site.xml available to Spark at launch time
  • or in your code, by re-creating the SparkContext...

 sc.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
 sc.stop
 val scAlt = new org.apache.spark.SparkContext((new org.apache.spark.SparkConf).set("orc.compress","snappy"))
 scAlt.getConf.get("orc.compress","<undefined>") // will now be Snappy
 val hiveContextAlt = new org.apache.spark.sql.SQLContext(scAlt)
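
In the Java code from the question, the same re-creation would look roughly like this (a sketch assuming the Spark 1.6 Java API; the application name is made up):

 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.hive.HiveContext;

 // Sketch only: bake the Hive property into the SparkConf, then build the
 // contexts after it is set (stop any existing SparkContext first).
 SparkConf conf = new SparkConf()
         .setAppName("orc-compression-demo")   // hypothetical app name
         .set("orc.compress", "snappy");
 JavaSparkContext jsc = new JavaSparkContext(conf);
 HiveContext hiveContext = new HiveContext(jsc.sc());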

[Edit] with Spark 2.x the script would become...
 spark.sparkContext.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
 spark.close
 val sparkAlt = org.apache.spark.sql.SparkSession.builder().config("orc.compress","snappy").getOrCreate()
 sparkAlt.sparkContext.getConf.get("orc.compress","<undefined>") // will now be Snappy

Issue #2
Spark uses its own SerDe libraries for ORC (and Parquet, JSON, CSV, etc.), so it does not have to honor the standard Hadoop/Hive properties.

There are some Spark-specific properties for Parquet, and they are well documented. But again, these properties must be set before creating (or re-creating) the hiveContext.
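
For instance, spark.sql.parquet.compression.codec is one of those documented Parquet properties; a sketch of supplying it up front (the property name is real, everything else is illustrative):

 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.hive.HiveContext;

 // Sketch only: spark.sql.* entries set in the SparkConf become Spark SQL
 // defaults once the context is created, so the codec is in place before any write.
 SparkConf conf = new SparkConf()
         .setAppName("parquet-codec-demo")   // hypothetical app name
         .set("spark.sql.parquet.compression.codec", "snappy");
 HiveContext hiveContext = new HiveContext(new JavaSparkContext(conf).sc());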

For ORC and the other formats, you have to resort to format-specific DataFrameWriter options; quoting the latest JavaDoc...

You can set the following ORC-specific option(s) for writing ORC files:
compression (default snappy): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, and lzo). This will override orc.compress

Note that the default compression codec has changed with Spark 2; before that it was zlib

So the only thing you can set is the compression codec, using

dataframe.write().format("orc").option("compression","snappy").save("wtf")
