Unable to configure ORC properties in Spark
Problem description
I am using Spark 1.6 (Cloudera 5.8.2) and tried the methods below to configure ORC properties, but they do not affect the output.
Below is the code snippet I tried.
DataFrame dataframe = hiveContext.createDataFrame(rowData, schema);
dataframe.write().format("orc").options(new HashMap<String, String>() {
    {
        put("orc.compress", "SNAPPY");
        put("hive.exec.orc.default.compress", "SNAPPY");
        put("orc.compress.size", "524288");
        put("hive.exec.orc.default.buffer.size", "524288");
        put("hive.exec.orc.compression.strategy", "COMPRESSION");
    }
}).save("spark_orc_output");
Apart from this, I also tried setting these properties in hive-site.xml and on the hiveContext object.
hive --orcfiledump on the output confirms that the configurations were not applied. The orcfiledump snippet is below:
Compression: ZLIB
Compression size: 262144
You are making two different errors here. I don't blame you; I've been there...
Issue #1
orc.compress and the rest are not Spark DataFrameWriter options. They are Hive configuration properties that must be defined before creating the hiveContext object...
- either in the hive-site.xml available to Spark at launch time
- or in your code, by re-creating the SparkContext...
sc.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
sc.stop
val scAlt = new org.apache.spark.SparkContext((new org.apache.spark.SparkConf).set("orc.compress","snappy"))
scAlt.getConf.get("orc.compress","<undefined>") // will now be Snappy
val hiveContextAlt = new org.apache.spark.sql.SQLContext(scAlt)
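Since the question uses the Java API, here is a rough Java equivalent of the Scala snippet above (a sketch, assuming Spark 1.6 and that sc is the pre-existing SparkContext):
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.SQLContext;

// Stop the running context, then rebuild it with orc.compress defined up front.
sc.stop();
SparkContext scAlt = new SparkContext(new SparkConf().set("orc.compress", "snappy"));
scAlt.getConf().get("orc.compress", "<undefined>"); // will now be snappy
SQLContext sqlContextAlt = new SQLContext(scAlt);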
[Edit] with Spark 2.x the script would become...
spark.sparkContext.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
spark.close
val sparkAlt = org.apache.spark.sql.SparkSession.builder().config("orc.compress","snappy").getOrCreate()
sparkAlt.sparkContext.getConf.get("orc.compress","<undefined>") // will now be Snappy
Issue #2
Spark uses its own SerDe libraries for ORC (and Parquet, JSON, CSV, etc.), so it does not have to honor the standard Hadoop/Hive properties.
There are some Spark-specific properties for Parquet, and they are well documented. But again, these properties must be set before creating (or re-creating) the hiveContext.
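To illustrate that ordering, here is a minimal Java sketch for the Parquet case, assuming Spark 1.6 and the documented spark.sql.parquet.compression.codec property (the application name is illustrative):
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.hive.HiveContext;

// Define the Parquet codec on the SparkConf before any context exists,
// so the property is already in place when the hiveContext is created.
SparkConf conf = new SparkConf()
    .setAppName("parquet-codec-example")
    .set("spark.sql.parquet.compression.codec", "snappy");
SparkContext sc = new SparkContext(conf);
HiveContext hiveContext = new HiveContext(sc);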
For ORC and the other formats, you have to resort to format-specific DataFrameWriter options; quoting the latest JavaDoc...
You can set the following ORC-specific option(s) for writing ORC files:
• compression (default snappy): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, and lzo). This will override orc.compress.
Note that the default compression codec has changed with Spark 2; before that it was zlib.
So the only thing you can set is the compression codec, using
dataframe.write().format("orc").option("compression","snappy").save("wtf")
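Applied to the snippet from the question (reusing its dataframe and output path), the fix looks like this; re-running hive --orcfiledump afterwards should report Compression: SNAPPY:
// Spark's own "compression" option overrides orc.compress, per the JavaDoc quoted above.
dataframe.write()
    .format("orc")
    .option("compression", "snappy")
    .save("spark_orc_output");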