Disable parquet metadata summary in Spark


Question

I have a Spark job (on 1.4.1) receiving a stream of Kafka events. I would like to save them continuously as Parquet on Tachyon.

import org.apache.spark.sql.SaveMode
import org.apache.spark.streaming.{Duration, Seconds}
import org.apache.spark.streaming.kafka.KafkaUtils

// Keep only the message payload from each (key, message) pair.
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)

lines.window(Seconds(1), Seconds(1)).foreachRDD { (rdd, time) =>
  if (rdd.count() > 0) {
    // Floor the batch time to the day (86400000 ms) so every batch of a
    // given day appends to the same Parquet directory.
    val mil = time.floor(Duration(86400000)).milliseconds
    hiveContext.read.json(rdd).toDF().write.mode(SaveMode.Append).parquet(s"tachyon://192.168.1.12:19998/persisted5$mil")
    hiveContext.sql(s"CREATE TABLE IF NOT EXISTS persisted5$mil USING org.apache.spark.sql.parquet OPTIONS ( path 'tachyon://192.168.1.12:19998/persisted5$mil')")
  }
}

However, I see that as time goes on, every Parquet write makes Spark open each of the 1-second Parquet part files written so far, which gets slower and slower:

15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-db03b24d-6f98-4b5d-bb40-530f35b82633.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-3a7857e2-0435-4ee0-ab2c-6d40224f8842.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-47ff2ac1-da00-4473-b3f7-52640014bc5b.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-61625436-7353-4b1e-bb8d-e8afad3a582e.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-e711aa9a-9bf5-41d5-8523-f5edafa69626.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-4e0cca38-cf75-4771-8965-20a30c863100.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-d1510ed4-2c99-43e2-b3d1-38d3d54e626d.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-022d1918-392d-433f-a7f4-074e46b4460f.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-cf71f5d2-ba0e-4729-9aa1-41dad5d1d08f.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-ce990b1e-82cc-4feb-a162-ac3ddc275609.gz.parquet, 65536)

I came to the conclusion that this is due to the update of the summary metadata files, which I believe Spark does not make use of, so I would like to disable them.

The Parquet sources show that I should be able to set "parquet.enable.summary-metadata" to false.

Now, I have tried setting it like this, right after creating the hiveContext:

hiveContext.sparkContext.hadoopConfiguration.setBoolean("parquet.enable.summary-metadata", false)
hiveContext.sparkContext.hadoopConfiguration.setInt("parquet.metadata.read.parallelism", 10) 

But without success; I also still get logs showing a read parallelism of 5 (the default).

What is the correct way to disable Parquet summary metadata in Spark?

Answer

将"parquet.enable.summary-metadata"设置为文本("false"而不是false)似乎对我们有用.

setting "parquet.enable.summary-metadata" as text ("false" and not false) seems to work for us.

By the way, Spark does use the _common_metadata file (we copy that over manually for repetitive jobs).
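
A hypothetical sketch of that manual copy step using the Hadoop FileSystem API; the source/destination paths are placeholders, not from the original answer:

import org.apache.hadoop.fs.{FileUtil, Path}

// Hypothetical: reuse a previous day's _common_metadata in the new directory
// so Spark can pick up the schema without scanning every part file's footer.
val conf = hiveContext.sparkContext.hadoopConfiguration
val src = new Path("tachyon://192.168.1.12:19998/persisted5OLD/_common_metadata")  // placeholder
val dst = new Path("tachyon://192.168.1.12:19998/persisted5NEW/_common_metadata")  // placeholder
FileUtil.copy(src.getFileSystem(conf), src, dst.getFileSystem(conf), dst, /* deleteSource = */ false, conf)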
