Spark dataFrame.coalesce(1) or dataFrame.repartition(1) does not seem to work


Problem description

I have a Hive insert-into query that creates new Hive partitions. The table has two Hive partition columns, server and date. I execute the insert-into query with the following code and then try to save the result:

DataFrame dframe = hiveContext.sql("insert into summary1 partition(server='a1',date='2015-05-22') select from sourcetbl bla bla");
// The query above creates ORC files at /user/db/a1/20-05-22.
// I want only one part-00000 file at the end of the query, so I tried each of the following; none worked.
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1");
// OR
dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1");
// OR
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22");
// OR
dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22");

Whether I use coalesce or repartition, the query above creates around 200 small files of about 20 MB each at /user/db/a1/20-05-22. For performance reasons when using Hive, I want only one part-00000 file. I thought that calling coalesce(1) would produce a single final part file, but that does not seem to happen. Am I wrong?

Recommended answer

Repartition controls how many pieces the data is split into while the Spark job runs; the actual saving of the files, however, is managed by the Hadoop cluster.


Or at least that is how I understand it. You can also see the same question asked here: http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3CCA+2Pv=hF5SGC-SWTwTMh6zK2JeoHF1OHPb=WG94vp2GW-vL5SQ@mail.gmail.com%3E

This should not really matter, though. Why are you set on a single file? If it is just for your own system, getmerge will combine the part files into one for you.
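To make the getmerge suggestion concrete, here is a minimal local sketch of the idea behind hadoop fs -getmerge: the part files in a directory are concatenated into one file. The class PartFileMerger and its merge method are hypothetical names invented for this illustration; they are not part of Hadoop or Spark. The sketch also assumes plain-text part files on the local filesystem. ORC files carry their own footer, so concatenating them byte by byte would not be expected to yield a valid ORC file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not a Hadoop or Spark API.
final class PartFileMerger {

    // Concatenates every "part-*" file in dir, in file-name order, into target.
    // Returns the number of part files that were merged.
    static int merge(Path dir, Path target) throws IOException {
        List<Path> parts;
        try (Stream<Path> entries = Files.list(dir)) {
            parts = entries
                    .filter(p -> p.getFileName().toString().startsWith("part-"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // appends this part file's bytes to target
            }
        }
        return parts.size();
    }
}
```

Calling PartFileMerger.merge(dir, dir.resolve("merged.txt")) on a directory holding part-00000 and part-00001 writes both files, in that order, into merged.txt and skips marker files such as _SUCCESS.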
