Spark dataFrame.coalesce(1) or dataFrame.repartition(1) does not seem to work for me


Question

Hi, I have a Hive INSERT INTO query that creates new Hive partitions. The table has two Hive partition columns, server and date. I execute the INSERT INTO query with the following code and then try to save the result:

DataFrame dframe = hiveContext.sql("insert into summary1 partition(server='a1',date='2015-05-22') select from sourcetbl bla bla");
// The query above creates an ORC file at /user/db/a1/20-05-22
// I want only one part-00000 file at the end of the query, so I tried each of the following and none worked
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1");
// OR
dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1");
// OR
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22");
// OR
dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22");

No matter whether I use coalesce or repartition, the query above creates around 200 small files of around 20 MB at /user/db/a1/20-05-22. I want only one part-00000 file, for performance reasons when using Hive. I thought that calling coalesce(1) would produce a single final part file, but that does not seem to happen. Am I wrong? Please guide me. Thanks in advance.

Recommended answer

Repartition controls how many pieces the data is split into while the Spark job runs; the actual saving of the files, however, is managed by the Hadoop cluster.

At least, that's how I understand it. You can also see the same question answered here: http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3CCA+2Pv=hF5SGC-SWTwTMh6zK2JeoHF1OHPb=WG94vp2GW-vL5SQ@mail.gmail.com%3E
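To make the relationship between partitions and output files concrete, here is a minimal toy model in plain Java. It is not Spark code and does not use any Spark API; the class name PartitionFilesSketch and the helpers split and write are hypothetical, introduced only for illustration. The assumption it encodes is the usual one for partitioned writers: each partition that reaches the write step produces its own part file, so the file count follows the partition count at the moment of writing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Toy model only: plain Java collections standing in for a partitioned dataset.
public class PartitionFilesSketch {

    // Distributes rows round-robin over numPartitions in-memory partitions.
    static List<List<String>> split(List<String> rows, int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < rows.size(); i++) {
            parts.get(i % numPartitions).add(rows.get(i));
        }
        return parts;
    }

    // Writes one part-NNNNN file per partition and returns the number of files written.
    static int write(List<List<String>> partitions, Path dir) throws IOException {
        Files.createDirectories(dir);
        int written = 0;
        for (int i = 0; i < partitions.size(); i++) {
            Path file = dir.resolve(String.format("part-%05d", i));
            Files.write(file, partitions.get(i));
            written++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add("row-" + i);
        }
        Path base = Files.createTempDirectory("parts");
        // 200 partitions at write time -> 200 part files
        System.out.println(write(split(rows, 200), base.resolve("many")));
        // a single partition at write time -> a single part file
        System.out.println(write(split(rows, 1), base.resolve("one")));
    }
}
```

In this model, reducing the partition count only changes the number of files if it happens before the step that actually writes them.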

This shouldn't matter, though. Why are you set on a single file? If it's just for your own system, hadoop fs -getmerge will merge the part files into one for you.
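As a rough picture of what merging part files involves, the sketch below concatenates local part-* files in name order into a single file using only the Java standard library. It is an illustration under stated assumptions, not the Hadoop implementation: the class MergePartsSketch and its merge method are hypothetical, and it works on the local file system rather than HDFS. Byte-level concatenation like this suits line-oriented text; the thread does not say whether it yields a usable file for the ORC output in the question.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: local-file concatenation, not the Hadoop getmerge code.
public class MergePartsSketch {

    // Concatenates every part-* file in dir, in name order, into target.
    // Returns the number of part files that were merged.
    static int merge(Path dir, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts); // part-00000, part-00001, ...
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path p : parts) {
                Files.copy(p, out); // append the bytes of this part file
            }
        }
        return parts.size();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("merge-demo");
        Files.write(dir.resolve("part-00000"), "first\n".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("part-00001"), "second\n".getBytes(StandardCharsets.UTF_8));
        Path merged = dir.resolve("merged.txt");
        int count = merge(dir, merged);
        System.out.println(count + " part files merged into " + merged);
    }
}
```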
