Spark dataFrame.coalesce(1) or dataFrame.repartition(1) does not seem to work


Problem description

I have a Hive INSERT INTO query that creates new Hive partitions. The table has two partition columns, named server and date. I execute the insert query using the following code and then try to save the result:

DataFrame dframe = hiveContext.sql("insert into summary1 partition(server='a1',date='2015-05-22') select from sourcetbl bla bla"); 
// the above query creates ORC files at /user/db/a1/20-05-22 
// I want only one part-00000 file at the end, so I tried the following and none of them worked 
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1"); OR

dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1"); OR

dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22"); OR

dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22");

No matter whether I use coalesce or repartition, the query above creates around 200 small files of roughly 20 MB each at /user/db/a1/20-05-22. For performance reasons I want only one part-00000 file when the data is queried through Hive. I thought that calling coalesce(1) would produce a single final part file, but that does not seem to happen. Am I wrong?

Recommended answer

Repartition controls how many pieces the data is split into while the Spark job runs; the actual saving of the files, however, is handled by the Hadoop cluster.

Or at least that is how I understand it. You can also see the same question answered here: http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3CCA+2Pv=hF5SGC-SWTwTMh6zK2JeoHF1OHPb=WG94vp2GW-vL5SQ@mail.gmail.com%3E
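As a rough illustration of that point, here is a minimal sketch against the Spark 1.x Java API used in the question (the SELECT below is only a placeholder and hiveContext is the same context as above): when Spark itself performs the write, coalescing to one partition is what should leave a single part file in the target directory.

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;

// Hypothetical sketch: select the rows with a plain query instead of running
// an INSERT statement, so the DataFrame you coalesce is the one Spark writes.
DataFrame rows = hiveContext.sql("select * from sourcetbl"); // placeholder query

// A single partition in the final stage should mean a single part-00000 file
// in the output directory, because each partition is written by one task.
rows.coalesce(1)
    .write()
    .format("orc")
    .mode(SaveMode.Overwrite)
    .save("/user/db/a1/20-05-22");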

It should not really matter though; why are you set on a single file? getmerge will stitch the output together for you if it is just for your own system.
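For what it is worth, the getmerge suggestion above boils down to a single HDFS shell command along these lines (the local destination path is just an example); keep in mind that getmerge simply concatenates the part files, which is fine for plain-text output but will not yield a valid single ORC file:

hadoop fs -getmerge /user/db/a1/20-05-22 /tmp/summary1-merged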
