Dataframe save after join is creating numerous part files
Question
I am trying to learn programming with DataFrames. With the code below I am joining two CSVs on a column and then saving the result as a combined CSV. Running this code in the Scala IDE, I see almost 200 small part files in the output. Could you please help me understand what is going wrong here?
import org.apache.spark.SparkContext

object JoinData {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "Csv Joining example")
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Load both CSVs (with headers) via the spark-csv package
    val df1 = sqlContext.load("com.databricks.spark.csv", Map("path" -> args(1), "header" -> "true"))
    val df2 = sqlContext.load("com.databricks.spark.csv", Map("path" -> args(2), "header" -> "true"))

    import org.apache.spark.sql.functions._

    val df_join = df1.join(df2, df1("Dept") === df2("dept"), "inner")

    df_join.repartition(1) // This is also not helping

    // Below line is generating 200 part files in the Output_join folder
    df_join.save("Output_join", "com.databricks.spark.csv", org.apache.spark.sql.SaveMode.Overwrite)
  }
}
Program arguments used: local src/main/resources/emp.csv src/main/resources/dept.csv
CSV data being used

emp.csv:

empId,empName,Dept,salution
111,ABC,sales,mr
112,ABC,it,mr
113,ABC,tech,mr
114,ABC,sales,mr
115,ABC,sales,mr
116,ABC,it,mr
117,ABC,tech,mr

dept.csv:

dept,name
sales,Sales of Cap
it,Internal Training
tech,Tech staff
support,support services
Console output (truncated)
[Stage 4:> (2 + 1) / 200]
[Stage 4:=> (4 + 1) / 200]
[Stage 4:==> (8 + 1) / 200]
...
[Stage 4:=====================================================> (196 + 1) / 200]
[Stage 4:======================================================> (199 + 1) / 200]
Answer
Changing the default number of shuffle partitions that Spark uses when joining (spark.sql.shuffle.partitions, which defaults to 200) solved this problem:

sqlContext.setConf("spark.sql.shuffle.partitions", "2")
https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#other-configuration-options
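Two other points are worth noting about the original code: repartition returns a new DataFrame rather than modifying df_join in place, so calling it without using the result is a no-op; and in Spark 2.x+ the spark-csv package is built in via the DataFrameReader/DataFrameWriter API. A minimal sketch of the fixed job, assuming a Spark 2.x SparkSession (names like Output_join kept from the question; this is an illustrative adaptation, not the asker's original environment):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object JoinData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Csv Joining example").getOrCreate()

    // Keep the join's shuffle from fanning out into 200 tasks (and files)
    spark.conf.set("spark.sql.shuffle.partitions", "2")

    val df1 = spark.read.option("header", "true").csv(args(0)) // emp.csv
    val df2 = spark.read.option("header", "true").csv(args(1)) // dept.csv

    val joined = df1.join(df2, df1("Dept") === df2("dept"), "inner")

    // repartition returns a NEW DataFrame, so chain it into the write
    // to get a single part file in Output_join
    joined.repartition(1)
      .write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .csv("Output_join")

    spark.stop()
  }
}
```

Either fix alone reduces the file count: lowering spark.sql.shuffle.partitions shrinks the number of post-join partitions, while chaining repartition(1) into the write collapses the output to one part file regardless of the shuffle setting.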