How to overwrite the RDD saveAsPickleFile(path) output if the file already exists in PySpark?
Question
How do I overwrite the RDD output at an existing path when saving?
Input file test1:
975078|56691|2.000|20171001_926_570_1322
975078|42993|1.690|20171001_926_570_1322
975078|46462|2.000|20171001_926_570_1322
975078|87815|1.000|20171001_926_570_1322
from pyspark.sql import Row

rdd = sc.textFile('/home/administrator/work/test1') \
    .map(lambda x: x.split("|")[:4]) \
    .map(lambda r: Row(user_code=r[0], item_code=r[1], qty=float(r[2])))
rdd.coalesce(1).saveAsPickleFile("/home/administrator/work/foobar_seq1")
The first time it saves properly. Then I removed one line from the input file and tried to save the RDD to the same location, and it reports that the file already exists.
rdd.coalesce(1).saveAsPickleFile("/home/administrator/work/foobar_seq1")
For example, with a DataFrame we can overwrite an existing path:
df.coalesce(1).write.mode("overwrite").save(path)
If I try the same on the RDD object, I get an error:
rdd.coalesce(1).write().overwrite().saveAsPickleFile(path)
Please help me.
Answer
You can save RDD output as shown below. Note that the code is in Scala, but the logic is the same for Python; I am using Spark 2.3.0. The key is setting spark.hadoop.validateOutputSpecs to false, which disables the check that throws when the output path already exists.
import org.apache.spark.{SparkConf, SparkContext}

// validateOutputSpecs = false skips the "output directory already exists" check
val sconf = new SparkConf().set("spark.hadoop.validateOutputSpecs", "false").setMaster("local[*]").setAppName("test")
val scontext = new SparkContext(sconf)
val lines = scontext.textFile(s"${filePath}", 1)
println(lines.first)
lines.saveAsTextFile("C:\\Users\\...\\Desktop\\sample2")
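Since the question is about PySpark, here is a minimal sketch of the same idea in Python. It builds its own SparkContext for illustration; the sample data and output path are assumptions, not from the original question.

from pyspark import SparkConf, SparkContext

# Assumption: a fresh SparkContext. Setting validateOutputSpecs to false
# disables the existence check, so saveAsPickleFile can write to a path
# that already exists.
conf = (SparkConf()
        .set("spark.hadoop.validateOutputSpecs", "false")
        .setMaster("local[*]")
        .setAppName("overwrite-pickle"))
sc = SparkContext(conf=conf)

rdd = sc.parallelize([("975078", "56691", 2.0)])  # hypothetical sample data
rdd.coalesce(1).saveAsPickleFile("/home/administrator/work/foobar_seq1")

One caveat: disabling validation only skips the existence check; it does not clean the directory first, so stale part files from an earlier save with more partitions may remain in the output path.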
Or, if you are working with a DataFrame, use:
import org.apache.spark.sql.SaveMode
DF.write.mode(SaveMode.Overwrite).parquet(path)
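In PySpark, the equivalent takes the mode as a string; the df and path names here are assumptions for illustration:

df.coalesce(1).write.mode("overwrite").save(path)
# or, for Parquet output specifically:
df.coalesce(1).write.mode("overwrite").parquet(path)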