How to rename files in HDFS from Spark more efficiently?
Problem Description
I have 450K JSON files, and I want to rename them in HDFS based on certain rules. For the sake of simplicity, I just add a suffix .finished to each of them.
I managed to do this with the following code:
import org.apache.hadoop.fs._

// Get a FileSystem handle from the Spark context's Hadoop configuration
val hdfs = FileSystem.get(sc.hadoopConfiguration)
val files = hdfs.listStatus(new Path(pathToJson))
val originalPath = files.map(_.getPath())

// Rename each file sequentially, appending the ".finished" suffix
for (i <- originalPath.indices) {
  hdfs.rename(originalPath(i), originalPath(i).suffix(".finished"))
}
But it takes 12 minutes to rename all of them. Is there a way to make it faster (perhaps by parallelizing)? I am using Spark 1.6.0.
Recommended Answer
originalPath.par.foreach(e => hdfs.rename(e, e.suffix(".finished")))
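A rename is a pure NameNode metadata operation, so the wall-clock time is dominated by the round-trip latency of 450K sequential RPC calls; the .par parallel collection hides that latency by issuing many renames concurrently. Below is a minimal sketch of the same idea with the thread-pool size made explicit. It assumes `sc` and `pathToJson` are in scope as in the question, and the 64-thread pool is an illustrative value, not a tuned one:

```scala
import org.apache.hadoop.fs._
import scala.collection.parallel.ForkJoinTaskSupport

val hdfs = FileSystem.get(sc.hadoopConfiguration)
val originalPath = hdfs.listStatus(new Path(pathToJson)).map(_.getPath())

// Convert the array to a parallel collection. By default it uses a pool
// sized to the number of local cores, which may under- or over-drive
// the NameNode.
val parPaths = originalPath.par

// Optionally cap the concurrency with an explicit ForkJoinPool
// (64 threads here is an assumed value; tune against your NameNode).
parPaths.tasksupport = new ForkJoinTaskSupport(
  new scala.concurrent.forkjoin.ForkJoinPool(64))

// Each rename is an independent metadata RPC, so they can run concurrently.
parPaths.foreach(p => hdfs.rename(p, p.suffix(".finished")))
```

Note that the renames still run on the driver, not on the executors; the parallel collection just overlaps the RPC round trips, which is usually enough since the NameNode, not the driver CPU, is the bottleneck.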