How to rename S3 files (not HDFS) in Spark Scala
Question
I have approximately 1 million text files stored in S3. I want to rename all of them based on their folder names.

How can I do that in Spark/Scala? I am looking for some sample code.

I am using Zeppelin to run my Spark script.
I tried the code below, as suggested in the answer:
import org.apache.hadoop.fs._
val src = new Path("s3://trfsmallfffile/FinancialLineItem/MAIN")
val dest = new Path("s3://trfsmallfffile/FinancialLineItem/MAIN/dest")
val conf = sc.hadoopConfiguration // assuming sc = spark context
val fs = Path.getFileSystem(conf)
fs.rename(src, dest)
but got the following error:
<console>:110: error: value getFileSystem is not a member of object org.apache.hadoop.fs.Path
val fs = Path.getFileSystem(conf)
Answer
You can use the normal HDFS APIs, something like (typed in, not tested):
val src = new Path("s3a://bucket/data/src")
val dest = new Path("s3a://bucket/data/dest")
val conf = sc.hadoopConfiguration // assuming sc = spark context
val fs = src.getFileSystem(conf)
fs.rename(src, dest)
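Since the question is about renaming roughly a million files based on their folder names, the single rename above would need to be applied per file. A minimal sketch of that idea follows; the bucket layout, the `destFor` naming scheme, and the helper itself are assumptions for illustration, not part of the original answer:

```scala
// Hypothetical layout: s3a://bucket/data/<folder>/<file>
// Goal (per the question): rename each file to embed its parent folder name.

// Pure helper: compute the destination key from a source key by
// prefixing the file name with its parent folder name.
def destFor(srcKey: String): String = {
  val parts  = srcKey.split("/")
  val folder = parts(parts.length - 2) // parent folder name
  val file   = parts.last
  (parts.dropRight(1) :+ s"${folder}_$file").mkString("/")
}

// With the FileSystem from the answer, the driving loop would be roughly:
//   val it = fs.listFiles(new Path("s3a://bucket/data"), true) // recursive
//   while (it.hasNext) {
//     val src = it.next().getPath
//     fs.rename(src, new Path(destFor(src.toString)))
//   }
```

Note that this loop runs sequentially on the driver; as the answer explains below, parallelizing it against S3 can trigger throttling rather than speed things up.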
The way the S3A client fakes a rename is a copy + delete of every file, so the time it takes is proportional to the number of files and the amount of data. And S3 throttles you: if you try to do this in parallel, it may actually slow you down. Don't be surprised if it takes a while.
You are also billed per COPY call, at $0.005 per 1,000 calls, so it will cost you about $5 to try. Test on a small directory until you are sure everything is working.
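The ~$5 figure follows directly from the quoted rate, since each "renamed" file incurs one COPY call. A quick sanity check (using the rate quoted above; actual S3 pricing may differ by region and may have changed):

```scala
// S3A "rename" issues one COPY (plus a DELETE) per file, so for n files
// the COPY charge is roughly n / 1000 * $0.005.
def copyCostUSD(nFiles: Long): Double = nFiles / 1000.0 * 0.005

println(copyCostUSD(1000000L)) // ~5.0 USD for ~1 million files
```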