Using Spark Scala in EMR to get S3 Object size (folder, files)


Problem description

I am trying to get the folder size for some S3 folders with Scala from the command line on EMR.

I have JSON data stored as GZ files in S3. I find I can count the number of JSON records within my files:

spark.read.json("s3://mybucket/subfolder/subsubfolder/").count

But now I need to know how many GB that data accounts for.

I can find options to get the size of individual files, but not of a whole folder all at once.

Recommended answer

I can find options to get the size of individual files, but not of a whole folder all at once.

Solution:

Option 1:

Get S3 access through the Hadoop FileSystem API:

    val fs = FileSystem.get(new URI(ipPath), spark.sparkContext.hadoopConfiguration)

Notes:

1) new URI is important, otherwise it will connect to the default Hadoop file system path instead of the S3 file system (object store :-)) path. By passing a new URI you supply the s3:// scheme here.

2) org.apache.commons.io.FileUtils.byteCountToDisplaySize will render the size in human-readable units (GB, MB, etc.).
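
A minimal sketch of the setup the FileSystem line above assumes; the imports and the ipPath value (reusing the bucket/prefix from the question) are illustrative assumptions, not part of the original answer:

    import java.net.URI
    import org.apache.hadoop.fs.FileSystem

    // Hypothetical example prefix; any s3:// URI works here.
    val ipPath = "s3://mybucket/subfolder/subsubfolder/"

    // Passing new URI(ipPath) makes Hadoop resolve the s3:// scheme
    // instead of falling back to the cluster's default file system.
    val fs = FileSystem.get(new URI(ipPath), spark.sparkContext.hadoopConfiguration)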

    import java.io.{FileNotFoundException, IOException}

    /**
      * Recursively print file sizes under filePath and collect the paths of the non-empty files.
      *
      * @param filePath root path to scan
      * @param fs       file system obtained above (S3 here)
      * @return list of non-empty file paths found under filePath
      */
    @throws[FileNotFoundException]
    @throws[IOException]
    def getDisplaysizesOfS3Files(filePath: org.apache.hadoop.fs.Path, fs: org.apache.hadoop.fs.FileSystem): scala.collection.mutable.ListBuffer[String] = {
      val fileList = new scala.collection.mutable.ListBuffer[String]
      val fileStatus = fs.listStatus(filePath)
      for (fileStat <- fileStatus) {
        println(s"file path name: ${fileStat.getPath.toString} length is ${fileStat.getLen}")
        if (fileStat.isDirectory) {
          // recurse into sub-directories
          fileList ++= getDisplaysizesOfS3Files(fileStat.getPath, fs)
        } else if (fileStat.getLen > 0) {
          // non-empty file: record its path and print its size
          fileList += fileStat.getPath.toString
          val size = fileStat.getLen
          val display = org.apache.commons.io.FileUtils.byteCountToDisplaySize(size)
          println("Name    = " + fileStat.getPath.getName)
          println("Size    = " + size)
          println("Display = " + display)
        } else {
          // zero-length files
          println(s"length zero file:\n $fileStat")
        }
      }
      fileList
    }

Based on your requirements, you can modify the code; for example, you can sum up the sizes of all the distinct files, as in the sketch below.
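
A minimal sketch of that summation, assuming the fs and getDisplaysizesOfS3Files definitions above and reusing the bucket/prefix from the question:

    val paths = getDisplaysizesOfS3Files(new org.apache.hadoop.fs.Path("s3://mybucket/subfolder/subsubfolder/"), fs)
    // Sum the length of every collected file and print the total in human-readable form.
    val totalBytes = paths.map(p => fs.getFileStatus(new org.apache.hadoop.fs.Path(p)).getLen).sum
    println("Total   = " + org.apache.commons.io.FileUtils.byteCountToDisplaySize(totalBytes))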

Option 2: Using getContentSummary

    import org.apache.spark.sql.SparkSession

    implicit val spark = SparkSession.builder().appName("ObjectSummary").getOrCreate()

    /**
      * Print the total size of all objects under the given path using getContentSummary.
      *
      * @param path  path to measure (local, hdfs:// or s3://)
      * @param spark [[org.apache.spark.sql.SparkSession]]
      */
    def getDisplaysizesOfS3Files(path: String)(implicit spark: SparkSession): Unit = {
      val filePath = new org.apache.hadoop.fs.Path(path)
      val fileSystem = filePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
      val size = fileSystem.getContentSummary(filePath).getLength
      val display = org.apache.commons.io.FileUtils.byteCountToDisplaySize(size)
      println("path    = " + path)
      println("Size    = " + size)
      println("Display = " + display)
    }
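
A minimal usage sketch, reusing the bucket/prefix from the question (illustrative, not from the original answer):

    // Prints the total size of everything under the prefix, in bytes and in human-readable form.
    getDisplaysizesOfS3Files("s3://mybucket/subfolder/subsubfolder/")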

Note: all of the options shown above work for local, HDFS, or S3 paths as well.
