How to count partitions with FileSystem API?
Question
I am using Hadoop version 2.7 and its FileSystem API. The question is about "how to count partitions with the API?", but, to put it into a software problem, I am copying here a spark-shell script... The concrete question about the script is:
Is the variable parts below counting the number of table partitions, or something else?
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer
import spark.implicits._

val warehouse = "/apps/hive/warehouse"   // the Hive default location for all databases
val db_regex = """\.db$""".r             // filter for names like "*.db"
val tab_regex = """\.hive\-staging_""".r // signature of Hive staging files, to be skipped
val trStrange = "[\\s/]+|[^\\x00-\\x7F]+|[\\p{Cntrl}&&[^\r\n\t]]+|\\p{C}+".r // mark strange characters

def cutPath(thePath: String, toCut: Boolean = true): String =
  if (toCut) trStrange.replaceAllIn(thePath.replaceAll("^.+/", ""), "@") else thePath

val fs_get = FileSystem.get(sc.hadoopConfiguration)
fs_get.listStatus(new Path(warehouse)).foreach(lsb => {
  val b = lsb.getPath.toString
  if (db_regex.findFirstIn(b).isDefined)
    fs_get.listStatus(new Path(b)).foreach(lst => {
      val lstPath = lst.getPath
      val t = lstPath.toString
      var parts = -1   // -1 signals that listing the table directory failed
      var size = -1L   // -1 signals that the size is unknown
      if (!tab_regex.findFirstIn(t).isDefined) {
        try {
          val pp = fs_get.listStatus(lstPath)
          parts = pp.length // !HERE! partitions?
          size = 0L
          pp.foreach(p => {
            try { // SUPPOSING that size is the number of bytes of table t
              size = size + fs_get.getContentSummary(p.getPath).getLength
            } catch { case _: Throwable => }
          })
        } catch { case _: Throwable => }
        println(s"${cutPath(b)},${cutPath(t)},$parts,$size")
      }
    })
}) // warehouse loop
System.exit(0) // get out from spark-shell
This is only an example to show the focus: a correct scan and semantic interpretation of the Hive default database FileSystem structure, using the Hadoop FileSystem API. The script sometimes needs some memory, but it works fine. Run it with:

sshell --driver-memory 12G --executor-memory 18G -i teste_v2.scala > output.csv
Note: the aim here is not to count partitions by any other method (e.g. HQL DESCRIBE or Spark Schema), but to use the API for it... For control and for data quality checks, the API is important as a kind of "lower-level measurement".
Answer
Hive structures its metadata as database > tables > partitions > files. This typically translates into the filesystem directory structure <hive.warehouse.dir>/database.db/table/partition/.../files, where /partition/.../ signifies that tables can be partitioned by multiple columns, thus creating nested levels of subdirectories. (By convention, a partition is a directory named .../partition_column=value.)
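Given that naming convention, whether a directory is a partition can be checked from its path alone. A minimal sketch in plain Scala (no Hadoop dependency; the helper names isPartitionDir and partitionDepth are mine, not part of any API):

```scala
// By convention a Hive partition directory's last path segment has the
// form "column=value" (e.g. ".../dt=2021-01-01").
def isPartitionDir(path: String): Boolean = {
  val name = path.stripSuffix("/").split('/').last
  val i = name.indexOf('=')
  i > 0 && i < name.length - 1 // non-empty column name and non-empty value
}

// Number of trailing "column=value" segments, i.e. the partition-column
// depth of a nested layout like .../table/country=BR/dt=2021-01-01
def partitionDepth(path: String): Int =
  path.stripSuffix("/").split('/').reverse
    .takeWhile { s => val i = s.indexOf('='); i > 0 && i < s.length - 1 }
    .length
```

With the real API the same check would be applied to FileStatus.getPath.getName for each entry returned by fs.listStatus.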
So it seems your script will be printing the number of first-level entries (parts) and their total length (size) for each single-column partitioned table in each of your databases, if I'm not mistaken.
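For tables partitioned by multiple columns, the listing has to recurse down to the leaf directories instead of stopping at the first level. A sketch of that logic, run against a local directory tree via java.nio as a stand-in for the Hadoop FileSystem (with the real API, Files.isDirectory maps to FileStatus.isDirectory and Files.list to fs.listStatus; the helper name countLeafPartitions is mine):

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Count leaf directories under a table directory: descend while children
// are subdirectories; a directory with no subdirectories is one partition.
// Caveat: an unpartitioned table (files only) is counted as 1.
def countLeafPartitions(dir: Path): Int = {
  val stream = Files.list(dir)
  val subdirs =
    try stream.iterator.asScala.filter(Files.isDirectory(_)).toList
    finally stream.close()
  if (subdirs.isEmpty) 1 else subdirs.map(countLeafPartitions).sum
}
```

For a layout like table/country=BR/dt=.../ with two dt directories under country=BR and one under country=US, this returns 3, which matches what HQL would report as the partition count.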
As an alternative, I'd suggest you look at the hdfs dfs -count command to see if it suits your needs, and maybe wrap it in a simple shell script to loop through the databases and tables.