How to count partitions with FileSystem API?

Question

I am using Hadoop version 2.7 and its FileSystem API. The question is about "how to count partitions with the API?" but, to put it into a software problem, I am copying here a Spark-Shell script... The concrete question about the script is:

Is the variable parts below counting the number of table partitions, or something else?

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer
import spark.implicits._

val warehouse = "/apps/hive/warehouse"  // the Hive default location for all databases
val db_regex  = """\.db$""".r             // filter for names like "*.db"
val tab_regex = """\.hive\-staging_""".r  // signature of Hive staging (temporary) paths

val trStrange = "[\\s/]+|[^\\x00-\\x7F]+|[\\p{Cntrl}&&[^\r\n\t]]+|\\p{C}+".r  // strange characters to mask
def cutPath(thePath: String, toCut: Boolean = true): String =
  if (toCut) trStrange.replaceAllIn(thePath.replaceAll("^.+/", ""), "@") else thePath

val fs_get = FileSystem.get(sc.hadoopConfiguration)
fs_get.listStatus(new Path(warehouse)).foreach( lsb => {
    val b = lsb.getPath.toString
    if (db_regex.findFirstIn(b).isDefined)
       fs_get.listStatus(new Path(b)).foreach( lst => {
            val lstPath = lst.getPath
            val t = lstPath.toString
            var parts = -1   // -1 signals "could not list"
            var size = -1L   // -1 signals "could not measure"
            if (!tab_regex.findFirstIn(t).isDefined) {
              try {
                  val pp = fs_get.listStatus(lstPath)
                  parts = pp.length // !HERE! partitions?
                  size = 0L         // listing succeeded, start accumulating
                  pp.foreach( p => {
                     try { // SUPPOSING that size is the number of bytes of table t
                        size = size + fs_get.getContentSummary(p.getPath).getLength
                     } catch { case _: Throwable => }  // ignore unreadable entries
                  })
              } catch { case _: Throwable => }  // ignore unreadable tables
              println(s"${cutPath(b)},${cutPath(t)},$parts,$size")
            }
        })
}) // warehouse loop
System.exit(0)  // get out from spark-shell

This is only an example to show the focus: a correct scan and semantic interpretation of the Hive default database's filesystem structure, using the Hadoop FileSystem API. The script sometimes needs some memory, but it works fine. Run it with
sshell --driver-memory 12G --executor-memory 18G -i teste_v2.scala > output.csv

Note: the aim here is not to count partitions by any other method (e.g. HQL DESCRIBE or the Spark schema), but to use the API for it... For control and for data quality checks, the API is important as a kind of "lower-level measurement".

Answer

Hive structures its metadata as database > table > partition > files. This typically translates into the filesystem directory structure <hive.warehouse.dir>/database.db/table/partition/.../files, where /partition/.../ signifies that tables can be partitioned by multiple columns, thus creating nested levels of subdirectories. (By convention, a partition is a directory named .../partition_column=value.)
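
For example, a table sales partitioned by (year, month) in a hypothetical database mydb (the names are illustrative, not from the question) would appear on the filesystem as:

/apps/hive/warehouse/mydb.db/sales/year=2024/month=01/part-00000
/apps/hive/warehouse/mydb.db/sales/year=2024/month=02/part-00000

Each month=... directory is one partition, sitting two levels below the table directory.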

So, if I'm not mistaken, it seems your script will print the number of files (parts) and their total length (size) for each single-column partitioned table in each of your databases.
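
If the goal is to count the partitions themselves, including multi-column (nested) ones, rather than whatever happens to sit at the first level under the table directory, one option is to recurse into subdirectories whose names contain =. Below is a minimal sketch against the same FileSystem API, assuming the spark-shell context above (fs_get, warehouse); the helper countPartitions is mine, not from the question:

import org.apache.hadoop.fs.{FileSystem, Path}

// Count leaf "column=value" directories under a table path.
// Returns 0 for an unpartitioned table (no "=" subdirectories).
def countPartitions(fs: FileSystem, tablePath: Path): Int = {
  val dirs = fs.listStatus(tablePath).filter(_.isDirectory)
  val partDirs = dirs.filter(_.getPath.getName.contains("="))
  if (partDirs.isEmpty) 0
  else partDirs.map { d =>
    val below = countPartitions(fs, d.getPath)
    if (below == 0) 1 else below  // a leaf "col=value" dir counts as one partition
  }.sum
}

// e.g. countPartitions(fs_get, new Path(s"$warehouse/mydb.db/mytable"))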

As an alternative, I'd suggest you look at the hdfs dfs -count command to see if it suits your needs, and maybe wrap it in a simple shell script to loop through the databases and tables.
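
For reference, hdfs dfs -count prints DIR_COUNT, FILE_COUNT, CONTENT_SIZE and PATHNAME for each given path, so one call per table yields both a directory count and a byte size (the path and numbers below are illustrative):

hdfs dfs -count /apps/hive/warehouse/mydb.db/mytable
          13           24           987654321 /apps/hive/warehouse/mydb.db/mytable

Note that DIR_COUNT includes the table directory itself and every intermediate partition level, not just the leaf partitions.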
