How to count partitions with FileSystem API?
Question
I am using Hadoop version 2.7 and its FileSystem API. The question is about "how to count partitions with the API?", but, to put it into a software problem, I am copying here a spark-shell script... The concrete question about the script is:
Is the variable parts below counting the number of table partitions, or something else?
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer
import spark.implicits._

val warehouse = "/apps/hive/warehouse"   // the Hive default location for all databases
val db_regex = """\.db$""".r             // filter for names like "*.db"
val tab_regex = """\.hive\-staging_""".r // signature of Hive staging files, to be skipped
val trStrange = "[\\s/]+|[^\\x00-\\x7F]+|[\\p{Cntrl}&&[^\r\n\t]]+|\\p{C}+".r // mark strange characters

def cutPath(thePath: String, toCut: Boolean = true): String =
  if (toCut) trStrange.replaceAllIn(thePath.replaceAll("^.+/", ""), "@") else thePath

val fs_get = FileSystem.get(sc.hadoopConfiguration)
fs_get.listStatus(new Path(warehouse)).foreach(lsb => {
  val b = lsb.getPath.toString
  if (db_regex.findFirstIn(b).isDefined)
    fs_get.listStatus(new Path(b)).foreach(lst => {
      val lstPath = lst.getPath
      val t = lstPath.toString
      var parts = -1   // -1 signals that listing the table directory failed
      var size = -1L   // -1 signals that the size is unknown
      if (!tab_regex.findFirstIn(t).isDefined) {
        try {
          val pp = fs_get.listStatus(lstPath)
          parts = pp.length // !HERE! partitions?
          size = 0L
          pp.foreach(p => {
            try { // SUPPOSING that size is the number of bytes of table t
              size = size + fs_get.getContentSummary(p.getPath).getLength
            } catch { case _: Throwable => }
          })
        } catch { case _: Throwable => }
        println(s"${cutPath(b)},${cutPath(t)},$parts,$size")
      }
    })
}) // warehouse loop
System.exit(0) // get out from spark-shell
This is only an example to show the focus: a correct scan and semantic interpretation of the Hive default database FileSystem structure, using the Hadoop FileSystem API. The script sometimes needs some memory, but it works fine. Run it with:

sshell --driver-memory 12G --executor-memory 18G -i teste_v2.scala > output.csv
Note: the aim here is not to count partitions by any other method (e.g. HQL DESCRIBE or Spark Schema), but to use the API for it... For control and for data quality checks, the API is important as a kind of "lower-level measurement".
Answer
Hive structures its metadata as database > tables > partitions > files. This typically translates into the filesystem directory structure <hive.warehouse.dir>/database.db/table/partition/.../files, where /partition/.../ signifies that tables can be partitioned by multiple columns, thus creating nested levels of subdirectories. (By convention, a partition is a directory named .../partition_column=value.)
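Given that naming convention, whether a directory is a partition can be checked from its path alone. A minimal sketch in plain Scala (no Hadoop dependency; the helper names isPartitionDir and partitionDepth are mine, not part of any API):

```scala
// By convention a Hive partition directory's last path segment has the
// form "column=value" (e.g. ".../dt=2021-01-01").
def isPartitionDir(path: String): Boolean = {
  val name = path.stripSuffix("/").split('/').last
  val i = name.indexOf('=')
  i > 0 && i < name.length - 1 // non-empty column name and non-empty value
}

// Number of trailing "column=value" segments, i.e. the partition-column
// depth of a nested layout like .../table/country=BR/dt=2021-01-01
def partitionDepth(path: String): Int =
  path.stripSuffix("/").split('/').reverse
    .takeWhile { s => val i = s.indexOf('='); i > 0 && i < s.length - 1 }
    .length
```

With the real API the same check would be applied to FileStatus.getPath.getName for each entry returned by fs.listStatus.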
So it seems your script will be printing the number of first-level entries (parts) and their total length (size) for each single-column partitioned table in each of your databases, if I'm not mistaken.
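For tables partitioned by multiple columns, the listing has to recurse down to the leaf directories instead of stopping at the first level. A sketch of that logic, run against a local directory tree via java.nio as a stand-in for the Hadoop FileSystem (with the real API, Files.isDirectory maps to FileStatus.isDirectory and Files.list to fs.listStatus; the helper name countLeafPartitions is mine):

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Count leaf directories under a table directory: descend while children
// are subdirectories; a directory with no subdirectories is one partition.
// Caveat: an unpartitioned table (files only) is counted as 1.
def countLeafPartitions(dir: Path): Int = {
  val stream = Files.list(dir)
  val subdirs =
    try stream.iterator.asScala.filter(Files.isDirectory(_)).toList
    finally stream.close()
  if (subdirs.isEmpty) 1 else subdirs.map(countLeafPartitions).sum
}
```

For a layout like table/country=BR/dt=.../ with two dt directories under country=BR and one under country=US, this returns 3, which matches what HQL would report as the partition count.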
As an alternative, I'd suggest you look at the hdfs dfs -count command to see if it suits your needs, and maybe wrap it in a simple shell script to loop through the databases and tables.