Spark's int96 time type


Question


When you create a timestamp column in Spark and save it to parquet, you get a 12-byte integer column type (int96); I gather the data is split into 8 bytes for the nanoseconds within the day and 4 bytes for the Julian day number.

This does not conform to any parquet logical type, so the schema in the parquet file gives no indication that the column is anything other than an integer.

My question is, how does Spark know to load such a column as a timestamp as opposed to a big integer?
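
To make that layout concrete, here is a minimal decoding sketch (assuming the little-endian byte order used by Impala-style writers; decodeInt96 is a hypothetical helper, not part of any Spark or Parquet API):

import java.nio.{ByteBuffer, ByteOrder}
import java.time.{LocalDate, LocalDateTime, LocalTime}

// Hypothetical helper: decode one raw 12-byte INT96 value.
// Bytes 0-7: nanoseconds within the day; bytes 8-11: Julian day number.
def decodeInt96(bytes: Array[Byte]): LocalDateTime = {
  val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
  val nanosOfDay = buf.getLong
  val julianDay = buf.getInt
  // Julian day 2440588 corresponds to 1970-01-01 (the Unix epoch)
  val date = LocalDate.ofEpochDay(julianDay - 2440588L)
  LocalDateTime.of(date, LocalTime.ofNanoOfDay(nanosOfDay))
}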

Solution

The semantics are determined by the metadata. We'll need some imports:

import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration

Example data:

val path = "/tmp/ts"

// toDF and the $"..." column syntax need the session implicits
// (imported automatically in spark-shell):
import spark.implicits._

Seq((1, "2017-03-06 10:00:00")).toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))
  .write.mode("overwrite").parquet(path)

and Hadoop configuration:

// reuse the active session's Hadoop configuration and filesystem
val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(conf)

Now we can read the Spark schema metadata from the file footer:

ParquetFileReader
  .readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
  .get(0)                    // footer of the single output file
  .getParquetMetadata
  .getFileMetaData
  .getKeyValueMetaData       // key/value pairs written by Spark
  .get("org.apache.spark.sql.parquet.row.metadata")

and the result is:

String = {"type":"struct","fields":[
  {"name":"id","type":"integer","nullable":false,"metadata":{}},
  {"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}

Equivalent information can be stored in the Metastore as well.

According to the official documentation, this is done to achieve compatibility with Hive and Impala:

Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.

and can be controlled using the spark.sql.parquet.int96AsTimestamp property.
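
For example (a minimal sketch; the flag defaults to true):

// tell Spark SQL to interpret INT96 parquet data as timestamps
spark.conf.set("spark.sql.parquet.int96AsTimestamp", "true")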
