spark 2.3.0, parquet 1.8.2 - statistics for a binary field don't exist in the resulting file from a spark write?
Question
On the Spark master branch, I tried to write a single column containing "a", "b", "c" to the Parquet file f1:
scala> List("a", "b", "c").toDF("field1").coalesce(1).write.parquet("f1")
But the saved file does not have statistics (min, max):
$ ls f1/*.parquet
f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
$ parquet-tool meta f1/*.parquet
file: file:/Users/stephen/p/spark/f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
creator: parquet-mr version 1.8.2 (build c6522788629e590a53eb79874b95f6c3ff11f16c)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"field1","type":"string","nullable":true,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
field1: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:3 TS:48 OFFSET:4
--------------------------------------------------------------------------------
field1: BINARY SNAPPY DO:0 FPO:4 SZ:50/48/0.96 VC:3 ENC:BIT_PACKED,RLE,PLAIN ST:[no stats for this column]
Any pointers would be appreciated. Thank you.
Recommended answer
After setting parquet.strings.signed-min-max.enabled to true in ShowMetaCommand.java, parquet-tools meta shows min and max. This suggests the statistics are written to the file, and that the reader skips them for binary columns unless the option is enabled.
@@ -57,8 +57,9 @@ public class ShowMetaCommand extends ArgsOnlyCommand {
String[] args = options.getArgs();
String input = args[0];
Configuration conf = new Configuration();
+ conf.set("parquet.strings.signed-min-max.enabled", "true");
Path inputPath = new Path(input);
FileStatus inputFileStatus = inputPath.getFileSystem(conf).getFileStatus(inputPath);
List<Footer> footers = ParquetFileReader.readFooters(conf, inputFileStatus, false);
Result:
row group 1: RC:3 TS:56 OFFSET:4
--------------------------------------------------------------------------------
field1: BINARY SNAPPY DO:0 FPO:4 SZ:56/56/1.00 VC:3 ENC:DELTA_BYTE_ARRAY -- ST:[min: a, max: c, num_nulls: 0]
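If rebuilding parquet-tools is inconvenient, the same check can probably be done from spark-shell. The following is only a sketch: it assumes the Parquet classes bundled with Spark honor parquet.strings.signed-min-max.enabled in the same way as the patched ShowMetaCommand, and it reuses the ParquetFileReader.readFooters call from the diff above. The glob f1/*.parquet is taken from the question; adjust it to your output directory.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

val conf = new Configuration()
// Same option that the patch sets inside parquet-tools.
conf.set("parquet.strings.signed-min-max.enabled", "true")

// Resolve the part files written by Spark (path taken from the question).
val pattern = new Path("f1/*.parquet")
val fs = pattern.getFileSystem(conf)

// Walk every footer, row group, and column chunk and print its statistics.
for {
  status <- fs.globStatus(pattern)
  footer <- ParquetFileReader.readFooters(conf, status, false).asScala
  block  <- footer.getParquetMetadata.getBlocks.asScala
  column <- block.getColumns.asScala
} println(s"${column.getPath}: ${column.getStatistics}")
```

If the option takes effect, the printed statistics for field1 should include min and max values; without the conf.set line, the reader is expected to report them as missing, as in the question.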