Why is dictionary page offset 0 for `plain_dictionary` encoding?

Views: 35 Published: 2022/5/11 21:52:44 Tags: parquet arrows pyarrow parquet-mr

Question

The Parquet file was generated by Spark v2.4 with parquet-mr v1.10:

import numpy as np
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Assumes an existing SparkSession named `spark`.
n = 10000
x = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, None] * n
y = [u'é', u'é', u'é', u'é', u'a', None, u'a'] * n

z = np.random.rand(len(x)).tolist()
schema = StructType([StructField('x', DoubleType(), True), StructField('y', StringType(), True), StructField('z', DoubleType(), False)])
dfs = spark.createDataFrame(list(zip(x, y, z)), schema=schema)
dfs.repartition(1).write.mode('overwrite').parquet('test_spark.parquet')

Inspected with parquet-tools v1.12:

row group 0 
--------------------------------------------------------------------------------
x:  DOUBLE SNAPPY DO:0 FPO:4 SZ:1632/31635/19.38 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000]
y:  BINARY SNAPPY DO:0 FPO:1636 SZ:864/16573/19.18 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000]
z:  DOUBLE SNAPPY DO:0 FPO:2500 SZ:560097/560067/1.00 VC:70000 ENC:PLAIN,BIT_PACKED ST:[min: 2.0828331581679294E-7, max: 0.9999892375625329, num_nulls: 0]

    x TV=70000 RL=0 DL=1 DS: 5 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000] SZ:31514 VC:70000

    y TV=70000 RL=0 DL=1 DS: 2 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000] SZ:16514 VC:70000

    z TV=70000 RL=0 DL=0
    ----------------------------------------------------------------------------
    page 0:                   DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[min: 2.0828331581679294E-7, max: 0.9999892375625329, num_nulls: 0] SZ:560000 VC:70000

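Questions:

Should FPO (first data page offset) always be greater or smaller than DO (dictionary page offset)? I read somewhere that the dictionary page is stored after the data pages.

For x & y, PLAIN_DICTIONARY is used for encoding. Why, then, is the dictionary offset 0 for both columns?

If I inspect the file with pyarrow v0.11.1 (parquet-cpp v1.5.1), it tells me has_dictionary_page: False & dictionary_page_offset: None.

Does the file have a dictionary page or not?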

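Recommended answer

The offset of the first data page is always larger than the offset of the dictionary. In other words, the dictionary comes first, and the data pages follow it. Two metadata fields are meant to store these offsets: dictionary_page_offset (aka DO) and data_page_offset (aka FPO). Unfortunately, parquet-mr does not fill in these metadata fields correctly.

For example, if the dictionary starts at offset 1000 and the first data page starts at offset 2000, then the correct values would be: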

dictionary_page_offset=1000
data_page_offset=2000

Instead, parquet-mr stores:

dictionary_page_offset=0
data_page_offset=1000

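Applied to your example, this means that even though parquet-tools shows DO: 0, columns x and y are nonetheless dictionary-encoded (column z is not).

It is worth mentioning that Impala follows the specification correctly, so you cannot rely on every file having this deficiency.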
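Because DO cannot be trusted on its own, one way to check whether a column chunk is dictionary-encoded is to look at its encoding list instead. The following is only a minimal sketch, not part of parquet-mr or pyarrow: the helper name is_dictionary_encoded is hypothetical, the ENC lists are copied by hand from the parquet-tools output in the question, and RLE_DICTIONARY is included on the assumption that chunks written with the newer dictionary encoding should count as well.

```python
# Encodings that imply the column chunk carries a dictionary page.
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


def is_dictionary_encoded(encodings):
    """Return True if any listed encoding is a dictionary encoding."""
    return any(enc in DICTIONARY_ENCODINGS for enc in encodings)


# ENC lists copied from the parquet-tools output in the question.
columns = {
    "x": ["RLE", "BIT_PACKED", "PLAIN_DICTIONARY"],
    "y": ["RLE", "BIT_PACKED", "PLAIN_DICTIONARY"],
    "z": ["PLAIN", "BIT_PACKED"],
}

for name, encodings in columns.items():
    print(name, is_dictionary_encoded(encodings))
```

This check works for both the parquet-mr convention (DO: 0) and spec-compliant files, since it does not read the offset fields at all.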

This is how parquet-mr handles the situation when reading:

// TODO: this should use getDictionaryPageOffset() but it isn't reliable.
if (f.getPos() != meta.getStartingPos()) {
  f.seek(meta.getStartingPos());
}

where getStartingPos is defined as:

/**
 * @return the offset of the first byte in the chunk
 */
public long getStartingPos() {
  long dictionaryPageOffset = getDictionaryPageOffset();
  long firstDataPageOffset = getFirstDataPageOffset();
  if (dictionaryPageOffset > 0 && dictionaryPageOffset < firstDataPageOffset) {
    // if there's a dictionary and it's before the first data page, start from there
    return dictionaryPageOffset;
  }
  return firstDataPageOffset;
}
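
For readers working from Python, the same decision can be sketched as below. This is a hand-written port of the logic quoted above, not an API of pyarrow or parquet-mr; the function name starting_pos is hypothetical, and treating a missing offset (None) as 0 is an assumption made here so that pyarrow-style metadata can be passed in directly.

```python
def starting_pos(dictionary_page_offset, data_page_offset):
    """Return the offset of the first byte of the column chunk."""
    # Assumption: a missing dictionary offset (None) means "not recorded".
    dict_offset = dictionary_page_offset or 0
    if 0 < dict_offset < data_page_offset:
        # A dictionary is recorded and precedes the first data page.
        return dict_offset
    return data_page_offset


print(starting_pos(1000, 2000))  # spec-compliant metadata -> 1000
print(starting_pos(0, 1000))     # parquet-mr style metadata -> 1000
print(starting_pos(0, 4))        # column x in the question (DO:0 FPO:4) -> 4
```

Both conventions resolve to the same starting byte, which is why a reader using this logic can find the dictionary even when DO is 0.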

You can see these lines of code in context at: ParquetFileReader.readDictionary, ColumnChunkMetaData.getStartingPos.
