Hive - Varchar vs String, is there any advantage if the storage format is Parquet file format?


Question

I have a HIVE table which will hold billions of records; it's time-series data, so the partition is per minute. Per minute we will have around 1 million records.

I have a few fields in my table: VIN number (17 chars), Status (2 chars), etc.

So my question is: during table creation, if I choose to use Varchar(X) vs String, is there any storage or performance problem?

A few limitations of varchar are listed at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-string

  1. If we provide more than "x" characters, it will silently truncate, so keeping it string would be future-proof.


  1. Non-generic UDFs cannot directly use varchar type as input arguments or return values. String UDFs can be created instead, and the varchar values will be converted to strings and passed to the UDF. To use varchar arguments directly or to return varchar values, create a GenericUDF.


There may be other contexts which do not support varchar, if they rely on reflection-based methods for retrieving type information. This includes some SerDe implementations.
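The first limitation above (silent truncation at "x" characters) can be sketched in plain Java. `enforceMaxLength` here is a hypothetical stand-in for the substring Hive applies when materializing a HiveVarchar, not Hive's actual API:

```java
// Minimal sketch (not Hive's implementation) of varchar(x) behavior:
// values longer than the declared length are cut down silently,
// with no error or warning raised.
public class VarcharTruncation {
    static String enforceMaxLength(String val, int maxLength) {
        if (val == null || val.length() <= maxLength) {
            return val;
        }
        return val.substring(0, maxLength);
    }

    public static void main(String[] args) {
        // An 18-char value stored into a varchar(17) column loses its tail.
        String vin = "1HGCM82633A004252X"; // 18 chars
        System.out.println(enforceMaxLength(vin, 17)); // prints 1HGCM82633A004252
    }
}
```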

What is the cost I have to pay for using string instead of varchar, in terms of storage and performance?

Answer

Let's try to understand how it is implemented in the API:

org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter 

Here is where the magic begins:

private DataWriter createWriter(ObjectInspector inspector, Type type) {
    // excerpt; surrounding cases elided
    switch (primitiveCategory) {
        case STRING:
            return new StringDataWriter((StringObjectInspector) inspector);
        case VARCHAR:
            return new VarcharDataWriter((HiveVarcharObjectInspector) inspector);
        // ...
    }
}

The createWriter method of the DataWritableWriter class checks the data type of the column, i.e. either varchar or string, and accordingly creates the writer class for that type.

Now let's move on to the VarcharDataWriter class.

private class VarcharDataWriter implements DataWriter {
    private HiveVarcharObjectInspector inspector;

    public VarcharDataWriter(HiveVarcharObjectInspector inspector) {
      this.inspector = inspector;
    }

    @Override
    public void write(Object value) {
      String v = inspector.getPrimitiveJavaObject(value).getValue();
      recordConsumer.addBinary(Binary.fromString(v));
    }
  }

or the StringDataWriter class:

private class StringDataWriter implements DataWriter {
    private StringObjectInspector inspector;

    public StringDataWriter(StringObjectInspector inspector) {
      this.inspector = inspector;
    }

    @Override
    public void write(Object value) {
      String v = inspector.getPrimitiveJavaObject(value);
      recordConsumer.addBinary(Binary.fromString(v));
    }
  }

The addBinary method in both classes adds the binary value of the UTF-8-encoded string. The only real difference between the two writers is how the Java String is unwrapped from the object inspector: the varchar writer first calls getValue() on the HiveVarchar, but both then pass the resulting String to Binary.fromString.

Short answer to the question: as the writer code above shows, string and varchar values go through the same UTF-8 encoding, so storage-wise there is little to no difference in the number of bytes stored. Performance-wise, as per my understanding, Hive is a schema-on-read tool; the ParquetRecordReader knows how to read a record and just reads bytes, so there won't be any performance difference due to the varchar or string data type.
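Since both writer classes funnel through Binary.fromString, the byte sequence that lands in the Parquet file is the same UTF-8 encoding either way. A stdlib-only sketch of that equivalence (Parquet's Binary.fromString also UTF-8-encodes, so no Parquet dependency is needed to make the point):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingCheck {
    public static void main(String[] args) {
        // The value a string column's writer hands to addBinary:
        String fromStringColumn = "WVWZZZ1JZ3W386752";
        // The value a varchar(17) column's writer hands over after
        // unwrapping the HiveVarchar via getValue():
        String fromVarcharColumn = "WVWZZZ1JZ3W386752";

        byte[] a = fromStringColumn.getBytes(StandardCharsets.UTF_8);
        byte[] b = fromVarcharColumn.getBytes(StandardCharsets.UTF_8);

        // Identical bytes -> identical Parquet storage for both types.
        System.out.println(Arrays.equals(a, b)); // prints true
    }
}
```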

