Hive: Create Table and Partition By


Problem Description

I have a table with data loaded as follows:

create table xyzlogTable (dateC string, hours string, minutes string, seconds string, TimeTaken string, Method string, UriQuery string, ProtocolStatus string)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties (
  "input.regex" = "(\\S+)\\t(\\d+):(\\d+):(\\d+)\\t(\\S+)\\t(\\S+)\\t(\\S+)\\t(\\S+)",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s"
)
stored as textfile;

load data local inpath '/home/hadoop/hive/xyxlogData/' into table xyzlogTable;

The total row count is found to be more than 3 million. Some queries work fine and some get into an infinite loop.

After seeing that select and group by queries were taking a long time and sometimes not even returning results, I decided to go for partitioning.

But both of the following statements fail:

create table xyzlogTable (datenonQuery string , hours string, minutes string, seconds string, TimeTaken string, Method string, UriQuery string, ProtocolStatus string) partitioned by (dateC string); 

FAILED: Error in metadata: AlreadyExistsException(message:Table xyzlogTable already exists)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Alter table xyzlogTable (datenonQuery string , hours string, minutes string, seconds string, TimeTaken string, Method string, UriQuery string, ProtocolStatus string) partitioned by (dateC string);

FAILED: Parse Error: line 1:12 cannot recognize input 'xyzlogTable' in alter table statement

Any idea what the problem is?

Solution

This is precisely why I prefer using external tables in Hive. The table you created is not external (you used create table instead of create external table). With non-external tables, dropping the table drops both the metadata (name, column names, types, etc.) and the table's data in HDFS. In contrast, when an external table is dropped, only the metadata is removed; the data in HDFS sticks around.
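As a minimal sketch of the external variant: the column list and serde mirror the create statement from the question, while the table name and the location path are hypothetical.

-- Dropping this table removes only the metadata; the files under the
-- (hypothetical) location below stay in HDFS.
create external table xyzlogTable_ext (dateC string, hours string, minutes string, seconds string, TimeTaken string, Method string, UriQuery string, ProtocolStatus string)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties (
  "input.regex" = "(\\S+)\\t(\\d+):(\\d+):(\\d+)\\t(\\S+)\\t(\\S+)\\t(\\S+)\\t(\\S+)",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s"
)
stored as textfile
location '/user/hadoop/xyzlogData';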

You have a few options going forward:

  1. If the cost of import is high and the data is not already partitioned: keep this table around, but create a new table, say xyzlogTable_partitioned, that will be a partitioned version of it. You can use dynamic partitioning in Hive to populate the new table (see the first sketch after this list).

  2. If the cost of import is high but the data is already partitioned, for example if you already have the data in a separate file per partition in HDFS: create a new partitioned table and have a bash script (or equivalent) move (or copy and later delete, if you are conservative) the files from the HDFS directory corresponding to the un-partitioned table into the directories corresponding to the appropriate partitions of the new table (see the alter table sketch after this list).

  3. If import is cheap: drop the entire table, re-create it as a partitioned table, and re-import. Often, when the import process is not aware of the partitioning scheme (in other words, when the import can't push data directly into the appropriate partitions), the common pattern is to use an unpartitioned table (like the one you already have) as a staging table, and then use a Hive query with dynamic partitioning to populate a new partitioned table, which is then used by subsequent queries in the workflow.
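For options 1 and 3, a dynamic-partitioning load could look roughly like the sketch below. This is a minimal sketch, not the asker's exact schema: xyzlogTable_partitioned is the hypothetical target table from option 1, dateC is moved out of the regular columns and into the partition spec, and the two set commands are the session settings Hive requires before a fully dynamic insert.

-- Hypothetical target table: same columns as xyzlogTable, with dateC
-- moved out of the column list and into the partition specification.
create table xyzlogTable_partitioned (hours string, minutes string, seconds string, TimeTaken string, Method string, UriQuery string, ProtocolStatus string)
partitioned by (dateC string)
stored as textfile;

-- Hive requires these before a fully dynamic insert.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column must be the last expression in the select list;
-- Hive routes each row to the partition named by its dateC value.
insert overwrite table xyzlogTable_partitioned partition (dateC)
select hours, minutes, seconds, TimeTaken, Method, UriQuery, ProtocolStatus, dateC
from xyzlogTable;

For option 2, once the per-partition files have been moved into place, each directory still has to be registered with the new table. The date value and path below are made up for illustration:

alter table xyzlogTable_partitioned add partition (dateC='2011-12-01')
location '/user/hive/warehouse/xyzlogtable_partitioned/datec=2011-12-01';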
