Hive External Table Skip First Row


Problem description


I am using Cloudera's version of Hive and trying to create an external table over a CSV file that contains the column names in the first row. Here is the code I am using:

CREATE EXTERNAL TABLE Test ( 
  RecordId int, 
  FirstName string, 
  LastName string 
) 
ROW FORMAT serde 'com.bizo.hive.serde.csv.CSVSerde' 
WITH SerDeProperties (  
  "separatorChar" = ","
) 
STORED AS TEXTFILE 
LOCATION '/user/File.csv'

Sample Data

RecordId,FirstName,LastName
1,"John","Doe"
2,"Jane","Doe"

Can anyone help me skip the first row, or do I need to add an intermediate step?

Solution

Header rows in data are a perpetual headache in Hive. Short of modifying the Hive source, I believe you can't get away without an intermediate step. (Edit: This is no longer true, see update below)

Unfortunately, that answers your question. I'll throw in some ideas for the intermediate step for completeness.

You can get away without an extra step in your data load if you are willing to filter out the header row in every query that touches the table. Unfortunately, this adds an extra step just about everywhere else, and you will have to get clever/messy when the header row violates your schema. If you go with this approach, you might consider writing a custom SerDe that makes this row easier to filter. Unfortunately, SerDes cannot remove the row entirely (or that might form a possible solution); they must return something like null. I've never seen this approach taken in practice to deal with header rows, since it makes reading a pain, and reading tends to be much more common than writing. It might have a place if you are dealing with one-off tables or if the header row is just one row among many malformed rows.
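As a rough illustration of the per-query filter described above, the query below runs against the Test table from the question. It is only a sketch: it assumes the SerDe surfaces the header line as an ordinary row whose FirstName field holds the literal text FirstName, which may not hold for every SerDe or schema.

```sql
-- Sketch only: filter the header row at query time.
-- Assumption: the header line arrives as a row with FirstName = 'FirstName'.
-- The IS NULL branch keeps data rows whose FirstName is missing, because a
-- plain <> comparison evaluates to NULL for them and would drop them.
SELECT RecordId, FirstName, LastName
FROM Test
WHERE FirstName IS NULL
   OR FirstName <> 'FirstName';
```

Every query that touches the table would need the same predicate, which is the cost the paragraph above points out.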

You could do this filtering once, with variations on deleting that first row during data load. A WHERE clause in an INSERT statement would do it, or you could use a utility like sed to get rid of it. I've seen both approaches taken. Each has trade-offs, and neither is the one true way to deal with header rows. Unfortunately, both approaches take time and require temporary duplication of the data. If you absolutely need the header row for another application, the duplication would be permanent.
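The INSERT variant might look something like the following. The table names test_staging and test_clean and the directory /user/staging are hypothetical, chosen for illustration; the staging table declares every column as a string so the header line can be compared as text.

```sql
-- Hypothetical staging table over the raw CSV; all columns are strings.
CREATE EXTERNAL TABLE test_staging (
  RecordId string,
  FirstName string,
  LastName string
)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
STORED AS TEXTFILE
LOCATION '/user/staging';  -- hypothetical directory containing the CSV file

-- Hypothetical target table with the intended column types.
CREATE TABLE test_clean (
  RecordId int,
  FirstName string,
  LastName string
);

-- Copy everything except the header line, casting the id on the way.
INSERT OVERWRITE TABLE test_clean
SELECT CAST(RecordId AS INT), FirstName, LastName
FROM test_staging
WHERE RecordId <> 'RecordId';
```

The data then exists twice, once under the staging location and once in test_clean, which is the duplication mentioned above.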

Update:

Starting with Hive v0.13.0, you can set the skip.header.line.count table property, which tells Hive how many leading lines of each file to skip. It can be specified when creating the table. For example:

create external table testtable (name string, message string)
row format delimited 
fields terminated by '\t' 
lines terminated by '\n' 
location '/testtable'
tblproperties ("skip.header.line.count"="1");
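For a table that already exists, the same property can presumably be applied after the fact. The sketch below reuses the testtable name from the example above and assumes a Hive version of 0.13.0 or later.

```sql
-- Sketch: set the header-skip property on an existing table.
ALTER TABLE testtable SET TBLPROPERTIES ("skip.header.line.count"="1");

-- Inspect the table properties to confirm the setting took effect.
SHOW TBLPROPERTIES testtable;
```

Because this only changes table metadata, the underlying files keep their header line, so other applications that read them are unaffected.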
