How to load multi-line column data into a Hive table? Columns containing newline characters

Question

I have a column (not the last column) in an Excel file that contains data spanning several lines.

Some cells in this column are blank, and some contain single-line entries.

When I save the sheet from Excel as a .CSV file or as a tab-separated .txt file, all of the multi-line data and a few of the single-line entries are written inside double quotes. None of the blank fields are quoted, and some of the single-line entries are not quoted either.

Is it possible to store the data with this same structure in a Hive table? If yes, how can this be done? I understand that I need to escape every LF inside double quotes and treat only the last LF as the actual EOL. But the moment a '\n' is encountered, Hive moves the data to a new row.

The data in Excel is laid out as follows:

|------+------+--------+------------------+-------+------|
|row1: | col1 | col2   | col3(multi-line) | col4  | col5 |
|------+------+--------+------------------+-------+------|
|      |      |        | line 1 of 3      |       |      |
|row2: | abc  | defsa  | line 2 of 3      | bcde  | hft  |
|      |      |        | line 3 of 3      |       |      |
|------+------+--------+------------------+-------+------|
|row3: | abc2 | defsa2 | (blank)          | bcde2 | hft2 |
|------+------+--------+------------------+-------+------|
|row4: | abc3 | defsa3 | single-line1     | bcde3 | hft3 |
|------+------+--------+------------------+-------+------|
|row5: | abc4 | defsa4 | single-line2     | bcde4 | hft4 |
|------+------+--------+------------------+-------+------|

When saved as CSV, the output is as follows:

row1--col1,col2,col3(multi-line),col4,col5
row2--abc,defsa,line 1 of 3",,,,,,
row3--line 2 of 3,,,,,,
row4--line 3 of 3,,,,,,
row5--",bcde,hft
row6--abc2,defsa2,,bcde2,hft2
row7--abc3,defsa3,single-line1,bcde3,hft3
row8--abc4,defsa4,single-line2",,,,,,
row9--",bcde4,hft4

The 5 rows in Excel become 9 rows in the CSV.

I would appreciate any input on how to load this .csv file into a Hive table, if possible without changing the structure and while preserving the multi-line column.

Recommended answer

According to this link, the provided SerDe cannot handle embedded newlines. My guess is that if you want embedded newlines, you will have to create a custom SerDe. Without looking too deeply into it, this is a good resource that might help in creating a custom SerDe.

Have you tried using Pig to process the data before loading it into Hive? For example, you could substitute the \n character with something else before moving the data to Hive. You might run into the same problem of not being able to load it into Pig accurately, though, since Pig is probably using the same SerDe.
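The same substitution idea can also be tried outside Pig. The following is only a minimal sketch in Python (not Pig, and not something the original answer proposed): it relies on the standard `csv` module to parse quoted fields with embedded newlines, replaces those newlines with a placeholder, and writes one tab-separated line per logical record. The function name `flatten_multiline_fields`, the `SAMPLE` data, and the choice of a space as placeholder are all assumptions made for illustration. The sample also assumes a well-formed export in which each multi-line cell is fully enclosed in double quotes, and that no field contains a tab.

```python
import csv
import io

# Hypothetical sample: a well-formed CSV export in which the multi-line
# cell is wrapped in double quotes (an assumption, not the asker's file).
SAMPLE = (
    "col1,col2,col3,col4,col5\n"
    'abc,defsa,"line 1 of 3\nline 2 of 3\nline 3 of 3",bcde,hft\n'
    "abc2,defsa2,,bcde2,hft2\n"
    "abc3,defsa3,single-line1,bcde3,hft3\n"
    'abc4,defsa4,"single-line2",bcde4,hft4\n'
)


def flatten_multiline_fields(src, dst, placeholder=" ", out_delimiter="\t"):
    """Write one line per logical CSV record, with embedded line breaks
    in each field replaced by `placeholder`. Returns the record count."""
    # csv.reader keeps reading physical lines until a quoted field closes,
    # so a multi-line cell comes back as a single field.
    reader = csv.reader(src)
    count = 0
    for record in reader:
        cleaned = [
            field.replace("\r\n", placeholder)
                 .replace("\n", placeholder)
                 .replace("\r", placeholder)
            for field in record
        ]
        # Assumes no field contains the output delimiter itself.
        dst.write(out_delimiter.join(cleaned) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    out = io.StringIO()
    n = flatten_multiline_fields(io.StringIO(SAMPLE), out)
    print(n, "records written")
    print(out.getvalue(), end="")
```

With real files, the input would typically be opened with `open(path, newline="")` so that the `csv` module sees the line breaks unmodified. The flattened output could then be loaded into a table declared with `ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'`. Note that this changes the stored value: the line breaks are replaced, not preserved, so a distinctive placeholder would be needed if the original lines have to be restored later.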

Ultimately, a custom SerDe will solve your problem, but there might be another, easier way that I'm not seeing.
