How to load multi-line column data into a Hive table? Columns containing newline characters

Question

I have a column (not the last one) in an Excel file whose cells contain data spanning several lines.

Some cells in this column are blank and some contain single-line entries.

When I save the file from Excel as a .CSV file or a tab-separated .txt file, all of the multi-line entries and a few of the single-line entries are wrapped in double quotes. None of the blank fields are quoted, and some of the single-line entries are not quoted either.

Is it possible to store the data with this same structure in a Hive table? If yes, how can this be done? I understand that I need to escape every LF inside double quotes and treat only the final LF as the actual EOL. But the moment a '\n' is encountered, Hive starts a new row.

The data in Excel is laid out as follows:

|------+------+--------+------------------+-------+------|
|row1: | col1 | col2   | col3(multi-line) | col4  | col5 |
|------+------+--------+------------------+-------+------|
|      |      |        | line 1 of 3      |       |      |
|row2: | abc  | defsa  | line 2 of 3      | bcde  | hft  |
|      |      |        | line 3 of 3      |       |      |
|------+------+--------+------------------+-------+------|
|row3: | abc2 | defsa2 | (blank)          | bcde2 | hft2 |
|------+------+--------+------------------+-------+------|
|row4: | abc3 | defsa3 | single-line1     | bcde3 | hft3 |
|------+------+--------+------------------+-------+------|
|row5: | abc4 | defsa4 | single-line2     | bcde4 | hft4 |
|------+------+--------+------------------+-------+------|

When saved as CSV, it produces the following output:

row1--col1,col2,col3(multi-line),col4,col5
row2--abc,defsa,line 1 of 3",,,,,,
row3--line 2 of 3,,,,,,
row4--line 3 of 3,,,,,,
row5--",bcde,hft
row6--abc2,defsa2,,bcde2,hft2
row7--abc3,defsa3,single-line1,bcde3,hft3
row8--abc4,defsa4,single-line2",,,,,,
row9--",bcde4,hft4

Five rows in Excel become nine rows in the CSV.

I would appreciate any input on how to load this .csv file into a Hive table, if possible without changing the structure and while keeping the multi-line column.

Recommended answer

According to this link, the provided SerDe cannot handle embedded newlines. My guess is that if you want embedded newlines, you will have to create a custom SerDe. Without looking too deeply into it, this is a good resource that might help in creating a custom SerDe.

Have you tried using Pig to process the data before loading it into Hive? For example, you could substitute the \n character with something else before moving the data to Hive. However, you might run into the same problem of not being able to load it into Pig accurately, since Pig is probably using the same SerDe.
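
The substitution step does not have to happen in Pig. As a rough sketch that is not part of the original answer, the same idea can be tried with Python's standard `csv` module, which treats a line feed inside double quotes as part of the field rather than as the end of the record. This assumes the export uses ordinary double-quote CSV quoting; the function name `flatten_embedded_newlines`, the `" | "` placeholder, and the sample data are all hypothetical and chosen only for illustration.

```python
import csv
import io

# Assumption: " | " never occurs in the real data. Pick any marker that is
# safe for your data so the line breaks can be restored later if needed.
PLACEHOLDER = " | "


def flatten_embedded_newlines(src, dst, placeholder=PLACEHOLDER):
    """Copy CSV records from src to dst, replacing line breaks inside fields.

    src and dst are text file objects; open real files with newline=""
    so the csv module sees the line breaks unmodified.
    Returns the number of logical records written.
    """
    reader = csv.reader(src)                       # quote-aware parsing
    writer = csv.writer(dst, lineterminator="\n")  # one record per line
    count = 0
    for record in reader:
        cleaned = [
            field.replace("\r\n", placeholder)
                 .replace("\n", placeholder)
                 .replace("\r", placeholder)
            for field in record
        ]
        writer.writerow(cleaned)
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical sample modelled on the table in the question.
    sample = (
        "col1,col2,col3,col4,col5\n"
        'abc,defsa,"line 1 of 3\nline 2 of 3\nline 3 of 3",bcde,hft\n'
        "abc2,defsa2,,bcde2,hft2\n"
    )
    out = io.StringIO()
    records = flatten_embedded_newlines(io.StringIO(sample, newline=""), out)
    print(records)
    print(out.getvalue(), end="")
```

After this step every record occupies exactly one physical line, so a line-oriented text table should no longer split the multi-line cells across rows. Fields that contain commas or quotes are still written with double quotes, so the table would need a quote-aware SerDe, or the script could be changed to emit a delimiter that never appears in the data.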

Ultimately, a custom SerDe will solve your problem, but there might be an easier way that I'm not seeing.
