Hive table from CSV: line terminators inside quotes


Question

I am trying to create a table from a CSV file stored in HDFS. The problem is that the CSV contains line breaks inside quoted fields. Example record from the CSV:

ID,PR_ID,SUMMARY
2063,1184,"This is problem field because consists line break

This is not new record but it is part of text of third column
"

I created a Hive table:

CREATE TEMPORARY EXTERNAL TABLE  hive_database.hive_table
(   
    ID STRING,
    PR_ID STRING,
    SUMMARY STRING 
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar"     = "\"",
    "escapeChar"    = "\""
)
stored as textfile
LOCATION '/path/to/hdfs/dir/csv'
tblproperties('skip.header.line.count'='1');

Then I count the rows (the correct result should be 1):

Select count(*) from hive_database.hive_table;

But the result is 4, which is incorrect. Do you have any idea how to solve this? Thanks, all.
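
The count of 4 can be reproduced outside Hive. The following is a minimal Python sketch, not part of the original question, that assumes the table is read one physical line at a time: after the header is skipped, the sample record spans four lines, while a quote-aware parser sees a single record. The names `SAMPLE`, `count_physical_lines`, and `count_csv_records` are made up for this illustration.

```python
import csv
import io

# The sample file from the question, including the header row.
SAMPLE = (
    "ID,PR_ID,SUMMARY\n"
    '2063,1184,"This is problem field because consists line break\n'
    "\n"
    "This is not new record but it is part of text of third column\n"
    '"\n'
)


def count_physical_lines(text: str) -> int:
    """Count lines after the header, as a line-oriented reader sees them."""
    return len(text.splitlines()) - 1


def count_csv_records(text: str) -> int:
    """Count records after the header using a quote-aware CSV parser."""
    reader = csv.reader(io.StringIO(text, newline=""))
    next(reader)  # skip the header row
    return sum(1 for _ in reader)


print(count_physical_lines(SAMPLE))  # 4, matching the count(*) result above
print(count_csv_records(SAMPLE))     # 1, the expected number of records
```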

Recommended answer

There is currently no way to handle multiline CSV in Hive directly. However, there are some workarounds:

  • Produce a CSV in which the line breaks inside quoted fields are replaced with your own newline marker, such as <r>. You will then be able to load it in Hive. Afterwards, transform the resulting text by replacing the marker with the original line break.

  • Use Spark, which has a multiline CSV reader. This works out of the box, although the CSV is then not read in a distributed way.

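// Read the CSV with multi-line support, then save it as an ORC table.
// saveAsTable returns Unit, so the result is not assigned to a val.
spark.read
  .option("wholeFile", true)  // older name for the same setting, kept from the original answer
  .option("multiline", true)  // lets a quoted field span several lines
  .option("header", true)
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .csv("test.csv")
  .write.format("orc")
  .saveAsTable("myschma.myTable")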

  • Use another format, such as Parquet, Avro, ORC, or sequence files, instead of CSV. For example, you could use Sqoop to produce them from a JDBC database, or you could write your own program in Java or Python.
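
The marker-replacement workaround could be sketched in Python roughly as follows. This is an illustration under stated assumptions rather than the answer author's code: it assumes the file fits in memory, that the marker `<r>` never occurs in the real data, and the function names `flatten_newlines` and `restore_newlines` are hypothetical.

```python
import csv
import io

MARKER = "<r>"  # marker suggested in the answer; must not occur in the data


def flatten_newlines(text: str, marker: str = MARKER) -> str:
    """Rewrite CSV text so that every record occupies exactly one line."""
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Replace every kind of embedded line break with the marker.
        writer.writerow(
            [
                field.replace("\r\n", marker).replace("\n", marker).replace("\r", marker)
                for field in row
            ]
        )
    return out.getvalue()


def restore_newlines(value: str, marker: str = MARKER) -> str:
    """Turn the marker back into a line break after loading."""
    return value.replace(marker, "\n")


raw = 'ID,PR_ID,SUMMARY\n2063,1184,"line one\n\nline two\n"\n'
flat = flatten_newlines(raw)
print(flat)
```

After such preprocessing each record sits on a single line, so a line-oriented reader counts it once. The reverse substitution can also be done in a query after loading, for instance with a string-replacement function such as Hive's `regexp_replace`.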
