How to handle embedded line breaks in AWS Athena


Question

I have created a table in AWS Athena like this:

CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
  col1 string, 
  col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
 'separatorChar' = ',',
 'quoteChar' = '\"',
 'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://bucket/test/'

In the bucket I put a simple CSV file with the following content:

rec1 col1,rec2 col2
rec2 col1,"rec2, col2"
rec3 col1,"rec3
col2"

When I run the data preview query SELECT * FROM "default"."test_line_breaks" limit 10; Athena returns the following response:

How should I set ROW FORMAT to properly handle line breaks within field values, so that rec3\ncol2 appears in col2?

Recommended answer

The problem here is that the OpenCSV Serializer-Deserializer (OpenCSVSerde) does not support embedded line breaks in CSV files.

See the AWS documentation on this SerDe.
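To illustrate why a line-oriented reader struggles with this file, here is a minimal Python sketch. It is not part of the original answer, and Python's csv module is used only as a stand-in for a quote-aware parser, not as a model of what Athena does internally. It compares splitting the sample data on newlines with parsing it as quoted CSV.

```python
import csv
import io

# Sample file content from the question; the third record has a
# line break inside a quoted field.
data = 'rec1 col1,rec2 col2\nrec2 col1,"rec2, col2"\nrec3 col1,"rec3\ncol2"\n'

# Splitting on newlines, as a line-oriented text reader would,
# cuts the third record in two.
physical_lines = data.splitlines()

# A quote-aware CSV parser keeps the embedded newline inside the field.
records = list(csv.reader(io.StringIO(data, newline='')))

print(len(physical_lines))  # count of physical lines
print(len(records))         # count of logical CSV records
print(records[-1])          # last record, with the line break preserved
```

The mismatch between the two counts is the symptom described in the question: the quoted field is split before any CSV-level parsing can see the closing quote.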

However, it might be possible to use RegexSerDe. Keep in mind that this deserializer takes Java-flavored regular expressions, so use an online regex tool that supports that syntax when debugging.

I am still working on the syntax for dealing with the embedded line feed \n. In the meantime, here is a sample that handles two columns with optional quotes. The regex "*([^"]*)"*,"*([^"]*)"* matched your line with the embedded line break when I tested it. However, I think the Presto engine is only feeding the SerDe rec3 col1,"rec3, that is, the text up to the line break. I will continue working on it.

CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
  col1 string, 
  col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '"*([^"]*)"*,"*([^"]*)"*'
)
STORED AS TEXTFILE
LOCATION 's3://.../47936191';
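The behaviour suspected above can be reproduced outside Athena. The following Python sketch is an illustration only and rests on two assumptions: that the SerDe requires the whole row to match the pattern (modelled here with re.fullmatch), and that Python's re module treats this simple pattern the same way a Java regex engine would. The helper name parse is hypothetical.

```python
import re

# Same pattern as the "input.regex" property in the table definition.
pattern = re.compile(r'"*([^"]*)"*,"*([^"]*)"*')


def parse(row):
    """Return the captured columns if the whole row matches, else None."""
    m = pattern.fullmatch(row)
    return m.groups() if m else None


# The third record from the question as one logical row.
whole_record = 'rec3 col1,"rec3\ncol2"'

# The complete record matches, with the line break kept in the second column.
print(parse(whole_record))

# Fed one physical line at a time, the record falls apart: the first
# half yields a truncated second column and the second half does not match.
for line in whole_record.split('\n'):
    print(repr(line), '->', parse(line))
```

If the input is split on newlines before the regex is applied, no change to the pattern can recover the second half of the field, which is consistent with the answer's observation that only rec3 col1,"rec3 appears to reach the SerDe.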
