How to handle embedded line breaks in AWS Athena
Question
I have created a table in AWS Athena like this:
CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
col1 string,
col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://bucket/test/'
In the bucket I put a simple CSV file with the following content:
rec1 col1,rec2 col2
rec2 col1,"rec2, col2"
rec3 col1,"rec3
col2"
When I run the data preview query SELECT * FROM "default"."test_line_breaks" limit 10; Athena returns the following response (screenshot of the result not preserved):
How should I set ROW FORMAT to properly handle line breaks within field values, so that rec3\ncol2 appears in col2?
Recommended answer
The problem here is that the OpenCSV Serializer-Deserializer (OpenCSVSerde) does not support embedded line breaks in CSV files.
See the AWS documentation on this.
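To make the mismatch concrete, here is a minimal Python sketch. It uses Python's csv module only as a stand-in for a quote-aware CSV parser; it is not how Athena or OpenCSVSerde is implemented. The sample text is the file from the question, and the variable names are hypothetical.

```python
import csv
import io

# The CSV file from the question, with a line feed inside the last quoted field.
raw = 'rec1 col1,rec2 col2\nrec2 col1,"rec2, col2"\nrec3 col1,"rec3\ncol2"\n'

# A line-oriented reader sees four physical lines.
physical_lines = raw.splitlines()

# A quote-aware parser sees three logical records.
records = list(csv.reader(io.StringIO(raw, newline='')))

print(len(physical_lines))  # 4
print(len(records))         # 3
print(records[2])           # ['rec3 col1', 'rec3\ncol2']
```

The question is asking for the second behavior, while a reader that splits on line feeds before parsing behaves like the first.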
However, it might be possible to use RegexSerDe. Keep in mind that this deserializer expects a Java-flavored regular expression, so when debugging, use an online regex tool that supports that syntax.
I am still working on the syntax for dealing with the embedded line feed (\n). In the meantime, here is a sample that handles two columns with optional quotes. The regex "*([^"]*)"*,"*([^"]*)"* matched your line with the embedded line break when I tested it on its own. However, I think the Presto engine is only feeding it rec3 col1,"rec3 for that record. I will keep working on it.
CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
col1 string,
col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '"*([^"]*)"*,"*([^"]*)"*'
)
STORED AS TEXTFILE
LOCATION 's3://.../47936191';
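The regex can be checked outside Athena. Below is a rough Python sketch, not a definitive test: Python's re module stands in for java.util.regex (the pattern uses only syntax the two share), and it assumes that the SerDe must match each input line in full and that the input is split on line feeds before the SerDe sees it. The parse helper is hypothetical.

```python
import re

# The pattern from the table definition above.
PATTERN = re.compile(r'"*([^"]*)"*,"*([^"]*)"*')

def parse(record):
    """Return (col1, col2) if the whole record matches, else None."""
    m = PATTERN.fullmatch(record)
    return m.groups() if m else None

raw = 'rec1 col1,rec2 col2\nrec2 col1,"rec2, col2"\nrec3 col1,"rec3\ncol2"\n'

# One physical line at a time, as a line-oriented reader would supply them.
per_line = [parse(line) for line in raw.splitlines()]

# The third record taken as a whole, embedded line feed included.
whole = parse('rec3 col1,"rec3\ncol2"')

print(per_line)
print(whole)  # ('rec3 col1', 'rec3\ncol2')
```

Under these assumptions the pattern copes with the embedded line feed when it receives the whole record, but the per-line run yields ('rec3 col1', 'rec3') for the third line and no match for the fourth. That is consistent with the suspicion that only rec3 col1,"rec3 reaches the SerDe.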