AWS 雅典娜.不能使用 CSV 清单作为位置 [英] AWS Athena. Can't use CSV manifest as a location
问题描述
我正在尝试在 Athena 中创建一个外部表,问题是 s3 存储桶在同一文件夹中具有不同的文件,因此我无法将该文件夹用作位置.
I'm trying to create an external table in Athena, the problem is that the s3 bucket has different files in the same folder so I can't use the folder as a location.
我无法修改 s3 文件的路径,但我有一个 CSV 清单,我试图将其用作位置,但 Athena 不允许我这样做.
I can't modify the path of the s3 files but I have a CSV manifest, I was trying to use it as a location but Athena didn't allow me to do that.
CREATE EXTERNAL TABLE `my_DB`.`my_external_table`(
column1 string,
column2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://mys3bucket/tables/my_table.csvmanifest'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
有什么想法可以使用我的清单吗?或者没有雅典娜解决这个问题的另一种方法?使用 Athena 的目的是避免从 CSV 中获取所有数据,因为我只需要很少的记录
Any ideas to use my manifest? or another way to solve this without Athena? The goal of using Athena was to avoid getting all the data from the CSVs since I only need few records
推荐答案
您需要对 CREATE TABLE
语句进行一些更改:
You'll need to make a couple changes to your CREATE TABLE
statement:
- 使用
'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
作为你的INPUTFORMAT
- 确保您使用
LOCATION
语句指向一个文件夹
- use
'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
as yourINPUTFORMAT
- Ensure you're pointing to a folder with your
LOCATION
statement
所以你的语句看起来像:
So your statement would look like:
CREATE EXTERNAL TABLE `my_DB`.`my_external_table`(
column1 string,
column2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://mys3bucket/tables/my_table/'
并且 s3://mys3bucket/tables/my_table/
将在其中包含一个文件,其中包含您要查询的 CSV 文件的 S3 路径 - 每行一个路径.我不确定 skip.header.line.count
设置是否适用于清单文件本身或 CSV 文件,因此您必须进行测试.
And s3://mys3bucket/tables/my_table/
would have a single file in it with the S3 paths of the CSV files you want query - one path per line. I'm unsure if the skip.header.line.count
setting will operate on the manifest file itself or the CSV files so you will have to test.
或者,如果您的文件数量有限,您可以使用 S3 选择 以查询这些文件中的特定列,一次一个.使用 AWS CLI,提取第二列的命令类似于:
Alternatively, if you have a limited number of files, you could use S3 Select to query for specific columns in those files, one at a time. Using the AWS CLI, the command to extract the 2nd column would look something like:
aws s3api select-object-content \
--bucket mys3bucket \
--key path/to/your.csv.gz \
--expression "select _2 from s3object limit 100" \
--expression-type SQL \
--input-serialization '{"CSV": {}, "CompressionType": "GZIP"}' \
--output-serialization '{"CSV":{}}' \
sample.csv
(免责声明:AWS 员工)
这篇关于AWS 雅典娜.不能使用 CSV 清单作为位置的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!