AWS Athena. Can't use CSV manifest as a location


Question


I'm trying to create an external table in Athena, the problem is that the s3 bucket has different files in the same folder so I can't use the folder as a location.


I can't modify the path of the s3 files but I have a CSV manifest, I was trying to use it as a location but Athena didn't allow me to do that.

CREATE EXTERNAL TABLE `my_DB`.`my_external_table`(
    column1 string,
    column2 string
  )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar' = '\"',
   'escapeChar' = '\\'
   )
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://mys3bucket/tables/my_table.csvmanifest'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'skip.header.line.count'='1')


Any ideas on how to use my manifest? Or another way to solve this without Athena? The goal of using Athena was to avoid reading all the data from the CSVs, since I only need a few records.

Answer


You'll need to make a couple of changes to your CREATE TABLE statement:

  1. Use 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' as your INPUTFORMAT
  2. Ensure your LOCATION statement points to a folder


So your statement would look like:

CREATE EXTERNAL TABLE `my_DB`.`my_external_table`(
    column1 string,
    column2 string
  )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar' = '\"',
   'escapeChar' = '\\'
   )
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://mys3bucket/tables/my_table/'


And s3://mys3bucket/tables/my_table/ would contain a single file holding the S3 paths of the CSV files you want to query - one path per line. I'm unsure whether the skip.header.line.count setting will operate on the manifest file itself or on the CSV files, so you will have to test.
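As a minimal sketch of that manifest (the data paths below are assumptions, not from the original post - substitute your real object keys), you could build it locally and upload it as the only object under the table's LOCATION folder:

```shell
# Hypothetical example: one S3 path per line, no header, no other content.
cat > my_table.csvmanifest <<'EOF'
s3://mys3bucket/data/shared-folder/records1.csv
s3://mys3bucket/data/shared-folder/records2.csv
EOF

# Upload it as the sole object under the LOCATION folder (requires AWS CLI):
# aws s3 cp my_table.csvmanifest s3://mys3bucket/tables/my_table/
```

With SymlinkTextInputFormat, Athena reads the file(s) in the LOCATION folder as lists of paths and then queries the objects those paths point to, which is what lets you skip the unrelated files sitting in the same source folder.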


Alternatively, if you have a limited number of files, you could use S3 Select to query for specific columns in those files, one at a time. Using the AWS CLI, the command to extract the 2nd column would look something like:

aws s3api select-object-content \
  --bucket mys3bucket \
  --key path/to/your.csv.gz \
  --expression "select _2 from s3object limit 100" \
  --expression-type SQL \
  --input-serialization '{"CSV": {}, "CompressionType": "GZIP"}' \
  --output-serialization '{"CSV":{}}' \
  sample.csv
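Since S3 Select runs against one object at a time, a small loop can generate one such command per file. A minimal sketch (the object keys and output names are hypothetical) that writes the commands to a script for review before running:

```shell
# Hypothetical sketch: emit one S3 Select command per CSV object, so the
# same query can be run over each file in turn. Keys are assumptions.
for key in path/to/a.csv.gz path/to/b.csv.gz; do
  out="out-$(basename "$key" .gz)"   # e.g. out-a.csv
  echo "aws s3api select-object-content --bucket mys3bucket --key $key" \
       "--expression 'select _2 from s3object limit 100' --expression-type SQL" \
       "--input-serialization '{\"CSV\": {}, \"CompressionType\": \"GZIP\"}'" \
       "--output-serialization '{\"CSV\":{}}' $out"
done > select_commands.sh
# Inspect select_commands.sh, then run it with: sh select_commands.sh
```

Generating the commands first keeps the AWS calls out of the loop itself, so you can sanity-check the keys and output filenames before anything touches S3.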

(Disclaimer: I work for AWS)
