AWS Athena:命名为boto3的查询未创建对应的表 [英] AWS Athena: Named boto3 queries not creating corresponding tables

查看:45
本文介绍了AWS Athena:命名为boto3的查询未创建对应的表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下 boto3 脚本草稿

 <代码>#!/usr/bin/env python3导入boto3客户端= boto3.client('athena')BUCKETS ='buckets.txt'DATABASE ='some_db'QUERY_STR ="如果不存在some_db,则创建外部表.{}(BucketOwner STRING,铲斗STRING,RequestDateTime STRING,RemoteIP STRING,请求者STRING,RequestID STRING,操作STRING,关键STRING,RequestURI_operation STRING,RequestURI_key STRING,RequestURI_httpProtoversion STRING,HTTPstatus STRING,错误代码STRING,发送的字节数BIGINT,ObjectSize BIGINT,TotalTime STRING,TurnAroundTime STRING,推荐连结STRING,UserAgent STRING,VersionId STRING,HostId STRING,SigV STRING,CipherSuite STRING,AuthType STRING,端点STRING,TLS版本STRING)行格式SERDE'org.apache.hadoop.hive.serde2.RegexSerDe'带有SERDEPROPERTIES('serialization.format'='1','input.regex'='([^] *)([^] *)\\ [(.*?)\\]([^] *)([^]*)([^] *)([^] *)([^] *)\\\([^] *)([^] *)(-| [^] *)\\\"(-| [0-9] *)([^] *)([^] *)([^] *)([^] *)([^] *)([^] *)(\"[^ \] * \")([^] *)(?:([^] *)([^] *)([^] *)([^] *)([^] *)([^] *))?.* $')位置's3://my-bucket/{}'"将open(BUCKETS,'r')设为f:行= f.readlines()对于行中的行:query_string = QUERY_STR.format(行,行)响应= client.create_named_query(名称=线数据库=数据库,QueryString = QUERY_STR)打印(响应) 

执行后,所有响应均返回状态码 200 .

为什么我看不到应该创建的对应表?

我应该(至少)不能看到那些存储在哪里的查询吗?

update1 :我现在正尝试通过上述查询实际创建表,如下所示:

 用于以下行:query_string = QUERY_STR.format(数据库,行[:-1].替换('-','_'),行[:-1])尝试:response1 = client.start_query_execution(QueryString = query_string,工作组= WORKGROUP,QueryExecutionContext = {数据库":DATABASE},ResultConfiguration = {'OutputLocation':OUTPUT_BUCKET,},)query_execution_id = response1 ['ResponseMetadata'] ['RequestId']打印(query_execution_id)除了e1例外:打印(query_string)提高(e1) 

再次,该脚本确实输出了一些查询ID(似乎没有错误),尽管如此,仍未创建表.

我还遵循了@John Rotenstein的建议,并如下初始化了我的 boto3 客户端:

  client = boto3.client('athena',region_name ='us-east-1') 

解决方案

首先,响应只是告诉您您的请求已成功提交.方法 create_named_query()创建查询的摘要,然后可以在AWS Athena控制台的 Saved Queries 选项卡中查看/访问该摘要.

在我看来,您想使用 boto3 创建表.在这种情况下,您需要使用

我得到了响应200,并在 response1 ['ResponseMetadata'] ['RequestId'] 中获得了一些价值.但是,由于我在AWS Glue目录中没有 some_db ,因此该查询在运行时失败,因此未创建任何表.

这是在boto3中跟踪查询执行的方式

 导入时间response1 = client.start_query_execution(QueryString = query_string,工作组= WORKGROUP,QueryExecutionContext = {数据库":DATABASE},ResultConfiguration = {'OutputLocation':OUTPUT_BUCKET,},)query_execution_id = response1 ['ResponseMetadata'] ['RequestId']而True:time.sleep(1)response_2 = client.get_query_execution(QueryExecutionId = query_execution_id)query_status = response_2 ['QueryExecution'] ['Status']打印(query_status)如果query_status不在["QUEUED","RUNNING","CANCELLED"]中:休息 

I have the following boto3 draft script

#!/usr/bin/env python3
import boto3

client = boto3.client('athena')

BUCKETS='buckets.txt'
DATABASE='some_db'
QUERY_STR="""CREATE EXTERNAL TABLE IF NOT EXISTS some_db.{}(
         BucketOwner STRING,
         Bucket STRING,
         RequestDateTime STRING,
         RemoteIP STRING,
         Requester STRING,
         RequestID STRING,
         Operation STRING,
         Key STRING,
         RequestURI_operation STRING,
         RequestURI_key STRING,
         RequestURI_httpProtoversion STRING,
         HTTPstatus STRING,
         ErrorCode STRING,
         BytesSent BIGINT,
         ObjectSize BIGINT,
         TotalTime STRING,
         TurnAroundTime STRING,
         Referrer STRING,
         UserAgent STRING,
         VersionId STRING,
         HostId STRING,
         SigV STRING,
         CipherSuite STRING,
         AuthType STRING,
         EndPoint STRING,
         TLSVersion STRING
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
         'serialization.format' = '1', 'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\") ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$' )
LOCATION 's3://my-bucket/{}'"""

with open(BUCKETS, 'r') as f:
    lines = f.readlines()


for line in lines:
    query_string = QUERY_STR.format(line, line)
    response = client.create_named_query(
        Name=line,
        Database=DATABASE,
        QueryString=QUERY_STR
    )
    print(response)

When executed, all responses come back with status code 200.

Why am I not able to see the corresponding tables that should have been created?

Shouldn't I be able to (at least) see somewhere those queries stored?

update1: I am now trying to actually create the tables via the above queries as follows:

for line in lines:
    query_string = QUERY_STR.format(DATABASE, line[:-1].replace('-', '_'), line[:-1])
    try:
        response1 = client.start_query_execution(
            QueryString=query_string,
            WorkGroup=WORKGROUP,
            QueryExecutionContext={
                'Database': DATABASE
            },
            ResultConfiguration={
                'OutputLocation': OUTPUT_BUCKET,
            },
        )
        query_execution_id = response1['ResponseMetadata']['RequestId']
        print(query_execution_id)
    except Exception as e1:
        print(query_string)
        raise(e1)

Once again, the script does output some query ids (no error seems to take place), nonetheless no table is created.

I have also followed the advice of @John Rotenstein and initialised my boto3 client as follows:

client = boto3.client('athena', region_name='us-east-1')

解决方案

First of all, response simply tells you that your request has been successfully submitted. Method create_named_query() creates a snippet of your query, which then can be seen/access in AWS Athena console in Saved Queries tab.

It seems to me that you want to create table using boto3. If that is the case, you need to use start_query_execution() method.

Runs the SQL query statements contained in the Query . Requires you to have access to the workgroup in which the query ran.

Having response 200 out of start_query_execution doesn't guarantee that you query will get executed successfully. As I understand, this method does some simple pre-execution checks to validate syntax of the query. However, there are other things that could fail you query at the run time. For example if you try to create table in a database that doesn't exist, or if you try to create a table definition in a database to which you don't have access.

Here is an example, when I used you query string, formatted with with some random name for the table.

I got response 200 and got some value in response1['ResponseMetadata']['RequestId']. However, since I don't have some_db in AWS Glue catalog, this query failed at the run time, thus, no table was created.

Here is how you can track query execution within boto3

import time

response1 = client.start_query_execution(
    QueryString=query_string,
    WorkGroup=WORKGROUP,
    QueryExecutionContext={
        'Database': DATABASE
    },
    ResultConfiguration={
        'OutputLocation': OUTPUT_BUCKET,
    },
)
query_execution_id = response1['ResponseMetadata']['RequestId']

while True:
    time.sleep(1)
    response_2 = client.get_query_execution(
        QueryExecutionId=query_execution_id
    )
    query_status = response_2['QueryExecution']['Status']
    print(query_status)
    if query_status not in ["QUEUED", "RUNNING", "CANCELLED"]:
        break

这篇关于AWS Athena:命名为boto3的查询未创建对应的表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆