Get non-ASCII filename from S3 notification event in Lambda

Problem description

The key field in an AWS S3 notification event, which denotes the filename, is URL escaped.

This is evident when the filename contains spaces or non-ASCII characters.

For example, I uploaded a file with the following name to S3:

my file řěąλλυ.txt

The notification is received as:

{
  "Records": [
    {
      "s3": {
        "object": {
          "key": u"my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"
        }
      }
    }
  ]
}

I've tried to decode using:

key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf-8')

but that yields:

my file ÅÄÄλλÏ.txt

Of course, when I then try to get the file from S3 using Boto, I get a 404 error.

Solution

tl;dr

You need to encode the URL-encoded Unicode string to a byte str before URL-unquoting it, and only then decode the result as UTF-8.

For example, for an S3 object with the filename my file řěąλλυ.txt:

>>> import urllib  # Python 2
>>> utf8_urlencoded_key = event['Records'][0]['s3']['object']['key'].encode('utf-8')
# Encodes the Unicode string to a UTF-8 encoded [byte] str. The key shouldn't contain any non-ASCII characters at this point, but UTF-8 is the safer choice.
'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'

>>> key_utf8 = urllib.unquote_plus(utf8_urlencoded_key)
# The URL-escaped UTF-8 sequences are now converted to raw UTF-8 bytes.
# If you passed a Unicode object to unquote_plus, you'd get back a
# Unicode object whose code points are the UTF-8 byte values!
'my file \xc5\x99\xc4\x9b\xc4\x85\xce\xbb\xce\xbb\xcf\x85.txt'

# Decodes key_utf8 to a Unicode string
>>> key = key_utf8.decode('utf-8')
u'my file \u0159\u011b\u0105\u03bb\u03bb\u03c5.txt'
# Note the u prefix. The UTF-8 bytes have been decoded to Unicode code points.

>>> type(key)
<type 'unicode'>

>>> print(key)
my file řěąλλυ.txt
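On Python 3 the same steps collapse into a single call, because urllib.parse.unquote_plus percent-decodes to bytes and then decodes them as UTF-8 by default. The following is a minimal sketch under that assumption and is not part of the original answer: decode_s3_key and handler are hypothetical names, and the sample event contains only the fields used here.

```python
from urllib.parse import unquote_plus


def decode_s3_key(raw_key):
    """Turn the URL-escaped key from an S3 event into the real object key."""
    # '+' becomes a space and each %XX run is decoded as UTF-8 bytes.
    return unquote_plus(raw_key, encoding="utf-8")


def handler(event, context):
    # Hypothetical Lambda entry point: returns the decoded key of every record.
    return [decode_s3_key(r["s3"]["object"]["key"]) for r in event["Records"]]


# Trimmed-down event shaped like the notification shown in the question.
sample_event = {
    "Records": [
        {"s3": {"object": {"key": "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"}}}
    ]
}

decoded_keys = handler(sample_event, None)
```

The decoded key is what should be passed to the S3 client when fetching the object; the escaped form from the event will not match the stored key.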

Background

AWS have committed the cardinal sin of changing the default encoding - https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/

The error you should've got from your decode() is:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)

The value of key is a Unicode object. In Python 2.x you can call .decode() on a Unicode object, even though it doesn't make sense. To do so, Python first encodes it to a [byte] str using the default encoding, and only then decodes that str using the encoding you passed. In Python 2.x the default encoding should be ASCII, which of course can't represent the characters used, so the implicit encode step should fail.
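The failure mode can be reproduced on Python 3 for illustration. This is a simulation under stated assumptions, not the Python 2 code path itself: decoding the percent-escapes as Latin-1 stands in for what unquote_plus does with a Unicode argument on Python 2, the UTF-8 round trip assumes the changed default encoding was UTF-8 (the answer does not say which encoding AWS chose), and all variable names are made up for this sketch.

```python
from urllib.parse import unquote_plus

raw_key = "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"

# Each escaped byte becomes one code point below U+0100, i.e. mojibake.
mojibake = unquote_plus(raw_key, encoding="latin-1")

# With an ASCII default encoding, the implicit encode step fails loudly.
try:
    mojibake.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Assumption: with a UTF-8 default encoding, encode-then-decode is a silent
# no-op, so the mojibake survives unchanged and no error is raised.
round_tripped = mojibake.encode("utf-8").decode("utf-8")

# Re-encoding as Latin-1 recovers the original bytes, which decode correctly.
repaired = mojibake.encode("latin-1").decode("utf-8")
```

Here ascii_error covers the twelve byte-valued characters that follow "my file ", round_tripped is still the garbled string, and repaired is the real filename.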

Had you got the proper UnicodeEncodeError from Python, you might have found suitable answers. On Python 3, you wouldn't have been able to call .decode() on the string at all.
