Get non-ASCII filename from S3 notification event in Lambda
Question
The key field in an AWS S3 notification event, which denotes the filename, is URL-escaped.
This is evident when the filename contains spaces or non-ASCII characters.
For example, I have uploaded a file with the following name to S3:
my file řěąλλυ.txt
The notification received is:
{
  "Records": [
    {
      "s3": {
        "object": {
          "key": u"my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"
        }
      }
    }
  ]
}
I've tried to decode it using:
key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf-8')
But this yields:
my file ÅÄÄλλÏ.txt
Of course, when I then try to get the file from S3 using Boto, I get a 404 error.
Recommended answer
tl;dr
You need to convert the URL-encoded Unicode string to a byte str before URL-unquoting it and decoding the result as UTF-8.
For example, for an S3 object with the filename my file řěąλλυ.txt (Python 2 session):
>>> import urllib
>>> # Encode the Unicode key to a UTF-8 encoded [byte] str. The key shouldn't
>>> # contain any non-ASCII at this point, but UTF-8 is the safer choice.
>>> utf8_urlencoded_key = event['Records'][0]['s3']['object']['key'].encode('utf-8')
>>> utf8_urlencoded_key
'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'
>>> # URL-unquote the byte str: the percent-escapes become raw UTF-8 bytes.
>>> # If you passed a Unicode object to unquote_plus, you'd get back a
>>> # Unicode object whose code points are the UTF-8 byte values!
>>> key_utf8 = urllib.unquote_plus(utf8_urlencoded_key)
>>> key_utf8
'my file \xc5\x99\xc4\x9b\xc4\x85\xce\xbb\xce\xbb\xcf\x85.txt'
>>> # Decode key_utf8 to a Unicode string. Note the u prefix in the result:
>>> # the UTF-8 bytes have been decoded to Unicode code points.
>>> key = key_utf8.decode('utf-8')
>>> key
u'my file \u0159\u011b\u0105\u03bb\u03bb\u03c5.txt'
>>> type(key)
<type 'unicode'>
>>> print(key)
my file řěąλλυ.txt
Background
AWS has committed the cardinal sin of changing the default encoding: https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/
The error you should have got from your decode() call is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)
The value of key is a Unicode object. In Python 2.x you can call decode() on a Unicode object, even though it doesn't make sense. To do so, Python first tries to encode the object to a [byte] str with the default encoding, and only then decodes that str with the encoding you gave. In Python 2.x the default encoding should be ASCII, which of course can't represent the characters used.
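The implicit encode step described here can be roughly simulated on Python 3. This is only a sketch under stated assumptions: it assumes Python 3, it uses Latin-1 unquoting to mimic what Python 2's unquote_plus returns for a Unicode argument, and the variable names are illustrative rather than taken from the original post.

```python
from urllib.parse import unquote_plus

raw_key = "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"

# Assumption: mimic Python 2's unquote_plus on a Unicode object, where each
# escaped byte becomes one code point with the same numeric value (Latin-1).
mis_decoded = unquote_plus(raw_key, encoding="latin-1")

# Python 2's unicode.decode() first encodes with the default encoding.
# With the stock ASCII default, that implicit step fails:
try:
    mis_decoded.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)
    print(error_message)

# Making the byte round trip explicit recovers the real filename.
recovered = mis_decoded.encode("latin-1").decode("utf-8")
```

With a default encoding changed to UTF-8, the implicit encode succeeds and decode('utf-8') simply hands back the same mis-decoded string, which is consistent with the garbled filename shown in the question.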
Had you got the proper UnicodeEncodeError from Python, you might have found suitable answers. On Python 3, you wouldn't have been able to call .decode() on the string at all.
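For comparison, on Python 3 the whole round trip likely collapses to a single call, because urllib.parse.unquote_plus decodes percent-escapes as UTF-8 by default. The sketch below assumes a Python 3 runtime; the helper name object_keys and the trimmed sample event are hypothetical and not part of the original post.

```python
from urllib.parse import unquote_plus


def object_keys(event):
    """Return the decoded object key of every record in an S3 event."""
    # unquote_plus turns '+' into spaces and decodes %XX escapes as UTF-8
    # by default, so no manual encode/decode round trip is needed.
    return [
        unquote_plus(record["s3"]["object"]["key"])
        for record in event["Records"]
    ]


# Hypothetical minimal event, trimmed to the fields used here.
sample_event = {
    "Records": [
        {"s3": {"object": {"key": "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"}}}
    ]
}
keys = object_keys(sample_event)
```

The decoded key can then be passed to whichever S3 client call fetches the object.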