Get non-ASCII filename from S3 notification event in Lambda


Question

The key field in an AWS S3 notification event, which holds the object's filename, is URL-escaped.

This is evident when the filename contains spaces or non-ASCII characters.

For example, I uploaded a file with the following name to S3:

my file řěąλλυ.txt

The notification I received was:

{
  "Records": [
    {
      "s3": {
        "object": {
          "key": u"my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"
        }
      }
    }
  ]
}

I tried to decode it using:

key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf-8')

But this yields:

my file ÅÄÄλλÏ.txt

Of course, when I then try to get the file from S3 using Boto, I get a 404 error.

Recommended answer

tl;dr

You need to convert the URL-encoded Unicode string to a [byte] str before URL-unquoting it, and only then decode the result as UTF-8.

For example, for an S3 object with the filename my file řěąλλυ.txt:

>>> import urllib

# Encode the Unicode key to a UTF-8 [byte] str. The key shouldn't contain any
# non-ASCII characters at this point, but UTF-8 is the safer choice.
>>> utf8_urlencoded_key = event['Records'][0]['s3']['object']['key'].encode('utf-8')
>>> utf8_urlencoded_key
'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'

# The URL-escaped sequences are now converted to raw UTF-8 bytes.
# Had you passed a unicode object to unquote_plus, you would have got back
# a unicode object whose code points are the UTF-8 byte values!
>>> key_utf8 = urllib.unquote_plus(utf8_urlencoded_key)
>>> key_utf8
'my file \xc5\x99\xc4\x9b\xc4\x85\xce\xbb\xce\xbb\xcf\x85.txt'

# Decode the UTF-8 bytes to a Unicode string. Note the u prefix on the result:
# the bytes have been decoded to Unicode code points.
>>> key = key_utf8.decode('utf-8')
>>> key
u'my file \u0159\u011b\u0105\u03bb\u03bb\u03c5.txt'

>>> type(key)
<type 'unicode'>

>>> print(key)
my file řěąλλυ.txt
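
The session above is Python 2. On Python 3 the same result should take a single call, because urllib.parse.unquote_plus accepts a str and decodes the percent-escaped bytes as UTF-8 by default. The following is a minimal sketch assuming a Python 3 runtime; the helper names decode_s3_key and iter_object_keys are hypothetical, and the sample event contains only the fields shown in the question.

```python
from urllib.parse import unquote_plus


def decode_s3_key(raw_key):
    """Turn the URL-escaped key from an S3 notification into the real object key."""
    # '+' becomes a space and %XX escapes are decoded as UTF-8 bytes.
    return unquote_plus(raw_key, encoding="utf-8")


def iter_object_keys(event):
    """Yield the decoded key of every record in the event (hypothetical helper)."""
    for record in event.get("Records", []):
        yield decode_s3_key(record["s3"]["object"]["key"])


# Sample event reduced to the fields shown in the question.
sample_event = {
    "Records": [
        {"s3": {"object": {"key": "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"}}}
    ]
}

keys = list(iter_object_keys(sample_event))
print(keys[0])
```

The decoded key, not the escaped one from the event, is what should be passed to the S3 client when fetching the object.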

Background

AWS has committed the cardinal sin of changing the default encoding: https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/

The error you should have got from your decode() call is:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)

The value of key is a unicode object. In Python 2.x you can call .decode() on a unicode object, even though it doesn't make sense: Python first implicitly encodes it to a [byte] str using the default encoding, and then decodes that str using the encoding you gave. In Python 2.x the default encoding should be ASCII, which of course can't represent the characters used.
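
The implicit encode step can be simulated on Python 3 to make the failure mode concrete. This is only an illustrative sketch, not code from the Lambda runtime. Decoding the escapes as Latin-1 is an assumption used to mimic what Python 2's unquote_plus returns for a unicode argument (one code point per UTF-8 byte), and the UTF-8 round trip assumes the default encoding had been changed to UTF-8.

```python
from urllib.parse import unquote_plus

raw_key = "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"

# Mimic Python 2's unquote_plus(unicode): each UTF-8 byte becomes one code point.
mojibake = unquote_plus(raw_key, encoding="latin-1")

# With an ASCII default, the implicit encode inside unicode.decode() fails.
try:
    mojibake.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# With a UTF-8 default (assumed here), the encode/decode round trip succeeds
# silently and hands back the same wrong string, so no error is raised.
round_tripped = mojibake.encode("utf-8").decode("utf-8")

# Recovering the real name means getting back to the raw bytes first.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)
```

The first branch corresponds to the UnicodeEncodeError quoted above; the second corresponds to the garbled filename in the question.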

Had you got the proper UnicodeEncodeError from Python, you might have found suitable answers. On Python 3, you wouldn't have been able to call .decode() on the string at all.
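
As a small illustration of that last point, assuming a Python 3 interpreter: text (str) only has .encode() and binary data (bytes) only has .decode(), so the nonsensical "decode a Unicode string" call cannot be written in the first place.

```python
text = "my file řěąλλυ.txt"

# str has no .decode() in Python 3, so text.decode('utf-8') raises AttributeError.
has_decode = hasattr(text, "decode")

# The only direction available from str is encoding to bytes...
data = text.encode("utf-8")

# ...and bytes, in turn, has no .encode().
has_encode = hasattr(data, "encode")

# Decoding the bytes returns the original text.
restored = data.decode("utf-8")
print(has_decode, has_encode, restored == text)
```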
