File encoding issue when downloading file from AWS S3

Problem description

I have a CSV file in AWS S3 that I'm trying to open in a local temp file. This is the code:

require 'aws-sdk-s3'
require 'tempfile'

s3 = Aws::S3::Resource.new
bucket = s3.bucket({bucket name})
obj = bucket.object({object key})
temp = Tempfile.new('temp.csv')
obj.get(response_target: temp)

It pulls the file from AWS and loads it in a new temp file called 'temp.csv'. For some files, the obj.get(..) line throws the following error:

WARN: Encoding::UndefinedConversionError: "\xEF" from ASCII-8BIT to UTF-8
WARN: /Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `write'
/Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `block in delegating_block'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/http/response.rb:62:in `signal_data'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/net_http/handler.rb:83:in `block (3 levels) in transmit'
...
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/client.rb:2666:in `get_object'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/object.rb:657:in `get'

Stacktrace shows the error initially gets thrown by the .get from the AWS SDK for Ruby.

Things I've tried:

When uploading the file (object) to AWS S3, you can specify content_encoding, so I tried setting that to UTF-8:

obj.upload_file({file path}, content_encoding: 'utf-8')

It's also possible to set response_content_encoding when calling .get:

obj.get(response_target: temp, response_content_encoding: 'utf-8')

Neither of those works; both result in the same error as above. I really expected one of them to do the trick. In the AWS S3 dashboard I can see that the content encoding is indeed set correctly via the code, but it doesn't appear to make a difference.

It does work when I do the following, in the first code snippet above:

temp = Tempfile.new('temp.csv', encoding: 'ascii-8bit')

But I'd prefer to upload and/or download the file from AWS S3 with the proper encoding. Can someone explain why specifying the encoding on the tempfile works? Or how to make it work through the AWS S3 upload/download?

Important to note: The problematic character in the error message appears to just be a random symbol added at the beginning of this auto-generated file I'm working with. I'm not worried about reading the character correctly, it gets ignored when I parse the file anyways.

Recommended answer

I don't have a full answer to all your questions, but I think I have a generalized solution, and that is to always put the temp file into binary mode. That way the AWS gem will simply dump the data from the bucket into the file, without any further re-encoding:

Step 1 (put the temp file into binary mode):

temp = Tempfile.new('temp.csv')
temp.binmode
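
A minimal sketch of why binary mode may help, using only the standard library and no S3 call. It assumes the temp file ends up with a UTF-8 external encoding (forced here with encoding: 'utf-8'; in a real app this could come from the process-wide encoding defaults) and that the SDK hands over the response body as binary (ASCII-8BIT) chunks, which is what the stack trace above suggests. The sample chunk is made up for illustration.

```ruby
require 'tempfile'

# Hypothetical response chunk: binary (ASCII-8BIT) bytes that start
# with the 3-byte UTF-8 BOM, followed by some CSV text.
chunk = "\xEF\xBB\xBFid,name\n1,Ann\n".b

# Text-mode temp file with a UTF-8 external encoding (assumed setup).
# Ruby tries to transcode the binary string to UTF-8 on write and fails.
text_file = Tempfile.new('text', encoding: 'utf-8')
text_error = begin
  text_file.write(chunk)
  nil
rescue Encoding::UndefinedConversionError => e
  e
end

# Same kind of file, switched to binary mode first: the bytes are
# written as-is, with no transcoding step.
bin_file = Tempfile.new('bin', encoding: 'utf-8')
bin_file.binmode
bin_file.write(chunk)
bin_file.flush
bin_size = File.size(bin_file.path)

puts text_error.class
puts bin_size == chunk.bytesize

# Close and delete both temp files.
[text_file, bin_file].each(&:close!)
```

Under these assumptions, Tempfile.new('temp.csv', encoding: 'ascii-8bit') from the question works for the same reason: a file whose external encoding is binary does not transcode what is written to it.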

You will however have a problem, and that is the fact that there is a 3-byte BOM header in your UTF-8 file now.

I don't know where this BOM came from. Was it there when the file was uploaded? If so, it might be a good idea to strip the 3 byte BOM before uploading.
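
If stripping the BOM before upload is an option, it could look roughly like the sketch below. strip_utf8_bom is a hypothetical helper, not part of Ruby or the AWS SDK, and the temp file only stands in for the real file you would pass to upload_file afterwards.

```ruby
require 'tempfile'

# The UTF-8 byte order mark as raw bytes.
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: removes a leading UTF-8 BOM from the file at +path+.
# Returns true if a BOM was removed, false if the file had none.
def strip_utf8_bom(path)
  data = File.binread(path)
  return false unless data.start_with?(UTF8_BOM)

  File.binwrite(path, data.byteslice(UTF8_BOM.bytesize..-1))
  true
end

# Stand-in for the file that would be uploaded.
file = Tempfile.new('with_bom')
file.binmode
file.write(UTF8_BOM + "id,name\n".b)
file.flush

removed = strip_utf8_bom(file.path)       # first call strips the BOM
remaining = File.binread(file.path)
removed_again = strip_utf8_bom(file.path) # second call finds nothing to strip
```

This reads the whole file into memory, so it is only a reasonable fit for small files.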

However, if you set up your system as below, it will not matter, because Ruby supports transparent reading of UTF-8 with or without a BOM, and will return the string correctly regardless of whether the BOM header is in the file:

Step 2 (process the file using bom|utf-8):

File.read(temp.path, encoding: "bom|utf-8")
# or...
CSV.read(temp.path,  encoding: "bom|utf-8")

This should cover all your bases I think. Whether you receive files encoded as BOM + UTF-8 or plain UTF-8, you will process them correctly this way, without any extra header characters appearing in the final string, and without errors when saving them with AWS.
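
The two steps can be tried end to end without S3. In the sketch below a locally written temp file stands in for the downloaded object, and the CSV content is made up for illustration.

```ruby
require 'csv'
require 'tempfile'

# Stand-in for the downloaded object: BOM + UTF-8 CSV, written in binary mode.
temp = Tempfile.new('temp.csv')
temp.binmode
temp.write("\xEF\xBB\xBFid,name\n1,Ann\n".b)
temp.flush

# Plain utf-8 keeps the BOM as the first character (U+FEFF).
plain = File.read(temp.path, encoding: 'utf-8')

# bom|utf-8 drops a leading BOM if there is one.
stripped = File.read(temp.path, encoding: 'bom|utf-8')

# The same encoding option works for CSV, so the first header is clean.
rows = CSV.read(temp.path, encoding: 'bom|utf-8', headers: true)

puts plain.start_with?("\uFEFF")
puts stripped.start_with?('id')
puts rows.headers.inspect
```

Without the bom| prefix, the first header would carry the invisible U+FEFF character, which is a common reason a lookup like row['id'] unexpectedly returns nil.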

Another option (from the OP)

Use obj.get.body instead, which will bypass the whole issue with response_target and Tempfile.
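
A rough sketch of handling the body in memory. To keep it runnable without AWS credentials, a StringIO stands in for obj.get.body; that the real body behaves like an IO returning binary data is an assumption here, and the content is made up.

```ruby
require 'stringio'

# Stand-in for obj.get.body (assumption: IO-like, yields binary data).
body = StringIO.new("\xEF\xBB\xBFid,name\n1,Ann\n".b)

raw = body.read                                  # binary (ASCII-8BIT) string

# Relabel the bytes as UTF-8. This does not transcode anything,
# so it cannot raise a conversion error.
text = raw.dup.force_encoding(Encoding::UTF_8)

# Drop a leading BOM if present; a no-op otherwise.
text = text.delete_prefix("\uFEFF")

puts text.valid_encoding?
```

This holds the whole object in memory, so for large files the binmode temp file approach above is likely the better fit.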

Useful references:
Is there a way to remove the BOM from a UTF-8 encoded file?
How to avoid tripping over UTF-8 BOM when reading files
What's the difference between UTF-8 and UTF-8 without BOM?
How to write BOM marker to a file in Ruby
