Processing a Django UploadedFile as UTF-8 with universal newlines
Question
In my Django application, I provide a form that allows users to upload a file. The file can be in a variety of formats (Excel, CSV), come from a variety of platforms (Mac, Linux, Windows), and be encoded in a variety of encodings (ASCII, UTF-8).
For the purpose of this question, let's assume that I have a view which is receiving request.FILES['file'], which is an instance of InMemoryUploadedFile, called file. My problem is that InMemoryUploadedFile objects (like file):
- Do not support UTF-8 encoding (I see a \xef\xbb\xbf at the beginning of the file, which as I understand it is a flag meaning 'this file is UTF-8').
- Do not support universal newlines (which probably the majority of the files uploaded to this system will need).
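To make the two complaints above concrete, here is a minimal sketch in Python 3 (an assumption; the question itself targets Python 2). The byte string is made-up sample data. It shows that \xef\xbb\xbf is the UTF-8 byte order mark, that the standard "utf-8-sig" codec drops it while decoding, and that str.splitlines() accepts Windows, old Mac, and Unix line endings alike.

```python
import codecs

# Hypothetical upload contents: a UTF-8 BOM followed by three lines that
# use three different line endings (\r\n, \r, and \n).
raw = b"\xef\xbb\xbfid,name\r\n1,Ana\r2,Bo\n"

# The three leading bytes are the UTF-8 byte order mark.
has_bom = raw.startswith(codecs.BOM_UTF8)

# "utf-8-sig" decodes UTF-8 and strips a leading BOM if one is present.
text = raw.decode("utf-8-sig")

# splitlines() treats \r\n, \r, and \n as line boundaries.
lines = text.splitlines()

print(has_bom, lines)
```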
Complicating the issue is that I wish to pass the file in to the Python csv module, which does not natively support Unicode. I will happily accept answers that avoid this issue - once I get Django playing nice with UTF-8 I'm sure I can bludgeon csv into doing the same. (Similarly, please ignore the requirement to support Excel - I am waiting until CSV works before I tackle parsing Excel files.)
I have tried using StringIO, mmap, codecs, and a wide variety of ways of accessing the data in an InMemoryUploadedFile object. Each approach has yielded different errors, and none so far has been perfect. This shows some of the code that I feel came the closest:
    import csv
    import codecs

    class CSVParser:
        def __init__(self, file):
            # 'file' is assumed to be an InMemoryUploadedFile object.
            dialect = csv.Sniffer().sniff(codecs.EncodedFile(file, "utf-8").read(1024))
            file.open()  # seek to 0
            self.reader = csv.reader(codecs.EncodedFile(file, "utf-8"),
                                     dialect=dialect)
            try:
                self.field_names = self.reader.next()
            except StopIteration:
                # The file was empty - this is not allowed.
                raise ValueError('Unrecognized format (empty file)')
            if len(self.field_names) <= 1:
                # This probably isn't a CSV file at all.
                # Note that the csv module will (incorrectly) parse ALL files, even
                # binary data. This will catch most such files.
                raise ValueError('Unrecognized format (too few columns)')

        # Additional methods snipped, unrelated to issue
Please note that I haven't spent too much time on the actual parsing algorithm, so it may be wildly inefficient; right now I'm more concerned with getting encoding to work as expected.
The problem is that the results are also not encoded, despite being wrapped in the Unicode codecs.EncodedFile file wrapper.
EDIT: It turns out the above code does in fact work. codecs.EncodedFile(file, "utf-8") is the ticket. The reason I thought it didn't work was that the terminal I was using does not support UTF-8. Live and learn!
Thanks for any help, and please let me know if I can supply you with more information.
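Since the confusion came from the terminal rather than from the data, one way to check decoded values without relying on the terminal is to print their escaped form. This is a small sketch in Python 3 (an assumption; the sample value is made up), not something taken from the original post.

```python
import sys

# Hypothetical value read from an uploaded file.
value = "Zo\u00eb"
encoded = value.encode("utf-8")

# The encoding the terminal claims to support (may be None when redirected).
terminal_encoding = getattr(sys.stdout, "encoding", None)
print("stdout encoding:", terminal_encoding)

# ascii() and the repr of bytes use only ASCII characters, so they display
# the same way on any terminal, whether or not it can render UTF-8.
print(ascii(value), repr(encoded))
```

If the escaped output shows the expected code points and bytes, the decoding step is fine and any garbled characters are a display issue.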
Answer
As mentioned above, the code snippet I provided was in fact working as intended - the problem was with my terminal, and not with Python's encoding handling.
If your view needs to access a UTF-8 UploadedFile, you can just use utf8_file = codecs.EncodedFile(request.FILES['file_field'], "utf-8") to open a file object in the correct encoding.
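For readers on Python 3, where the csv module expects text rather than bytes, a different standard-library wrapper may be a closer fit. The sketch below is not the approach from this answer: it swaps in io.TextIOWrapper with the "utf-8-sig" codec, and it uses io.BytesIO as a stand-in for the uploaded file. The helper name read_csv_rows and the sample data are hypothetical, and whether a Django upload object can be passed directly or needs its underlying file object is an assumption to verify against your Django version.

```python
import csv
import io


def read_csv_rows(binary_file):
    """Decode a binary file-like object as UTF-8 and return its CSV rows."""
    # "utf-8-sig" drops a leading BOM if present. newline="" leaves line
    # endings untranslated so the csv module can handle them itself.
    text = io.TextIOWrapper(binary_file, encoding="utf-8-sig", newline="")
    try:
        return list(csv.reader(text))
    finally:
        # Detach so the underlying file is not closed with the wrapper.
        text.detach()


# Stand-in for an uploaded file: BOM, Windows line endings, non-ASCII text.
upload = io.BytesIO(b"\xef\xbb\xbfname,city\r\nZo\xc3\xab,K\xc3\xb6ln\r\n")
rows = read_csv_rows(upload)
print(rows)
```

The wrapper decodes lazily, so large uploads are not read into memory twice before parsing.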
I also noticed that, at least for InMemoryUploadedFile objects, opening the file through the codecs.EncodedFile wrapper does NOT reset the seek() position of the underlying file. To return to the beginning of the file (again, this may be InMemoryUploadedFile specific) I just used request.FILES['file_field'].open() to send the seek() position back to 0.
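The position behavior can be reproduced without Django. In this sketch io.BytesIO stands in for the uploaded file (an assumption, as is the sample data), and a plain seek(0) plays the role that open() plays in the answer above.

```python
import codecs
import io

# Stand-in for the uploaded file (hypothetical data).
upload = io.BytesIO(b"name,city\r\nZo\xc3\xab,K\xc3\xb6ln\r\n")

# Reading through one wrapper advances the underlying stream.
codecs.EncodedFile(upload, "utf-8").read(1024)
position_after_read = upload.tell()

# A second wrapper starts wherever the stream currently is, so here it
# finds nothing left to read.
leftover = codecs.EncodedFile(upload, "utf-8").read()

# Rewind explicitly before wrapping again.
upload.seek(0)
first_line = codecs.EncodedFile(upload, "utf-8").readline()

print(position_after_read, leftover, first_line)
```

This is why the parser in the question rewinds the file between sniffing the dialect and building the reader.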