Processing a Django UploadedFile as UTF-8 with universal newlines

Question

In my Django application, I provide a form that allows users to upload a file. The file can be in a variety of formats (Excel, CSV), come from a variety of platforms (Mac, Linux, Windows), and be encoded in a variety of encodings (ASCII, UTF-8).

For the purpose of this question, let's assume that I have a view which is receiving request.FILES['file'], which is an instance of InMemoryUploadedFile, called file. My problem is that InMemoryUploadedFile objects (like file):

  1. Do not support UTF-8 encoding (I see a \xef\xbb\xbf at the beginning of the file, which as I understand is a flag meaning 'this file is UTF-8').
  2. Do not support universal newlines (which probably the majority of the files uploaded to this system will need).
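
Both symptoms can be reproduced outside Django. The following is a minimal sketch, assuming Python 3, with an `io.BytesIO` standing in for the uploaded file; `open_as_text` is a hypothetical helper name, not part of Django. The `utf-8-sig` codec drops a leading `\xef\xbb\xbf` (the UTF-8 byte order mark) when present, and `newline=None` enables universal newlines.

```python
import io


def open_as_text(binary_stream):
    # Hypothetical helper: decode as UTF-8, dropping a leading BOM if present.
    # newline=None enables universal newlines, so CRLF and bare CR line
    # endings are both translated to LF when reading.
    return io.TextIOWrapper(binary_stream, encoding="utf-8-sig", newline=None)


# Stand-in for an upload: a BOM, then Windows, old-Mac and Unix line endings.
raw = io.BytesIO(b"\xef\xbb\xbfname,city\r\nZo\xc3\xab,K\xc3\xb6ln\rBob,Oslo\n")
lines = open_as_text(raw).read().splitlines()
```

Whether a real `InMemoryUploadedFile` can be passed to `io.TextIOWrapper` directly is not confirmed here; that part is an assumption to check against your Django version.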

Complicating the issue is that I wish to pass the file in to the Python csv module, which does not natively support Unicode. I will happily accept answers that avoid this issue - once I get Django playing nice with UTF-8 I'm sure I can bludgeon csv into doing the same. (Similarly, please ignore the requirement to support Excel - I am waiting until CSV works before I tackle parsing Excel files.)
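
For readers on Python 3, the csv limitation mentioned above no longer applies, because `csv.reader` consumes text (`str`) lines directly. A minimal sketch under that assumption, again with `io.BytesIO` standing in for the upload and `read_csv_rows` as a hypothetical helper name:

```python
import csv
import io


def read_csv_rows(binary_stream, encoding="utf-8-sig"):
    """Decode a binary stream and parse it as CSV, returning a list of rows."""
    # newline="" leaves line endings untouched so that the csv module can
    # interpret them itself, as the csv documentation recommends.
    text = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    return list(csv.reader(text))


# Stand-in for an uploaded UTF-8 CSV file that starts with a BOM.
raw = io.BytesIO(b"\xef\xbb\xbfname,city\r\nZo\xc3\xab,K\xc3\xb6ln\r\nBob,Oslo\r\n")
rows = read_csv_rows(raw)
```

The question itself targets Python 2, where the cells come back as byte strings and have to be decoded separately.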

I have tried using StringIO, mmap, codecs, and a wide variety of ways of accessing the data in an InMemoryUploadedFile object. Each approach has yielded different errors, and none so far has been perfect. This shows some of the code that I feel came the closest:

import csv
import codecs

class CSVParser:
    def __init__(self, file):
        # 'file' is assumed to be an InMemoryUploadedFile object.
        dialect = csv.Sniffer().sniff(codecs.EncodedFile(file, "utf-8").read(1024))
        file.open()  # seek to 0
        self.reader = csv.reader(codecs.EncodedFile(file, "utf-8"),
                                 dialect=dialect)
        try:
            self.field_names = next(self.reader)
        except StopIteration:
            # The file was empty - this is not allowed.
            raise ValueError('Unrecognized format (empty file)')

        if len(self.field_names) <= 1:
            # This probably isn't a CSV file at all.
            # Note that the csv module will (incorrectly) parse ALL files, even
            # binary data. This will catch most such files.
            raise ValueError('Unrecognized format (too few columns)')

        # Additional methods snipped, unrelated to issue

Please note that I haven't spent much time on the actual parsing algorithm, so it may be wildly inefficient. Right now I'm more concerned with getting the encoding to work as expected.

The problem is that the results still do not appear to be encoded correctly, despite the file being wrapped in the codecs.EncodedFile wrapper.

EDIT: It turns out, the above code does in fact work. codecs.EncodedFile(file,"utf-8") is the ticket. It turns out the reason I thought it didn't work was that the terminal I was using does not support UTF-8. Live and learn!

Thanks for any help, and please let me know if I can supply you with more information.

Solution

As mentioned above, the code snippet I provided was in fact working as intended - the problem was with my terminal, and not with Python's encoding handling.

If your view needs to access a UTF-8 UploadedFile, you can just use utf8_file = codecs.EncodedFile(request.FILES['file_field'],"utf-8") to open a file object in the correct encoding.
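
A minimal sketch of what the wrapper does, assuming Python 3 and an `io.BytesIO` as a stand-in for `request.FILES['file_field']`. Reads through `codecs.EncodedFile` return bytes that have been decoded and re-encoded as UTF-8 on the way, so valid UTF-8 passes through unchanged while bytes that are not valid UTF-8 raise `UnicodeDecodeError`.

```python
import codecs
import io

# Valid UTF-8 input comes back byte-for-byte identical.
valid = io.BytesIO(b"name,city\nZo\xc3\xab,K\xc3\xb6ln\n")
utf8_file = codecs.EncodedFile(valid, "utf-8")
data = utf8_file.read()  # bytes, still UTF-8 encoded

# Latin-1 bytes for the same names are not valid UTF-8, so the read fails.
invalid = io.BytesIO("Zo\xeb,K\xf6ln\n".encode("latin-1"))
try:
    codecs.EncodedFile(invalid, "utf-8").read()
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

In other words, the wrapper mainly acts as a validity check here; the data still arrives as encoded bytes rather than as Unicode text.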

I also noticed that, at least for InMemoryUploadedFiles, opening the file through the codecs.EncodedFile wrapper does NOT reset the seek() position of the file descriptor. To return to the beginning of the file (again, this may be InMemoryUploadedFile specific) I just used request.FILES['file_field'].open() to send the seek() position back to 0.
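
The rewind step can be sketched as follows, assuming Python 3 and an `io.BytesIO` in place of the uploaded file. The stand-in is rewound with `seek(0)`; treating that as equivalent to the `open()` call on the upload object is an assumption made for illustration.

```python
import codecs
import io

raw = io.BytesIO(b"name,city\nAnna,Oslo\n")

# Peek at the first few bytes through one wrapper, e.g. for dialect sniffing.
sample = codecs.EncodedFile(raw, "utf-8").read(4)
position_after_peek = raw.tell()  # the underlying stream has moved forward

# Creating a new wrapper does not rewind anything, so rewind the underlying
# stream explicitly before wrapping it again for the real pass.
raw.seek(0)
everything = codecs.EncodedFile(raw, "utf-8").read()
```

Without the rewind, the second read would start wherever the first one stopped and the header row would be partly lost.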
