Processing a Django UploadedFile as UTF-8 with universal newlines


Problem description

In my Django application, I provide a form that allows users to upload a file. The file can be in a variety of formats (Excel, CSV), come from a variety of platforms (Mac, Linux, Windows), and use a variety of encodings (ASCII, UTF-8).

For the purpose of this question, let's assume that I have a view which is receiving request.FILES['file'], which is an instance of InMemoryUploadedFile, called file. My problem is that InMemoryUploadedFile objects (like file):

  1. Do not support UTF-8 encoding (I see \xef\xbb\xbf at the beginning of the file, which as I understand it is the UTF-8 byte order mark, a flag meaning 'this file is UTF-8').
  2. Do not support universal newlines (which probably the majority of the files uploaded to this system will need).
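Both points in the list above can be reproduced without Django. The following is a minimal sketch, assuming Python 3 and using io.BytesIO as a stand-in for the uploaded file; decode_upload is a hypothetical helper name, not part of Django. The utf-8-sig codec drops a leading byte order mark, and newline=None turns on universal newline translation.

```python
import io


def decode_upload(binary_stream):
    """Decode a binary stream as UTF-8, dropping a BOM and normalising newlines.

    Hypothetical helper: `binary_stream` stands in for the uploaded file.
    """
    text_stream = io.TextIOWrapper(
        binary_stream,
        encoding="utf-8-sig",  # strips \xef\xbb\xbf if it is present
        newline=None,          # universal newlines: \r, \r\n and \n become \n
    )
    return text_stream.read()


# A sample with a BOM and Windows, old Mac and Unix line endings mixed together.
sample = b"\xef\xbb\xbfname,city\r\nZo\xc3\xab,Z\xc3\xbcrich\rBob,Oslo\n"
decoded = decode_upload(io.BytesIO(sample))
print(ascii(decoded))
```

Printing through ascii() keeps the output readable even on a terminal that cannot display UTF-8.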

Complicating the issue is that I wish to pass the file to the Python csv module, which does not natively support Unicode. I will happily accept answers that avoid this issue: once I get Django playing nice with UTF-8, I'm sure I can bludgeon csv into doing the same. (Similarly, please ignore the requirement to support Excel. I am waiting until CSV works before I tackle parsing Excel files.)

I have tried using StringIO, mmap, codecs, and a wide variety of ways of accessing the data in an InMemoryUploadedFile object. Each approach has yielded different errors, and none so far has been perfect. This shows the code that I feel came the closest:

import csv
import codecs

class CSVParser:
    def __init__(self, file):
        # 'file' is assumed to be an InMemoryUploadedFile object.
        dialect = csv.Sniffer().sniff(codecs.EncodedFile(file, "utf-8").read(1024))
        file.open()  # seek back to 0 after sniffing
        self.reader = csv.reader(codecs.EncodedFile(file, "utf-8"),
                                 dialect=dialect)
        try:
            # next(reader) is the portable spelling of reader.next()
            self.field_names = next(self.reader)
        except StopIteration:
            # The file was empty - this is not allowed.
            raise ValueError('Unrecognized format (empty file)')

        if len(self.field_names) <= 1:
            # This probably isn't a CSV file at all.
            # Note that the csv module will (incorrectly) parse ALL files, even
            # binary data. This will catch most such files.
            raise ValueError('Unrecognized format (too few columns)')

        # Additional methods snipped, unrelated to issue

Please note that I haven't spent much time on the actual parsing algorithm, so it may be wildly inefficient. Right now I'm more concerned with getting the encoding to work as expected.
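For comparison, here is a sketch of the same checks under Python 3, where the csv module accepts text directly. This is not the code from the question: read_csv_upload is a hypothetical name, io.BytesIO stands in for the uploaded file, the candidate delimiters are an assumption, and the whole upload is assumed to fit in memory.

```python
import csv
import io


def read_csv_upload(binary_stream, sample_size=1024):
    """Return (field_names, rows) from a binary stream holding UTF-8 CSV data.

    Hypothetical helper; assumes the upload is small enough to read at once.
    """
    # utf-8-sig drops a leading byte order mark if there is one.
    text = binary_stream.read().decode("utf-8-sig")
    if not text:
        raise ValueError("Unrecognized format (empty file)")

    try:
        # Restricting the candidate delimiters is an assumption of this sketch.
        dialect = csv.Sniffer().sniff(text[:sample_size], delimiters=",;\t")
    except csv.Error:
        raise ValueError("Unrecognized format (no delimiter found)")

    # newline="" leaves line endings untouched so csv can handle them itself.
    reader = csv.reader(io.StringIO(text, newline=""), dialect=dialect)
    field_names = next(reader)
    if len(field_names) <= 1:
        raise ValueError("Unrecognized format (too few columns)")
    return field_names, list(reader)


data = b"\xef\xbb\xbfname,city\r\nZo\xc3\xab,Z\xc3\xbcrich\r\nBob,Oslo\r\n"
field_names, rows = read_csv_upload(io.BytesIO(data))
print(ascii(field_names), ascii(rows))
```

Decoding the whole upload first avoids having to rewind the stream between sniffing and parsing, at the cost of holding the decoded text in memory.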

The problem is that the results still do not come out correctly encoded, even though the file is wrapped in the codecs.EncodedFile wrapper.

EDIT: It turns out, the above code does in fact work. codecs.EncodedFile(file,"utf-8") is the ticket. It turns out the reason I thought it didn't work was that the terminal I was using does not support UTF-8. Live and learn!

Solution

As mentioned above, the code snippet I provided was in fact working as intended - the problem was with my terminal, and not with python encoding.
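One way to rule out the terminal when debugging this kind of problem is to print an ASCII-only representation of the data instead of the data itself. The sketch below assumes Python 3; safe_repr is a hypothetical helper name.

```python
import sys


def safe_repr(value):
    """Return an ASCII-only representation of bytes or text.

    Hypothetical helper: non-ASCII characters are shown as escapes, so the
    output looks the same on every terminal.
    """
    return ascii(value)


name = "Zo\u00eb"
print("terminal encoding:", sys.stdout.encoding)
print(safe_repr(name))                  # the text, with escapes
print(safe_repr(name.encode("utf-8")))  # the UTF-8 bytes, with escapes
```

If the escaped bytes show \xc3\xab for the accented letter, the data is valid UTF-8 and any garbling on screen comes from the terminal.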

If your view needs to access a UTF-8 UploadedFile, you can just use utf8_file = codecs.EncodedFile(request.FILES['file_field'],"utf-8") to open a file object in the correct encoding.
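The behaviour of codecs.EncodedFile can be checked outside Django. In this sketch io.BytesIO stands in for request.FILES['file_field'], and the sample data is made up. Per the standard library, bytes read through the wrapper are decoded with file_encoding and re-encoded with data_encoding; when file_encoding is omitted it defaults to data_encoding, so a wrapper created with only "utf-8" passes valid UTF-8 bytes through unchanged.

```python
import codecs
import io

text = "Zo\u00eb,Z\u00fcrich\n"

# Stand-in for an upload that is already UTF-8.
utf8_upload = io.BytesIO(text.encode("utf-8"))
utf8_file = codecs.EncodedFile(utf8_upload, "utf-8")
passed_through = utf8_file.read()

# Stand-in for an upload in another encoding (Latin-1 is an assumption here).
latin1_upload = io.BytesIO(text.encode("latin-1"))
recoded_file = codecs.EncodedFile(latin1_upload, "utf-8", "latin-1")
recoded = recoded_file.read()

print(ascii(passed_through))
print(ascii(recoded))
```

Both reads return UTF-8 encoded bytes, which is the form the Python 2 csv module in the question expects.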

I also noticed that, at least for InMemoryUploadedFiles, opening the file through the codecs.EncodedFile wrapper does NOT reset the seek() position of the file descriptor. To return to the beginning of the file (again, this may be InMemoryUploadedFile specific) I just used request.FILES['file_field'].open() to send the seek() position back to 0.
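The seek behaviour described above can be illustrated with a plain io.BytesIO as a stand-in for the uploaded file. In this sketch upload.seek(0) plays the role that request.FILES['file_field'].open() plays in the answer; the sample data is made up.

```python
import codecs
import io

data = b"name,city\nBob,Oslo\n"
upload = io.BytesIO(data)  # stand-in for the uploaded file

# Reading a sample through one wrapper advances the underlying stream.
sample = codecs.EncodedFile(upload, "utf-8").read(1024)
position_after_sample = upload.tell()

# A second wrapper does not rewind: it continues from the current position.
leftover = codecs.EncodedFile(upload, "utf-8").read()

# Rewind explicitly before wrapping the stream again.
upload.seek(0)
full = codecs.EncodedFile(upload, "utf-8").read()

print(position_after_sample, leftover, full)
```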
