Error Reading an Uploaded CSV Using Dask in Django: 'InMemoryUploadedFile' object has no attribute 'startswith'

Problem description

I'm building a Django app that enables users to upload a CSV via a form using a FormField. Once the CSV is imported I use the Pandas read_csv(filename) command to read in the CSV so I can do some processing on the CSV using Pandas.

I've recently started learning the really useful Dask library because the size of the uploaded files can be larger than memory. Everything works fine when using Pandas pd.read_csv(filename) but when I try and use Dask dd.read_csv(filename) I get the error "'InMemoryUploadedFile' object has no attribute 'startswith'".

I'm pretty new to Django, Pandas and Dask. I've searched high and low and can't seem to find this error when associated with Dask anywhere on Google.

Here is the code I'm trying to use below (just the relevant bits... I hope):

Inside forms.py I have:

from django import forms

class ImportFileForm(forms.Form):
    file_name = forms.FileField(label='Select a csv', validators=[validate_file_extension, file_size])

Inside views.py:

import pandas as pd
import codecs
import dask.array as da
import dask.dataframe as dd

from dask.distributed import Client
client = Client()

def import_csv(request):

    if request.method == 'POST':
        form = ImportFileForm(request.POST, request.FILES)
        if form.is_valid():

             utf8_file = codecs.EncodedFile(request.FILES['file_name'].open(),"utf-8")

             # IF I USE THIS PANDAS LINE IT WORKS AND I CAN THEN USE PANDAS TO PROCESS THE FILE
             #df_in = pd.read_csv(utf8_file)

             # IF I USE THIS DASK LINE IT DOES NOT WORK AND PRODUCES THE ERROR
             df_in = dd.read_csv(utf8_file)

And here is the error output I'm getting:

AttributeError at /import_data/import_csv/
'InMemoryUploadedFile' object has no attribute 'startswith'

/home/username/projects/myproject/import_data/services.py in save_imported_doc
    df_in = dd.read_csv(utf8_file) …
▶ Local vars
/home/username/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read
            **kwargs …
▶ Local vars
/home/username/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas
        **(storage_options or {}) …
▶ Local vars
/home/username/anaconda3/lib/python3.7/site-packages/dask/bytes/core.py in read_bytes
    fs, fs_token, paths = get_fs_token_paths(urlpath, mode="rb", storage_options=kwargs) …
▶ Local vars
/home/username/anaconda3/lib/python3.7/site-packages/fsspec/core.py in get_fs_token_paths
        path = cls._strip_protocol(urlpath) …
▶ Local vars
/home/username/anaconda3/lib/python3.7/site-packages/fsspec/implementations/local.py in _strip_protocol
        if path.startswith("file://"): …
▶ Local vars
/home/username/anaconda3/lib/python3.7/codecs.py in __getattr__
        return getattr(self.stream, name) 


Recommended answer

I finally got it working. Here's a Django specific solution building on the answer from @mdurant who thankfully pointed me in the right direction.

By default Django keeps uploads under 2.5MB in memory, so there is no file on disk for Dask to read. Unlike pd.read_csv, which accepts a file-like object, dd.read_csv expects a path to a location in actual storage, which is why it fails when handed the uploaded file object. When the file is over 2.5MB, Django instead writes it to a temp folder, and its path can be retrieved with the uploaded file's temporary_file_path() method. Dask can use this temp file path directly. I found some really useful information about how Django actually handles files in the background in their docs: https://docs.djangoproject.com/en/3.0/ref/files/uploads/#custom-upload-handlers.
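To make the failure concrete, here is a minimal sketch of the kind of check the traceback ends in. Note that `strip_protocol` and `try_strip` are simplified stand-ins written for illustration, not fsspec's actual code; the only thing assumed, as the traceback shows, is that the argument is treated as a path string and `startswith` is called on it.

```python
import io


def strip_protocol(path):
    # Simplified stand-in (an assumption, not fsspec's real implementation):
    # the argument is treated as a path string.
    if path.startswith("file://"):
        return path[len("file://"):]
    return path


def try_strip(obj):
    # Report the error as text instead of raising, for demonstration only.
    try:
        return strip_protocol(obj)
    except AttributeError as exc:
        return f"AttributeError: {exc}"


# A path string passes through the check without trouble.
print(try_strip("file:///tmp/data.csv"))

# A file-like object has no startswith(), so the same check fails.
print(try_strip(io.BytesIO(b"a,b\n1,2\n")))
```

This is why handing Dask a path works, while handing it the uploaded file object does not.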

If you can't predict your users' upload sizes in advance (as in my case) and a file may come in under 2.5MB, you can change FILE_UPLOAD_HANDLERS in your Django settings file so that all files are written to a temp storage folder regardless of size, so Dask can always access them.

Here is how I changed my code in case this is helpful for anyone else in the same situation.

views.py

import dask.dataframe as dd

def import_csv(request):

    if request.method == 'POST':
        form = ImportFileForm(request.POST, request.FILES)
        if form.is_valid():

            # temporary_file_path() tells Dask where to find the file on disk
            df_in = dd.read_csv(request.FILES['file_name'].temporary_file_path())

And in settings.py, adding the setting below makes Django always write an uploaded file to temp storage, whether or not it is under 2.5MB, so Dask can always access it:

FILE_UPLOAD_HANDLERS = ['django.core.files.uploadhandler.TemporaryFileUploadHandler',]
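If changing the upload handlers for the whole project is not an option, another possibility is to copy in-memory uploads to disk yourself. The following is only a sketch under stated assumptions: `path_for_dask` is a hypothetical helper name, and it relies only on the uploaded file object offering either `temporary_file_path()` (files already on disk) or `chunks()` (which Django's uploaded file objects provide).

```python
import tempfile


def path_for_dask(uploaded, suffix=".csv"):
    """Return a filesystem path for an uploaded file (hypothetical helper)."""
    # Uploads already written to disk expose temporary_file_path().
    if hasattr(uploaded, "temporary_file_path"):
        return uploaded.temporary_file_path()
    # In-memory uploads have no path, so copy their bytes to a named temp file.
    # delete=False keeps the file after it is closed; the caller must remove it.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        for chunk in uploaded.chunks():
            tmp.write(chunk)
        return tmp.name
```

In the view this would be used as dd.read_csv(path_for_dask(request.FILES['file_name'])). Because Dask reads the data lazily, the temp file should only be deleted after the computation has finished.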
