How to load just one chosen file of a way too large Kaggle dataset from Kaggle into Colab


Problem description


If I want to switch from a Kaggle notebook to a Colab notebook, I can download the notebook from Kaggle and open it in Google Colab. The problem with this is that you would normally also need to download and upload the Kaggle dataset, which is quite an effort.

    If you have a small dataset or if you need just a smaller file of a dataset, you can put the datasets into the same folder structure that the Kaggle notebook expects. Thus, you will need to create that structure in Google Colab, like kaggle/input/ or whatever, and upload it there. That is not the issue.
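
As a minimal sketch, recreating such a structure and uploading a small file into it could look like this (the /kaggle/input/my-dataset path is a placeholder; use whatever path your notebook actually reads from):

    # Rebuild the Kaggle-style input structure in Colab and move an
    # uploaded file into it. 'my-dataset' is a hypothetical folder name.
    import os
    from google.colab import files

    os.makedirs('/kaggle/input/my-dataset', exist_ok=True)

    uploaded = files.upload()  # opens a file picker; returns {filename: bytes}
    for name in uploaded:
        os.replace(name, f'/kaggle/input/my-dataset/{name}')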

    If you have a large dataset, though, you can either:

• mount your Google Drive and use the dataset / file from there,

• or you follow a how-to guide for loading the data from Kaggle directly. Quoting such a guide:

Please follow the steps below to download and use kaggle data within Google Colab:

1. Go to your Kaggle account, scroll to the API section, and click "Expire API Token" to remove previous tokens.

2. Click on "Create New API Token" - it will download a kaggle.json file to your machine.

    3. Go to your Google Colab project file and run the following commands:

    1.    ! pip install -q kaggle
      

    2. Choose the kaggle.json file that you downloaded

      from google.colab import files
      
      files.upload()
      

3. Make a directory named .kaggle in your home directory and copy the kaggle.json file there.

      ! mkdir ~/.kaggle
      
      ! cp kaggle.json ~/.kaggle/
      

4. Change the permissions of the file (the Kaggle client warns if the key file is readable by other users).

      ! chmod 600 ~/.kaggle/kaggle.json
      

5. That's all! You can check whether everything is okay by running this command.

      ! kaggle datasets list
      

    Download Data

       ! kaggle competitions download -c 'name-of-competition'
    

    Or if you want to download datasets (taken from a comment):

    ! kaggle datasets download -d USERNAME/DATASET_NAME
    

You can get these dataset names (if unclear) from "Copy API command" in the three-dots drop-down next to the "New Notebook" button on the Kaggle dataset page.

    And here comes the issue: This seems to work only on smaller datasets. I have tried it on

    kaggle datasets download -d allen-institute-for-ai/CORD-19-research-challenge
    

    and it does not find that API, probably because downloading 40 GB of data is just restricted: 404 - Not Found.

    In such a case, you can only download the needed file and use the mounted Google Drive, or you need to use Kaggle instead of Colab.

    Is there a way to download into Colab only the 800 MB metadata.csv file of the 40 GB CORD-19 Kaggle dataset? Here is the link to the file's information page:

    https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge?select=metadata.csv

I have now loaded the file in Google Drive, and I am curious whether that is already the best approach. It is quite a lot of effort when, in contrast, on Kaggle the whole dataset is already available, needs no download, and loads quickly.

PS: After downloading the zip file from Kaggle to Colab, it needs to be extracted. Quoting the guide further:

    Use unzip command to unzip the data:

    For example, create a directory named train,

       ! mkdir train
    

and unzip the train data there,

       ! unzip train.zip -d train
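
If you prefer to stay in Python, the standard-library zipfile module does the same:

    # Same as the shell commands above, using only the standard library.
    import zipfile

    with zipfile.ZipFile('train.zip') as zf:
        zf.extractall('train')  # the target directory is created if missing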
    

    Update: I recommend mounting Google Drive

After trying both ways (mounting Google Drive or loading directly from Kaggle), I recommend mounting Google Drive if your architecture allows it. The advantage is that the file needs to be uploaded only once: Google Colab and Google Drive are directly connected. Mounting Google Drive costs you the one-time steps of downloading the file from Kaggle, unzipping it, and uploading it to Google Drive, plus fetching and activating a token in each Python session to mount the drive, but the token step is done quickly. With Kaggle, you would instead upload the file from Kaggle to Google Colab in every session, which takes more time and traffic.
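
For reference, mounting Google Drive in Colab is short; the token/consent prompt mentioned above appears when the cell runs (the CSV path is an assumption, adjust it to wherever you stored the file):

    # Mount Google Drive into the Colab filesystem.
    from google.colab import drive
    drive.mount('/content/drive')

    # Then read the previously uploaded file straight from Drive.
    import pandas as pd
    df = pd.read_csv('/content/drive/MyDrive/cord19/metadata.csv')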

Solution

    You could write a script that downloads only certain files or the files one after the other:

    import os

    # Kaggle API credentials (placeholders; use the values from your kaggle.json)
    os.environ['KAGGLE_USERNAME'] = "YOUR_USERNAME_HERE"
    os.environ['KAGGLE_KEY'] = "YOUR_TOKEN_HERE"

    # List all files contained in the dataset ...
    !kaggle datasets files allen-institute-for-ai/CORD-19-research-challenge

    # ... and then download only the one file you need via -f
    !kaggle datasets download allen-institute-for-ai/CORD-19-research-challenge -f metadata.csv
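
If you would rather script this in Python than call the CLI, the kaggle package exposes the same operations. A sketch, assuming the current client methods (names may vary between package versions):

    # Sketch: list the dataset's files, then fetch them one after the other.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()  # reads KAGGLE_USERNAME / KAGGLE_KEY set above

    dataset = 'allen-institute-for-ai/CORD-19-research-challenge'
    for f in api.dataset_list_files(dataset).files:
        api.dataset_download_file(dataset, f.name, path='data')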
    
