Import Kaggle CSV from download URL to pandas DataFrame


Question

I've been trying different methods to import the SpaceX missions CSV file on Kaggle directly into a pandas DataFrame, without any success.

I need to send a request to log in first. This is what I have so far:

import requests
import pandas as pd
from io import StringIO

# Link to the Kaggle data set & name of zip file
login_url = 'http://www.kaggle.com/account/login?ReturnUrl=/spacex/spacex-missions/downloads/database.csv'

# Kaggle Username and Password
kaggle_info = {'UserName': "user", 'Password': "pwd"}

# Login to Kaggle and retrieve the data.
r = requests.post(login_url, data=kaggle_info, stream=True)
df = pd.read_csv(StringIO(r.text))

r returns the HTML content of the page, and df = pd.read_csv(url) fails with a parser error: CParserError: Error tokenizing data. C error: Expected 1 fields in line 13, saw 6

I've searched for a solution, but so far nothing I've tried has worked.

Recommended answer

You are creating a stream and passing it directly to pandas. I think you need to pass a file-like object to pandas. Take a look at this answer for a possible solution (you would use POST rather than GET in the request, though).
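
To make "file-like object" concrete, here is a minimal sketch, assuming the response body really is CSV data. The helper name csv_bytes_to_frame and the sample columns are made up for illustration, and the bytes stand in for r.content from a successful download. Note that StringIO(r.text) in the question is already file-like, so the tokenizing error more likely comes from the body being an HTML login page; the guard below is only a rough heuristic for that case.

```python
import io

import pandas as pd


def csv_bytes_to_frame(payload: bytes) -> pd.DataFrame:
    """Wrap raw CSV bytes in a file-like object and parse them with pandas."""
    # Rough heuristic (an assumption, not a Kaggle API): an HTML page
    # starts with "<", which a normal CSV header line would not.
    if payload.lstrip()[:1] == b"<":
        raise ValueError("response looks like HTML, not CSV")
    # io.BytesIO exposes read(), so pandas treats it like an open file.
    return pd.read_csv(io.BytesIO(payload))


# Stand-in for r.content from a successful download (made-up data).
sample = b"flight,payload_kg\n1,500\n2,525\n"
df = csv_bytes_to_frame(sample)
print(df.shape)
```

If the guard raises, the request was probably answered with the login page rather than the file, which matches what the question describes.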

I also think the login URL with the redirect that you use does not work as written. I know I suggested that here, but I ended up not using it because (I suspect) the POST request did not handle the redirect.

The code I ended up using in my project was this:

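import requests
from os import path

import config  # project-specific settings module (kaggle_username, kaggle_password, raw_data_dir)


def from_kaggle(data_sets, competition):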
    """Fetches data from Kaggle

    Parameters
    ----------
    data_sets : (array)
        list of dataset filenames on kaggle. (e.g. train.csv.zip)

    competition : (string)
        name of kaggle competition as it appears in url
        (e.g. 'rossmann-store-sales')

    """
    kaggle_dataset_url = "https://www.kaggle.com/c/{}/download/".format(competition)

    KAGGLE_INFO = {'UserName': config.kaggle_username,
                   'Password': config.kaggle_password}

    for data_set in data_sets:
        data_url = path.join(kaggle_dataset_url, data_set)
        data_output = path.join(config.raw_data_dir, data_set)
        # Attempts to download the CSV file. Gets rejected because we are not logged in.
        r = requests.get(data_url)
        # Login to Kaggle and retrieve the data.
        r = requests.post(r.url, data=KAGGLE_INFO, stream=True)
        # Writes the data to a local file one chunk at a time.
        with open(data_output, 'wb') as f:
            # Reads 512KB at a time into memory
            for chunk in r.iter_content(chunk_size=(512 * 1024)):
                if chunk: # filter out keep-alive new chunks
                    f.write(chunk)

Example usage:

sets = ['train.csv.zip',
        'test.csv.zip',
        'store.csv.zip',
        'sample_submission.csv.zip',]
from_kaggle(sets, 'rossmann-store-sales')

You might need to unzip the files.

import zipfile


def _unzip_folder(destination):
    """Unzip without regard to the folder structure.

    Parameters
    ----------
    destination : (str)
        Local path and filename of the zip file to extract.
    """
    with zipfile.ZipFile(destination, "r") as z:
        z.extractall(config.raw_data_dir)

So I never really loaded it directly into the DataFrame, but rather stored it to disk first. You could modify the code to use a temp directory and delete the files after you read them.
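
The temp-directory idea could look roughly like the sketch below. It assumes the zip archive has already been downloaded; read_zipped_csv is a hypothetical helper, not part of the project code above, and the demo archive with its Store and Sales columns is made-up data standing in for a file such as train.csv.zip.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd


def read_zipped_csv(zip_path, member=None):
    """Extract one CSV from a zip archive into a temp dir and load it."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(zip_path, "r") as z:
            # Default to the first file in the archive if none is named.
            name = member or z.namelist()[0]
            extracted = z.extract(name, tmp_dir)
        # Read while the temp dir still exists; it is deleted on exit.
        return pd.read_csv(extracted)


# Demo with a throwaway archive (made-up data).
with tempfile.TemporaryDirectory() as demo_dir:
    demo_zip = Path(demo_dir) / "train.csv.zip"
    with zipfile.ZipFile(demo_zip, "w") as z:
        z.writestr("train.csv", "Store,Sales\n1,5263\n2,6064\n")
    df = read_zipped_csv(demo_zip)
print(df.shape)
```

When the archive holds a single CSV, pd.read_csv(zip_path, compression='zip') should also work and avoids extracting anything to disk.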
