Read csv from Google Cloud storage to pandas dataframe


Problem description


I am trying to read a csv file from a Google Cloud Storage bucket into a pandas dataframe.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)

It shows this error message:

FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist

What am I doing wrong? I am not able to find any solution that does not involve google datalab.

Solution

UPDATE

As of version 0.24 of pandas, read_csv supports reading directly from Google Cloud Storage. Simply provide a link to the bucket like this:

df = pd.read_csv('gs://bucket/your_path.csv')

read_csv will then use the gcsfs module to read the DataFrame, which means gcsfs has to be installed (otherwise you will get an exception pointing at the missing dependency).
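
If you prefer a clearer failure message, here is a minimal, purely optional sketch that checks for gcsfs before calling read_csv (the bucket path is the same placeholder as above):

import pandas as pd

# Optional sanity check: pd.read_csv('gs://...') relies on gcsfs being importable.
try:
    import gcsfs  # noqa: F401  -- imported only to verify it is installed
except ImportError as exc:
    raise ImportError("Reading gs:// paths with pandas requires the gcsfs package") from exc

df = pd.read_csv('gs://bucket/your_path.csv')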

I also leave three other options here for the sake of completeness.

  • Home-made code
  • gcsfs
  • dask

I will cover them below.

The hard way: do-it-yourself code

I have written some convenience functions to read from Google Storage. To make them more readable I added type annotations. If you happen to be on Python 2, simply remove them and the code will work all the same.

It works equally well on public and private data sets, assuming you are authorised. With this approach you don't need to download the data to your local drive first.

How to use it:

fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)

The code:

from io import BytesIO, StringIO
from google.cloud import storage
from google.oauth2 import service_account

def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: str = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: str = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    s = blob.download_as_string()
    return s


def _get_blob(bucket_name, path, project, service_account_credentials_path):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
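
For completeness, here is a minimal usage sketch for get_bytestring as well (the project, bucket and file names below are placeholders); this is where the StringIO import above comes in handy:

import pandas as pd
from io import StringIO

# Placeholder names: replace 'my-project', 'my-bucket' and 'data.csv' with your own.
raw = get_bytestring('my-project', 'my-bucket', 'data.csv')
df = pd.read_csv(StringIO(raw.decode('utf-8')))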

gcsfs

gcsfs is a "Pythonic file-system for Google Cloud Storage".

How to use it:

import pandas as pd
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)
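
For a private bucket you may need to point gcsfs at explicit credentials. A sketch, assuming a service-account JSON key (the path below is a placeholder; gcsfs also accepts other token types):

import pandas as pd
import gcsfs

# 'token' can point at a service-account key file; the path is a placeholder.
fs = gcsfs.GCSFileSystem(project='my-project',
                         token='/path/to/service-account.json')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)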

dask

Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It's great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandas API, making it easy to use for newcomers.

Dask provides its own read_csv, which can read directly from GCS.

How to use it:

import dask.dataframe as dd

df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv') # nice!

# df is now a Dask dataframe, ready for distributed processing
# If you want the pandas version, simply:
df_pd = df.compute()
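
Keep in mind that Dask is lazy: operations on df only build a task graph, and nothing is read until you call compute(). A small sketch with hypothetical column names:

# Hypothetical column names; nothing is actually read until .compute() runs.
mean_per_group = df.groupby('some_column')['some_value'].mean()
result = mean_per_group.compute()  # triggers the (possibly parallel) read and aggregation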
