Efficiently write a Pandas dataframe to Google BigQuery


Question

I'm trying to upload a pandas.DataFrame to Google Big Query using the pandas.DataFrame.to_gbq() function documented at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_gbq.html. The problem is that to_gbq() takes 2.3 minutes, while uploading directly to Google Cloud Storage takes less than a minute. I'm planning to upload a bunch of dataframes (~32), each of a similar size, so I want to know which is the faster alternative.

This is the script that I'm using:

dataframe.to_gbq('my_dataset.my_table', 
                 'my_project_id',
                 chunksize=None, # I have tried with several chunk sizes, it runs faster when it's one big chunk (at least for me)
                 if_exists='append',
                 verbose=False
                 )

dataframe.to_csv(str(month) + '_file.csv') # the file size is 37.3 MB; this takes almost 2 seconds
# then manually upload the file via the GCS web UI
print(dataframe.shape)
# (363364, 21)

My question is, which is faster?

  1. Upload the Dataframe using the pandas.DataFrame.to_gbq() function
  2. Save the Dataframe as a CSV and then upload it as a file to BigQuery using the Python API
  3. Save the Dataframe as a CSV, upload the file to Google Cloud Storage using this procedure, and then load it into BigQuery (a sketch of this approach is shown below)
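
Since the link behind "this procedure" is not reproduced here, the following is only a minimal sketch of what alternative 3 could look like with the google-cloud-storage and google-cloud-bigquery clients; the bucket, dataset, table, and file names are placeholders, and the CSV is assumed to have a header row:

from google.cloud import bigquery
from google.cloud import storage

# Placeholder names -- replace with your own bucket, dataset, table, and file.
bucket_name = 'my-staging-bucket'
dataset_id = 'my_dataset'
table_id = 'my_table'
source_file_name = '1_file.csv'

# Step 1: upload the local CSV to Google Cloud Storage.
storage_client = storage.Client()
blob = storage_client.bucket(bucket_name).blob(source_file_name)
blob.upload_from_filename(source_file_name)

# Step 2: run a BigQuery load job that reads the file from GCS.
bigquery_client = bigquery.Client()
table_ref = bigquery_client.dataset(dataset_id).table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1  # skip the CSV header written by to_csv()
job_config.autodetect = True
load_job = bigquery_client.load_table_from_uri(
    'gs://{}/{}'.format(bucket_name, source_file_name),
    table_ref,
    job_config=job_config)
load_job.result()  # wait for the load job to finish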

Update:

Alternative 1 seems to be faster than Alternative 2 (using pd.DataFrame.to_csv() and load_data_from_file(), Alternative 2 took about 17.9 seconds longer on average over 3 runs):

from google.cloud import bigquery


def load_data_from_file(dataset_id, table_id, source_file_name):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)

    with open(source_file_name, 'rb') as source_file:
        # This example uses CSV, but you can use other formats.
        # See https://cloud.google.com/bigquery/loading-data
        job_config = bigquery.LoadJobConfig()
        job_config.source_format = bigquery.SourceFormat.CSV
        job_config.autodetect = True
        job = bigquery_client.load_table_from_file(
            source_file, table_ref, job_config=job_config)

    job.result()  # Waits for the load job to complete

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_id, table_id))
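
The helper above could then be timed with a call along these lines (the dataset, table, and file names here are placeholders, not the ones used for the quoted measurement):

import time

start = time.time()
load_data_from_file('my_dataset', 'my_table', '1_file.csv')
print("time alternative 2 " + str(time.time() - start))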

Answer

I compared alternatives 1 and 3 in Datalab using the following code:

from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
from pandas import DataFrame
import time

# Dataframe to write (tuples, so pandas maps each row onto the three columns)
my_data = [(1, 2, 3)]
for i in range(0, 100000):
    my_data.append((1, 2, 3))
not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a', 'b', 'c'])

#Alternative 1
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable', 
                 Context.default().project_id,
                 chunksize=10000, 
                 if_exists='append',
                 verbose=False
                 )
end = time.time()
print("time alternative 1 " + str(end - start))

#Alternative 3
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define the storage bucket and create it if it does not already exist
sample_bucket = storage.Bucket(sample_bucket_name)
if not sample_bucket.exists():
    sample_bucket.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)
table.create(schema=table_schema, overwrite=True)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))

Here are the results for n = {10000, 100000, 1000000}:

n       alternative_1  alternative_3
10000   30.72s         8.14s
100000  162.43s        70.64s
1000000 1473.57s       688.59s

Judging from the results, alternative 3 is faster than alternative 1.
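
One additional note for the ~32 dataframes mentioned in the question: a BigQuery load job accepts multiple GCS URIs (or a single wildcard URI), so all of the monthly CSVs could be staged in Cloud Storage first and then loaded in one job. A minimal sketch, again with placeholder bucket, dataset, and table names and assuming the staged objects share a common prefix such as monthly/1_file.csv:

from google.cloud import bigquery

bigquery_client = bigquery.Client()
table_ref = bigquery_client.dataset('my_dataset').table('my_table')

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1  # skip the CSV header in every file
job_config.autodetect = True

# A single wildcard URI matches every staged monthly CSV at once.
load_job = bigquery_client.load_table_from_uri(
    'gs://my-staging-bucket/monthly/*.csv', table_ref, job_config=job_config)
load_job.result()
print('Loaded {} rows.'.format(load_job.output_rows))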
