How to schedule an export from a BigQuery table to Cloud Storage?

Question

I have successfully scheduled my query in BigQuery, and the result is saved as a table in my dataset. I see a lot of information about scheduling data transfers into BigQuery or Cloud Storage, but I haven't found anything yet about scheduling an export from a BigQuery table to Cloud Storage.

Is it possible to schedule an export of a BigQuery table to Cloud Storage so that I can further schedule having it SFTP-ed to me via Google BigQuery Data Transfer Services?

Answer

There isn't a managed service for scheduling BigQuery table exports, but one viable approach is to use Cloud Functions in conjunction with Cloud Scheduler.

The Cloud Function would contain the necessary code to export to Cloud Storage from the BigQuery table. There are multiple programming languages to choose from for that, such as Python, Node.js, and Go.

Cloud Scheduler would periodically send an HTTP call, on a cron schedule, to the Cloud Function, which would in turn be triggered and run the export programmatically.

As an example and more specifically, you can follow these steps:

  1. Create a Cloud Function using Python with an HTTP trigger. To interact with BigQuery from within the code you need to use the BigQuery client library. Import it with from google.cloud import bigquery. Then, you can use the following code in main.py to create an export job from BigQuery to Cloud Storage:

    # Imports the BigQuery client library
    from google.cloud import bigquery

    def hello_world(request):
        # Replace these values according to your project
        project_name = "YOUR_PROJECT_ID" 
        bucket_name = "YOUR_BUCKET" 
        dataset_name = "YOUR_DATASET" 
        table_name = "YOUR_TABLE" 
        destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

        bq_client = bigquery.Client(project=project_name)

        # Build a reference to the table that will be exported
        dataset_ref = bigquery.DatasetReference(project_name, dataset_name)
        table_to_export = dataset_ref.table(table_name)

        job_config = bigquery.job.ExtractJobConfig()
        job_config.compression = bigquery.Compression.GZIP

        extract_job = bq_client.extract_table(
            table_to_export,
            destination_uri,
            # Location must match that of the source table.
            location="US",
            job_config=job_config,
        )  
        return "Job with ID {} started exporting data from {}.{} to {}".format(extract_job.job_id, dataset_name, table_name, destination_uri)

Specify the client library dependency in the requirements.txt file by adding this line:

google-cloud-bigquery
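
Before wiring up the scheduler, you could exercise the export logic locally as a quick smoke test. This is only a sketch; it assumes the code above lives in main.py and that Application Default Credentials are configured on your machine (e.g. via gcloud auth application-default login) so the BigQuery client can authenticate outside of Cloud Functions:

    # Local smoke test for the handler defined above (assumed to live in main.py).
    # Requires Application Default Credentials so the BigQuery client can
    # authenticate outside of Cloud Functions.
    from main import hello_world

    if __name__ == "__main__":
        # The handler ignores its request argument, so None is enough for a dry run.
        print(hello_world(None))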

  2. Create a Cloud Scheduler job. Set the Frequency you wish the job to be executed with. For instance, setting it to 0 1 * * 0 would run the job once a week, at 1 AM every Sunday morning. The crontab tool is pretty useful when it comes to experimenting with cron scheduling.

    Choose HTTP as the Target, set the URL to the Cloud Function's URL (it can be found by selecting the Cloud Function and navigating to the Trigger tab), and choose GET as the HTTP method.

    Once it is created, you can press the RUN NOW button to test how the export behaves. Before doing so, however, make sure the default App Engine service account has at least the Cloud IAM roles/storage.objectCreator role, or the operation might fail with a permission error. The default App Engine service account has the form YOUR_PROJECT_ID@appspot.gserviceaccount.com.

    If you wish to export different tables, datasets, and buckets on each execution, while essentially reusing the same Cloud Function, you can use the HTTP POST method instead and configure a Body containing those parameters as data, which would then be passed on to the Cloud Function. That implies making some small changes to its code; a minimal sketch of such a parameterized handler is shown below.
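
    As an illustration, here is a minimal sketch of such a parameterized handler. The JSON keys project, dataset, table, and bucket are hypothetical names chosen for this example, not something prescribed by Cloud Scheduler; the Scheduler job's Body simply needs to match whatever keys the function reads:

        # Minimal sketch of a parameterized export handler (hypothetical key names).
        from google.cloud import bigquery

        def export_table(request):
            # Cloud Scheduler is configured with HTTP method POST and a JSON body
            # such as {"project": "...", "dataset": "...", "table": "...", "bucket": "..."}.
            params = request.get_json(silent=True) or {}
            project_name = params["project"]
            dataset_name = params["dataset"]
            table_name = params["table"]
            bucket_name = params["bucket"]
            destination_uri = "gs://{}/{}_export.csv.gz".format(bucket_name, table_name)

            bq_client = bigquery.Client(project=project_name)
            table_to_export = bigquery.DatasetReference(project_name, dataset_name).table(table_name)

            job_config = bigquery.job.ExtractJobConfig()
            job_config.compression = bigquery.Compression.GZIP

            extract_job = bq_client.extract_table(
                table_to_export,
                destination_uri,
                location="US",  # must match the location of the source table
                job_config=job_config,
            )
            return "Job {} started exporting {}.{} to {}".format(
                extract_job.job_id, dataset_name, table_name, destination_uri
            )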

    Lastly, when the job is created, you can use the Cloud Function's returned job ID and the bq CLI to view the status of the export job with bq show -j <job_id>.
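
    If you prefer to poll the job from Python rather than the bq CLI, the client library's get_job method can retrieve the same information. A small sketch, assuming the job was created in the US location as in the code above:

        # Check the status of an extract job from Python (alternative to `bq show -j`).
        from google.cloud import bigquery

        def print_job_state(project_id, job_id):
            client = bigquery.Client(project=project_id)
            # The location must match the one the export job was created in.
            job = client.get_job(job_id, location="US")
            print("Job {} is {} (errors: {})".format(job.job_id, job.state, job.error_result))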
