How to schedule an export from a BigQuery table to Cloud Storage?

Problem description

I have successfully scheduled my query in BigQuery, and the result is saved as a table in my dataset. I see a lot of information about scheduling data transfer in to BigQuery or Cloud Storage, but I haven't found anything regarding scheduling an export from a BigQuery table to Cloud Storage yet.

Is it possible to schedule an export of a BigQuery table to Cloud Storage so that I can further schedule having it SFTP-ed to me via Google BigQuery Data Transfer Services?

Recommended answer

There isn't a managed service for scheduling BigQuery table exports, but one viable approach is to use Cloud Functions in conjunction with Cloud Scheduler.

The Cloud Function would contain the necessary code to export to Cloud Storage from the BigQuery table. There are multiple programming languages to choose from for that, such as Python, Node.JS, and Go.

Cloud Scheduler would periodically send an HTTP call, on a cron-defined schedule, to the Cloud Function, which would in turn get triggered and run the export programmatically.

As an example and more specifically, you can follow these steps:

  1. Create a Cloud Function using Python with an HTTP trigger. To interact with BigQuery from within the code, you need to use the BigQuery client library. Import it with from google.cloud import bigquery. Then, you can use the following code in main.py to create an export job from BigQuery to Cloud Storage:

    # Imports the BigQuery client library
    from google.cloud import bigquery

    def hello_world(request):
        # Replace these values according to your project
        project_name = "YOUR_PROJECT_ID"
        bucket_name = "YOUR_BUCKET"
        dataset_name = "YOUR_DATASET"
        table_name = "YOUR_TABLE"
        # The export will land in the bucket as a GZIP-compressed CSV file
        destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

        bq_client = bigquery.Client(project=project_name)

        # Reference to the table that will be exported
        dataset = bq_client.dataset(dataset_name, project=project_name)
        table_to_export = dataset.table(table_name)

        job_config = bigquery.job.ExtractJobConfig()
        job_config.compression = bigquery.Compression.GZIP

        # extract_table starts an asynchronous extract job; the function
        # returns as soon as the job has been submitted
        extract_job = bq_client.extract_table(
            table_to_export,
            destination_uri,
            # Location must match that of the source table.
            location="US",
            job_config=job_config,
        )
        return "Job with ID {} started exporting data from {}.{} to {}".format(
            extract_job.job_id, dataset_name, table_name, destination_uri
        )

Specify the client library dependency in the requirements.txt file by adding this line:

google-cloud-bigquery
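
Note that extract_table only submits the job, so the HTTP response above reports that the export has started, not that it has finished. If the table is small enough for the extract to complete within the Cloud Function's timeout and you prefer the response to reflect the final outcome, a minimal variation (not part of the original answer) is to replace the final return of hello_world with a blocking wait on the job:

        # Variation: wait for the extract job to finish. result() blocks
        # until the job is done and raises an exception if the job failed.
        extract_job.result()
        return "Job {} finished exporting {}.{} to {}".format(
            extract_job.job_id, dataset_name, table_name, destination_uri
        )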

  2. Create a Cloud Scheduler job. Set the Frequency you wish for the job to be executed with. For instance, setting it to 0 1 * * 0 would run the job once a week at 1 AM every Sunday morning. The crontab tool is pretty useful when it comes to experimenting with cron scheduling.

    Choose HTTP as the Target, set the URL to the Cloud Function's URL (it can be found by selecting the Cloud Function and navigating to the Trigger tab), and choose GET as the HTTP method.

    Once created, and by pressing the RUN NOW button, you can test how the export behaves. However, before doing so, make sure the default App Engine service account has at least the Cloud IAM roles/storage.objectCreator role, otherwise the operation might fail with a permission error. The default App Engine service account has the form YOUR_PROJECT_ID@appspot.gserviceaccount.com.
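
    If you prefer to trigger a test run from your own machine instead of the console, a rough sketch (assuming the function allows unauthenticated invocations; otherwise an identity token would have to be attached to the request) is:

        import requests

        # FUNCTION_URL is a placeholder for the URL shown in the Trigger tab
        response = requests.get("FUNCTION_URL")
        print(response.status_code, response.text)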

    If you wish to execute exports on different tables, datasets, and buckets for each execution, while essentially employing the same Cloud Function, you can use the HTTP POST method instead and configure a Body containing said parameters as data, which would be passed on to the Cloud Function, although that would imply some small changes in its code.
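
    As an illustrative sketch of that change (not part of the original answer), the function could read the parameters from the POST body with Flask's request.get_json(); the JSON keys used here (bucket, dataset, table) are hypothetical and just need to match whatever you put in the Scheduler job's Body:

        def hello_world(request):
            # Hypothetical JSON body sent by Cloud Scheduler, e.g.
            # {"bucket": "YOUR_BUCKET", "dataset": "YOUR_DATASET", "table": "YOUR_TABLE"}
            params = request.get_json(silent=True) or {}
            bucket_name = params.get("bucket", "YOUR_BUCKET")
            dataset_name = params.get("dataset", "YOUR_DATASET")
            table_name = params.get("table", "YOUR_TABLE")
            # ...the rest of the export code stays the same as in main.py above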

    Lastly, when the job is created, you can use the Cloud Function's returned job ID and the bq CLI to view the status of the export job with bq show -j <job_id>.
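
    If you would rather check the status from Python instead of the bq CLI, a minimal sketch using the same client library (JOB_ID being the ID returned by the Cloud Function, and the location matching the one used in the export) is:

        from google.cloud import bigquery

        bq_client = bigquery.Client(project="YOUR_PROJECT_ID")
        job = bq_client.get_job("JOB_ID", location="US")
        print(job.job_type, job.state)  # e.g. "extract DONE"
        if job.error_result:
            print(job.error_result)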
