How to integrate Google Cloud SQL with Google BigQuery


Problem description

I am designing a solution in which Google Cloud SQL stores all the data from the regular functioning of the app (OLTP-style data). The data is expected to grow quite large over time. It is relational in nature, which is why we chose Cloud SQL over Cloud Datastore.

This data needs to be fed into BigQuery for analytics, ideally in near real time, although realistically some lag is expected. I am trying to design a solution that keeps this lag to a minimum.

My question has 3 parts:

1. Should I store the data in Cloud SQL and then move it to BigQuery, or change the basic design and store the data in BigQuery from the start? Is BigQuery suitable for regular, low-latency OLTP workloads? (I don't think so. Is my assumption correct?)

2. What is the recommended best practice for loading Cloud SQL data into BigQuery so that the integration works in near real time?

3. Is Cloud Dataflow a good option? If I connect Cloud SQL to Cloud Dataflow and from there to BigQuery, will it work? Or is there a better way to achieve this (as asked in question 2)?

Solution

Take a look at how WePay does this: https://wecode.wepay.com/posts/bigquery-wepay

The MySQL to GCS operator executes a SELECT query against a MySQL table. The SELECT pulls all data greater than (or equal to) the last high watermark. The high watermark is either the primary key of the table (if the table is append-only), or a modification timestamp column (if the table receives updates). Again, the SELECT statement also goes back a bit in time (or rows) to catch potentially dropped rows from the last query (due to the issues mentioned above).

With Airflow they manage to keep BigQuery synchronized to their MySQL database every 15 minutes.
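
The high-watermark extraction described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not WePay's actual operator: the standard-library `sqlite3` module stands in for MySQL / Cloud SQL, the `payments` table and its columns are hypothetical, and a plain dict keyed by primary key stands in for the BigQuery table. In a real pipeline the extracted rows would be exported to GCS and loaded into BigQuery instead.

```python
import sqlite3


def extract_changed_rows(conn, last_watermark, lookback=0):
    """Select rows modified at or after (last_watermark - lookback).

    The lookback re-reads a small window before the watermark so that
    rows missed by the previous run can still be picked up.
    """
    since = last_watermark - lookback
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM payments "
        "WHERE updated_at >= ? ORDER BY updated_at, id",
        (since,),
    ).fetchall()
    # The watermark never moves backwards, even if only old rows came back.
    new_watermark = max([last_watermark] + [row[2] for row in rows])
    return rows, new_watermark


def merge_rows(target, rows):
    """Upsert by primary key so rows re-read in the overlap are not duplicated."""
    for row_id, amount, updated_at in rows:
        target[row_id] = (amount, updated_at)


def demo():
    # Hypothetical source table; updated_at is an integer timestamp here.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments ("
        "id INTEGER PRIMARY KEY, amount INTEGER, updated_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO payments VALUES (?, ?, ?)",
        [(1, 100, 10), (2, 250, 20)],
    )

    target = {}  # stand-in for the analytics-side table

    # First run: no watermark yet, so everything is pulled.
    rows, watermark = extract_changed_rows(conn, last_watermark=0)
    merge_rows(target, rows)

    # Between runs: one row is updated and one is inserted.
    conn.execute("UPDATE payments SET amount = 120, updated_at = 30 WHERE id = 1")
    conn.execute("INSERT INTO payments VALUES (3, 75, 25)")

    # Second run: pull from the watermark, overlapping by 5 time units.
    rows, watermark = extract_changed_rows(conn, watermark, lookback=5)
    merge_rows(target, rows)
    return target, watermark


if __name__ == "__main__":
    print(demo())
```

Because the second run overlaps the first, row 2 is read twice; merging by primary key keeps the destination free of duplicates. Running such a job on a fixed schedule (every 15 minutes in the answer above) bounds the lag at roughly one interval.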
