Google Dataflow: insert + update in BigQuery in a streaming pipeline

Question

Main object

A Python streaming pipeline in which I read the input from Pub/Sub.

After the input is analyzed, two options are available:

  • If x = 1 -> insert
  • If x = 2 -> update

Test

  • This cannot be done with the Apache Beam functions, so it has to be developed against the 0.25 API of BigQuery (currently the version supported in Google Dataflow).

Problem

  • The inserted records are still in the BigQuery streaming buffer, so the UPDATE statement fails:

     UPDATE or DELETE statement over table table would affect rows in the streaming buffer, which is not supported

Code

Insert

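def insertCanonicalBQ(input):
    from google.cloud import bigquery
    client = bigquery.Client(project='project')
    dataset = client.dataset('dataset')
    table = dataset.table('table')
    table.reload()  # Loads the table metadata, including its schema.
    # `values` is a placeholder for the row's column values; it is not defined in this snippet.
    table.insert_data(
        rows=[[values]])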

Update

def UpdateBQ(input):
    from google.cloud import bigquery
    import uuid
    import time
    client = bigquery.Client()
    STD = "#standardSQL"
    QUERY = STD + "\n" + """UPDATE table SET field1 = 'XXX' WHERE field2 = 'YYY'"""
    client.use_legacy_sql = False
    query_job = client.run_async_query(query=QUERY, job_name='temp-query-job_{}'.format(uuid.uuid4()))  # API request
    query_job.begin()
    while True:
        query_job.reload()  # Refreshes the state via a GET request.
        if query_job.state == 'DONE':
            if query_job.error_result:
                raise RuntimeError(query_job.errors)
            print("done")
            return input
        time.sleep(1)  # Wait one second before polling again.

Recommended answer

Even if the row weren't in the streaming buffer, this still wouldn't be the way to approach this problem in BigQuery. BigQuery storage is better suited to bulk mutations than to mutating individual entities like this via UPDATE. Your pattern is aligned with something I'd expect from a transactional rather than an analytical use case.

Consider an append-based pattern for this. Each time you process an entity message, write it to BigQuery via streaming insert. Then, when needed, you can get the latest version of all entities via a query.
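As a rough illustration of the append-based pattern, the sketch below turns every incoming message into a new row, whether it was originally meant as an insert (x = 1) or an update (x = 2). It is only a sketch under assumptions: the JSON payload layout, the helper name to_append_row, and the column field1 are hypothetical, and the field names idfield and message_time follow the example schema used in this answer. The actual streaming-insert call is left out on purpose.

```python
import json


def to_append_row(message_bytes):
    """Turn one Pub/Sub payload into an append-only row (hypothetical helper).

    "Insert" (x = 1) and "update" (x = 2) messages are treated the same way:
    each becomes a new row, and nothing already in the table is modified.
    """
    payload = json.loads(message_bytes.decode("utf-8"))
    return {
        "idfield": payload["idfield"],            # assumed unique entity key
        "message_time": payload["message_time"],  # assumed emission time of the message
        "field1": payload.get("field1"),          # hypothetical payload column
    }


# Two messages about the same entity: a creation followed by a change.
created = to_append_row(
    b'{"x": 1, "idfield": "YYY", "message_time": "2024-01-01T10:00:00Z", "field1": "AAA"}'
)
updated = to_append_row(
    b'{"x": 2, "idfield": "YYY", "message_time": "2024-01-01T11:00:00Z", "field1": "XXX"}'
)

# Both rows would be streamed into the table; the UPDATE statement is no longer needed.
rows_to_stream = [created, updated]
```

Because the pipeline only ever appends, the streaming-buffer restriction on UPDATE and DELETE no longer applies to it.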

As an example, let's assume an arbitrary schema: idfield is your unique entity key/identifier, and message_time represents the time the message was emitted. Your entities may have many other fields. To get the latest version of the entities, we could run the following query (and possibly write the result to another table):

#standardSQL
SELECT
  idfield,
  ARRAY_AGG(
    t ORDER BY message_time DESC LIMIT 1
  )[OFFSET(0)].* EXCEPT (idfield)
FROM `myproject.mydata.mytable` AS t
GROUP BY idfield

An additional advantage of this approach is that it also allows you to perform analysis at arbitrary points in time. Analyzing the entities as of their state an hour ago would simply involve adding a WHERE clause: WHERE message_time <= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
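To make the semantics of the query and of the extra WHERE clause concrete, here is a small in-memory model in Python. It is not how BigQuery executes the query, only a sketch of the same logic: the function name latest_versions, the as_of parameter, and the sample rows are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def latest_versions(rows, as_of=None):
    """Return {idfield: newest row}, optionally ignoring rows newer than `as_of`."""
    latest = {}
    for row in rows:
        # Mirrors the optional WHERE message_time <= ... filter.
        if as_of is not None and row["message_time"] > as_of:
            continue
        current = latest.get(row["idfield"])
        # Mirrors ORDER BY message_time DESC LIMIT 1 within each idfield group.
        if current is None or row["message_time"] > current["message_time"]:
            latest[row["idfield"]] = row
    return latest


# A fixed "current" time keeps the example deterministic.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"idfield": "a", "message_time": now - timedelta(hours=3), "field1": "created"},
    {"idfield": "a", "message_time": now - timedelta(minutes=10), "field1": "updated"},
    {"idfield": "b", "message_time": now - timedelta(hours=2), "field1": "created"},
]

current_state = latest_versions(rows)
state_an_hour_ago = latest_versions(rows, as_of=now - timedelta(hours=1))
```

In this sample, entity "a" resolves to its "updated" row in current_state, while state_an_hour_ago still sees the "created" row because the update arrived only ten minutes before the fixed reference time.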
