Migrating from non-partitioned to partitioned tables


Question


In June the BQ team announced support for date-partitioned tables, but the guide is missing instructions on how to migrate old non-partitioned tables into the new style.

I am looking for a way to update several, if not all, tables to the new style.

Also, what other options are available outside of DAY-type partitioning? Does the BQ UI show this? I wasn't able to create such a new partitioned table from the BQ Web UI.

Solution

From Pavan's answer: Please note that this approach will charge you the scan cost of the source table as many times as you query it.


From Pentium10's comments: So suppose I have several years of data. I need to prepare a different query for each day and run all of them, and if I have 1000 days in history, do I need to pay 1000 times the full query price for scanning the source table?


As we can see, the main problem here is having a full scan for each and every day. The rest is less of a problem and can easily be scripted in any client of your choice.

So, below is the answer to: how to partition a table while avoiding a full table scan for each and every day?

The steps below show the approach.

It is generic enough to extend/apply to any real use case; meanwhile I am using bigquery-public-data.noaa_gsod.gsod2017 and limiting the "exercise" to just 10 days to keep it readable.

Step 1 – Create the pivot table
In this step we
a) compress each row's content into a record/array
and
b) put them all into the respective "daily" column

#standardSQL
SELECT
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170101' THEN r END) AS day20170101,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170102' THEN r END) AS day20170102,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170103' THEN r END) AS day20170103,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170104' THEN r END) AS day20170104,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170105' THEN r END) AS day20170105,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170106' THEN r END) AS day20170106,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170107' THEN r END) AS day20170107,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170108' THEN r END) AS day20170108,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170109' THEN r END) AS day20170109,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170110' THEN r END) AS day20170110
FROM (
  -- number the packed arrays within each day so they can be regrouped
  SELECT d, r, ROW_NUMBER() OVER(PARTITION BY d) AS line
  FROM (
    -- compress each (station, day) group of original rows into one array
    SELECT
      stn, CONCAT('day', year, mo, da) AS d, ARRAY_AGG(t) AS r
    FROM `bigquery-public-data.noaa_gsod.gsod2017` AS t
    GROUP BY stn, d
  )
)
GROUP BY line

Run the above query in the Web UI with pivot_table (or whatever name you prefer) as the destination table.

As we can see, we get a table with 10 columns: one column per day, where the schema of each column is a copy of the schema of the original table.
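
If you prefer to script this step rather than use the Web UI, a minimal sketch with the google-cloud-bigquery Python client could look like the following (myproject and mydataset are placeholders, and the exact call shapes are my assumption to verify against the current client docs):

from google.cloud import bigquery

# Minimal sketch: run the Step 1 query with pivot_table as the destination.
# Assumes google-cloud-bigquery is installed and authenticated;
# myproject and mydataset are placeholders.
client = bigquery.Client()

pivot_sql = """ ...the Step 1 query from above... """

job_config = bigquery.QueryJobConfig(
    destination="myproject.mydataset.pivot_table",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(pivot_sql, job_config=job_config).result()  # block until the job is done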

Step 2 – Process partitions one by one, scanning ONLY the respective column (no full table scan), and insert into the respective partition

#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20170101) AS r

Run the above query from the Web UI with a destination table named mytable$20170101 (the $YYYYMMDD suffix targets that day's partition).

You can run the same for the next day:

#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20170102) AS r

This time the destination table should be mytable$20170102, and so on.

You should be able to automate/script this step with any client of your choice; see the sketch below.
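
For example, a minimal sketch of that automation with the google-cloud-bigquery Python client (myproject, mydataset, and the date range are placeholders; the $YYYYMMDD decorator handling in the client is my assumption to verify):

import datetime

from google.cloud import bigquery

# Minimal sketch: copy each daily column of pivot_table into the matching
# partition of mytable. Assumes mytable already exists as a day-partitioned
# table; myproject and mydataset are placeholders.
client = bigquery.Client()
start = datetime.date(2017, 1, 1)

for i in range(10):  # the 10 days used in this exercise
    day = (start + datetime.timedelta(days=i)).strftime("%Y%m%d")
    sql = f"SELECT r.* FROM mydataset.pivot_table, UNNEST(day{day}) AS r"
    job_config = bigquery.QueryJobConfig(
        # the $YYYYMMDD decorator targets a single partition of the table
        destination=f"myproject.mydataset.mytable${day}",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client.query(sql, job_config=job_config).result()  # scans only one column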

There are many variations of how you can use the above approach; it is up to your creativity.

Note: BigQuery allows up to 10,000 columns per table, so 365 columns for the respective days of one year is definitely not a problem here :o) The one caveat is a possible limitation on how far back you can go with new partitions: I heard (but haven't had a chance to check yet) that it is currently no more than 90 days back.

Update

Please note: the version above has a little extra logic to pack all aggregated cells into as few final rows as possible.

ROW_NUMBER() OVER(PARTITION BY d) AS line
and then
GROUP BY line
along with
ARRAY_CONCAT_AGG(…)
does this
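
To see what that packing does, here is a toy illustration on inline data (not the NOAA table) that can be pasted into the Web UI as-is:

#standardSQL
SELECT
  line,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day1' THEN r END) AS day1,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day2' THEN r END) AS day2
FROM (
  SELECT d, r, ROW_NUMBER() OVER(PARTITION BY d) AS line
  FROM (
    SELECT 'day1' AS d, ['a1'] AS r UNION ALL
    SELECT 'day1', ['a2'] UNION ALL
    SELECT 'day2', ['b1']
  )
)
GROUP BY line

Three input rows are packed into two output rows here; without the line numbering and grouping, each input row would stay on its own output row, as in the adjusted version below.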

This works well when the row size in your original table is not that big, so the final combined row size will still be within the row size limit that BigQuery has (which I believe is 10 MB as of now).

If your source table already has a row size close to that limit, use the adjusted version below.

In this version, the grouping is removed such that each row has a value for only one column.

#standardSQL
SELECT
    CASE WHEN d = 'day20170101' THEN r END AS day20170101,
    CASE WHEN d = 'day20170102' THEN r END AS day20170102,
    CASE WHEN d = 'day20170103' THEN r END AS day20170103,
    CASE WHEN d = 'day20170104' THEN r END AS day20170104,
    CASE WHEN d = 'day20170105' THEN r END AS day20170105,
    CASE WHEN d = 'day20170106' THEN r END AS day20170106,
    CASE WHEN d = 'day20170107' THEN r END AS day20170107,
    CASE WHEN d = 'day20170108' THEN r END AS day20170108,
    CASE WHEN d = 'day20170109' THEN r END AS day20170109,
    CASE WHEN d = 'day20170110' THEN r END AS day20170110
FROM (
    -- compress each (station, day) group of original rows into one array
    SELECT
        stn, CONCAT('day', year, mo, da) AS d, ARRAY_AGG(t) AS r
    FROM `bigquery-public-data.noaa_gsod.gsod2017` AS t
    GROUP BY stn, d
)
-- keep only the 10 days used in this exercise
WHERE d BETWEEN 'day20170101' AND 'day20170110'

As you can now see, the pivot table (sparce_pivot_table) is sparse enough (the same 21.5 MB, but now 114,089 rows vs. 11,584 rows in pivot_table), so it has an average row size of 190 B vs. 1.9 KB in the initial version. That is about 10 times smaller, in line with the number of columns in the example.
So before using this approach, some math needs to be done to project/estimate what can be done and how!
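
Part of that math can be done for free with a dry run; for example, a minimal sketch with the Python client (the table name is a placeholder) reports how many bytes a candidate query would scan without actually running it:

from google.cloud import bigquery

# Minimal sketch: dry-run a Step 2 style query to see how many bytes it
# would process (and therefore roughly what it would cost).
client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

job = client.query(
    "SELECT r.* FROM mydataset.pivot_table, UNNEST(day20170101) AS r",
    job_config=job_config,
)
print(f"This query would process {job.total_bytes_processed} bytes.")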

Still: each cell in the pivot table is a sort of JSON representation of a whole row in the original table. It holds not just the values, as the rows in the original table did, but also a schema.

As such it is quite verbose, so the size of a cell can be several times bigger than the original size [which limits the usage of this approach ... unless you get even more creative :o) ... and there are still plenty of areas here to apply it :o)]

