How to create a view against a table that has record fields?
Problem description
We have a weekly backup process which exports our production Google App Engine Datastore onto Google Cloud Storage, and then into Google BigQuery. Each week, we create a new dataset named like YYYY_MM_DD that contains a copy of the production tables on that day. Over time, we have collected many datasets, like 2014_05_10, 2014_05_17, etc. I want to create a dataset Latest_Production_Data that contains a view for each of the tables in the most recent YYYY_MM_DD dataset. This will make it easier for downstream reports to write their query once and always retrieve the most recent data.
To do this, I have code that gets the most recent dataset, and the names of all the tables that dataset contains, from the BigQuery API. Then, for each of these tables, I fire a tables.insert call to create a view that is a SELECT * from the table I want to reference.
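A minimal sketch of this step, under stated assumptions: the helper names latest_dataset and view_body are hypothetical (they are not part of the BigQuery API), and the request body simply mirrors the shape of the tables.insert call shown in this question. No API call is made here; the functions only pick the newest dataset name and build the body.

```python
import re

# Dataset ids in this setup look like 2014_05_17 (YYYY_MM_DD).
DATASET_NAME = re.compile(r"\d{4}_\d{2}_\d{2}")


def latest_dataset(dataset_ids):
    """Return the newest YYYY_MM_DD dataset id, or None if there is none."""
    dated = [d for d in dataset_ids if DATASET_NAME.fullmatch(d)]
    # Zero-padded YYYY_MM_DD names sort chronologically as plain strings.
    return max(dated) if dated else None


def view_body(project_id, source_dataset, table_id,
              target_dataset="Latest_Production_Data"):
    """Build a tables.insert body for a SELECT * view over one table."""
    return {
        "tableReference": {
            "projectId": project_id,
            "datasetId": target_dataset,
            "tableId": table_id,
        },
        "view": {
            # Legacy SQL table reference, as used in the question.
            "query": "SELECT * FROM [%s.%s]" % (source_dataset, table_id),
        },
    }


newest = latest_dataset(["2014_05_10", "2014_05_17", "Latest_Production_Data"])
body = view_body("redacted", newest, "AccountDeletionRequest")
```

The resulting dict is what would be passed as the body of the tables.insert request for each table in the newest dataset.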
Creating the view fails for tables that contain a RECORD field, because of what looks to be a pretty benign column-naming rule.
For example, I have this table:
For which I issue this API call:
{
    'tableReference': {
        'projectId': 'redacted',
        'tableId': u'AccountDeletionRequest',
        'datasetId': 'Latest_Production_Data'
    },
    'view': {
        'query': u'SELECT * FROM [2014_05_17.AccountDeletionRequest]'
    },
}
This results in the following error:
HttpError: https://www.googleapis.com/bigquery/v2/projects//datasets/Latest_Production_Data/tables?alt=json returned "Invalid field name "__key__.namespace". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.">
When I execute this query in the BigQuery web console, the columns are renamed so that each . becomes an _. I kind of expected the same thing to happen when I issued the create view API call.
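To make the naming rule concrete, here is a small sketch that encodes the rule exactly as the error message states it (letters, numbers, and underscores only; must start with a letter or underscore; at most 128 characters). The function name is_valid_field_name is hypothetical, and the check is only as accurate as that error text.

```python
import re

# First character: letter or underscore; up to 127 more letters, digits,
# or underscores, for a maximum length of 128.
FIELD_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,127}")


def is_valid_field_name(name):
    """Check a column name against the rule quoted in the error message."""
    return FIELD_NAME.fullmatch(name) is not None


dotted = "__key__.namespace"
renamed = dotted.replace(".", "_")
```

Under this rule the dotted name from the error is rejected, while the underscore-renamed form that the web console produces passes.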
Is there an easy way I can programmatically create a view for each of the tables in my dataset, regardless of their underlying schema? The problem I'm encountering now is for record columns, but another problem I anticipate is for tables that have repeated fields. Is there some magic alternative to SELECT * that will take care of all these intricacies for me?
Another idea I had was doing a table copy, but I would prefer not to duplicate the data if I can at all avoid it.
Here is the workaround code I wrote to dynamically generate a SELECT statement for each of the tables:
def get_leaf_column_selectors(dataset, table):
    schema = table_service.get(
        projectId=BQ_PROJECT_ID,
        datasetId=dataset,
        tableId=table
    ).execute()['schema']
    return ",\n".join([
        _get_leaf_selectors("", top_field)
        for top_field in schema["fields"]
    ])


def _get_leaf_selectors(prefix, field):
    if prefix:
        format = prefix + ".%s"
    else:
        format = "%s"
    if 'fields' not in field:
        # Base case
        actual_name = format % field["name"]
        safe_name = actual_name.replace(".", "_")
        return "%s as %s" % (actual_name, safe_name)
    else:
        # Recursive case
        return ",\n".join([
            _get_leaf_selectors(format % field["name"], sub_field)
            for sub_field in field["fields"]
        ])
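For illustration, the same flattening idea can be exercised offline against an in-memory schema. This is a restatement, not the code above: leaf_selectors is a hypothetical helper that returns a list instead of a joined string, and the sample schema is invented except for __key__.namespace, which appears in the error message. It does not address repeated fields.

```python
def leaf_selectors(fields, prefix=""):
    """Return 'a.b as a_b' selectors for every leaf field in a schema."""
    selectors = []
    for field in fields:
        name = prefix + field["name"]
        if "fields" in field:
            # RECORD field: recurse into its sub-fields with a dotted prefix.
            selectors.extend(leaf_selectors(field["fields"], name + "."))
        else:
            # Leaf field: alias the dotted path to an underscore-only name.
            selectors.append("%s as %s" % (name, name.replace(".", "_")))
    return selectors


# Hypothetical schema in the shape returned by tables.get()['schema'].
sample_schema = {
    "fields": [
        {"name": "__key__", "type": "RECORD", "fields": [
            {"name": "namespace", "type": "STRING"},
            {"name": "id", "type": "INTEGER"},
        ]},
        {"name": "reason", "type": "STRING"},
    ]
}

selectors = leaf_selectors(sample_schema["fields"])
query = "SELECT\n%s\nFROM [%s.%s]" % (
    ",\n".join(selectors), "2014_05_17", "AccountDeletionRequest")
print(query)
```

The generated query string would replace the SELECT * in the view definition, so that every column in the view has a name made only of letters, numbers, and underscores.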