Get BigQuery table schema from SELECT statements
Question
I realize there are a million ways to get a schema from a dataset.table in Google BigQuery.
Is there a way to get schema data via a SELECT statement, similar to querying SQL Server's INFORMATION_SCHEMA tables?
Thanks.
Recommended answer
I need to perform data profiling, and the only tool I have is the QUERY function in the web UI. I want to create a query that counts nulls, non-nulls, string lengths, and so on per column.
The query below is meant as a direction to explore and extend to fit your needs.
It works reasonably well for simple schemas; it would need tuning for schemas with RECORD and REPEATED fields.
Also note that it skips columns that are NULL in every row of the table, so such columns are not visible with this approach.
So, with fh-bigquery.reddit.subreddits as a simple test table:
#standardSQL
WITH `table` AS (
  SELECT * FROM `fh-bigquery.reddit.subreddits`
),
-- Serialize each row to JSON and strip the outer braces.
table_as_json AS (
  SELECT REGEXP_REPLACE(TO_JSON_STRING(t), r'^{|}$', '') AS row
  FROM `table` AS t
),
-- Split each row into "name":value fragments, then into name and value.
-- Note: a value containing ':' is cut at its first ':' by this split.
pairs AS (
  SELECT
    REPLACE(column_name, '"', '') AS column_name,
    IF(SAFE_CAST(column_value AS STRING) = 'null', NULL, column_value) AS column_value
  FROM table_as_json, UNNEST(SPLIT(row, ',"')) AS z,
    UNNEST([SPLIT(z, ':')[SAFE_OFFSET(0)]]) AS column_name,
    UNNEST([SPLIT(z, ':')[SAFE_OFFSET(1)]]) AS column_value
)
-- Aggregate per-column profiling metrics.
SELECT
  column_name,
  COUNT(DISTINCT column_value) AS _distinct_values,
  COUNTIF(column_value IS NULL) AS _nulls,
  COUNTIF(column_value IS NOT NULL) AS _non_nulls,
  MIN(LENGTH(SAFE_CAST(column_value AS STRING))) AS _min_length,
  MAX(LENGTH(SAFE_CAST(column_value AS STRING))) AS _max_length,
  ROUND(AVG(LENGTH(SAFE_CAST(column_value AS STRING)))) AS _avr_length
FROM pairs
WHERE column_name <> ''
GROUP BY column_name
ORDER BY column_name
The result is (the _distinct_values column is not shown):
column_name   _nulls  _non_nulls  _min_length  _max_length  _avr_length
------------  ------  ----------  -----------  -----------  -----------
c_posts       0       2499        1            4            4.0
created_utc   0       2499        14           14           14.0
downs         0       2499        1            8            5.0
num_comments  0       2499        1            7            5.0
score         0       2499        1            7            5.0
subr          0       2499        4            23           12.0
ups           0       2499        1            8            5.0
I think this is very close to what is called profiling, and it stays within the tools available to you.
You can easily add any other per-column metrics.
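As a rough illustration of adding metrics, the sketch below computes three extra per-column aggregates. The inline `pairs` rows are hypothetical sample data standing in for the `pairs` CTE of the profiling query, and the metric names (_empty_strings, _numeric_values, _null_ratio) are made up for this example rather than taken from the original answer.

```sql
#standardSQL
-- Hypothetical sample rows standing in for the `pairs` CTE of the profiling query.
-- String values keep their JSON quotes, as they do in that query.
WITH pairs AS (
  SELECT 'subr' AS column_name, '"pics"' AS column_value UNION ALL
  SELECT 'subr', '""' UNION ALL
  SELECT 'ups', '12' UNION ALL
  SELECT 'ups', CAST(NULL AS STRING)
)
SELECT
  column_name,
  -- Empty JSON strings appear as two double quotes.
  COUNTIF(column_value = '""') AS _empty_strings,
  -- Values that parse as a number.
  COUNTIF(SAFE_CAST(column_value AS FLOAT64) IS NOT NULL) AS _numeric_values,
  -- Share of NULL values among all rows seen for the column.
  ROUND(COUNTIF(column_value IS NULL) / COUNT(*), 3) AS _null_ratio
FROM pairs
GROUP BY column_name
ORDER BY column_name
```

To use these against a real table, the three aggregate expressions could be appended to the final SELECT list of the profiling query in place of the sample CTE.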
I really think this can be a good starting point for you.
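For the schema itself, BigQuery also exposes INFORMATION_SCHEMA views that can be queried with a plain SELECT, much like in SQL Server. The sketch below is not part of the original answer; it assumes your account has permission to read metadata for the dataset, and it reuses the same test table.

```sql
#standardSQL
-- Lists the columns of one table from the dataset-level COLUMNS view.
-- Assumes metadata read access to the fh-bigquery.reddit dataset.
SELECT
  column_name,
  ordinal_position,
  data_type,
  is_nullable
FROM `fh-bigquery.reddit.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'subreddits'
ORDER BY ordinal_position
```

Unlike the JSON-based profiling query, this reads table metadata only, so it should also list columns that are NULL in every row.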