UNNESTING multiple arrays in BigQuery


Problem description


In this example, I have a book database, with one record per book. The records contain the book owners, the genre, and some other info. I need to return a sample of the top 20 per owner, per genre, along with all the data in the row.

I have this code, which does what I need for one data point in the row (Data_one):

WITH `project.dataset.table` AS (
  SELECT 
    Name name, 
    Genre genre, 
    Data_one org
  FROM `project.dataset.booktable`
), search AS (
  SELECT name, genre FROM
  UNNEST(['Alex','James']) name, 
  UNNEST(['HORROR','COMEDY']) genre
)
SELECT name, genre, org 
FROM (
  SELECT t.name, t.genre, ARRAY_AGG(t.org LIMIT 20) orgs
  FROM `project.dataset.table` t JOIN search s 
  ON LOWER(s.name) = LOWER(t.name) 
  AND LOWER(s.genre) = LOWER(t.genre) 
  WHERE RAND() < 0.5
  GROUP BY t.name, t.genre
), UNNEST(orgs) org
ORDER BY name, genre, org

But when I try to extend it to a second piece of data from the row (and eventually quite a few more), the number of records returned inflates by a factor of 200:

WITH `project.dataset.table` AS (
  SELECT 
    Name name, 
    Genre genre, 
    Data_one org,
    Data_two org2
  FROM `project.dataset.booktable`
), search AS (
  SELECT name, genre FROM
  UNNEST(['Alex','James']) name, 
  UNNEST(['HORROR','COMEDY']) genre
)
SELECT name, genre, org, org2 
FROM (
  SELECT t.name, t.genre, ARRAY_AGG(t.org LIMIT 20) orgs, ARRAY_AGG(t.org2 LIMIT 20) orgs2
  FROM `project.dataset.table` t JOIN search s 
  ON LOWER(s.name) = LOWER(t.name) 
  AND LOWER(s.genre) = LOWER(t.genre) 
  WHERE RAND() < 0.5
  GROUP BY t.name, t.genre
), UNNEST(orgs) org, UNNEST(orgs2) org2
ORDER BY name, genre, org, org2

I know UNNEST turns an array into a table, but is this somehow creating an array of an array and unnesting that? I am unfamiliar with the syntax.

Edit: The data I am trying to get is all on the same level: single data points (no arrays) that are a mixture of NULLABLE STRING, INTEGER, TIMESTAMP, and FLOAT columns.

E.g.:

Genre   STRING  NULLABLE
Name    STRING  NULLABLE    
Data_one    STRING  NULLABLE    
Data_two    STRING  NULLABLE    
Data_three  INTEGER NULLABLE    
Data_four   TIMESTAMP   NULLABLE    

Owner   |   Genre    |   Data_one    | Data_two   |Data_three|Data_four
Alex    |   Horror   |  Stephen King |    IT      |    3     |2018-01-02
Alex    |   Sci-fi   |   Andy Weir   |The Martian |    5     |2018-01-02
James   |   Horror   |  Bram Stoker  |   Dracula  |    2     |2018-01-02
Sarah   |   Horror   |  Stephen King | The Stand  |    3     |2018-01-02
James   |   Horror   |  Stephen King |Pet Sematary|    2     |2018-01-02

Solution

As your question lacks details, the answer below is just a direction for you to explore:

#standardSQL
SELECT name, genre, data_one, data_two FROM (
  SELECT t.name, t.genre, ARRAY_AGG(t.org LIMIT 20) orgs, ARRAY_AGG(t.org2 LIMIT 20) orgs2
  FROM `project.dataset.table` t JOIN search s 
  ON LOWER(s.name) = LOWER(t.name) 
  AND LOWER(s.genre) = LOWER(t.genre) 
  WHERE RAND() < 0.5
  GROUP BY t.name, t.genre
), UNNEST(orgs) data_one WITH OFFSET pos1
, UNNEST(orgs2) data_two WITH OFFSET pos2
WHERE pos1 = pos2
ORDER BY name, genre, data_one

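As you can see, WITH OFFSET is introduced here to identify the position of each element within its array, and only those combinations that share the same position are kept in the result. Without that filter, the two comma-separated UNNEST calls are cross joined, so every element of orgs is paired with every element of orgs2 (up to 20 x 20 = 400 rows per group instead of 20).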
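To see the effect in isolation, here is a minimal sketch that counts the rows produced by both forms. The two literal arrays are made up for illustration and only stand in for orgs and orgs2; they are not taken from the question's table.

```sql
#standardSQL
-- Hypothetical literal arrays standing in for orgs and orgs2.
WITH grouped AS (
  SELECT
    ['Stephen King', 'Bram Stoker', 'Andy Weir'] AS orgs,
    ['IT', 'Dracula', 'The Martian'] AS orgs2
)
SELECT
  -- Comma-separated UNNESTs are cross joined: 3 x 3 = 9 combinations.
  (SELECT COUNT(*)
   FROM UNNEST(orgs) AS org, UNNEST(orgs2) AS org2) AS cross_join_rows,
  -- Keeping only equal offsets pairs elements by position: 3 rows.
  (SELECT COUNT(*)
   FROM UNNEST(orgs) AS org WITH OFFSET AS pos1,
        UNNEST(orgs2) AS org2 WITH OFFSET AS pos2
   WHERE pos1 = pos2) AS paired_rows
FROM grouped
```

With two arrays of 20 elements each, the same arithmetic gives 400 cross-joined rows against 20 paired rows per name and genre.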

In a real use case, you most likely have another field that identifies which data_one and data_two belong to the same row, and that field can be used to pair them.

Hope this helps point you in the right direction.
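One way this could look, as a sketch only: ARRAY_AGG without ORDER BY does not guarantee element order, so ordering both arrays by the same key keeps their positions aligned. The book_id column and the inline rows below are assumptions made for illustration; the question does not show such a column.

```sql
#standardSQL
-- Hypothetical rows; book_id is an assumed unique key, not a column from the question.
WITH books AS (
  SELECT 1 AS book_id, 'Alex' AS name, 'Horror' AS genre,
         'Stephen King' AS data_one, 'IT' AS data_two UNION ALL
  SELECT 2, 'James', 'Horror', 'Bram Stoker', 'Dracula' UNION ALL
  SELECT 3, 'James', 'Horror', 'Stephen King', 'Pet Sematary'
)
SELECT name, genre, data_one, data_two
FROM (
  SELECT name, genre,
    -- Same ORDER BY key in both aggregates, so position N in orgs
    -- and position N in orgs2 come from the same source row.
    ARRAY_AGG(data_one ORDER BY book_id LIMIT 20) AS orgs,
    ARRAY_AGG(data_two ORDER BY book_id LIMIT 20) AS orgs2
  FROM books
  GROUP BY name, genre
), UNNEST(orgs) AS data_one WITH OFFSET AS pos1,
   UNNEST(orgs2) AS data_two WITH OFFSET AS pos2
WHERE pos1 = pos2
ORDER BY name, genre, data_one
```

Note that ordering by a key returns the first 20 rows by that key rather than a random sample, so this trades the RAND() sampling of the original query for a deterministic pairing.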

Update

As you have added a schema and example, see below:

#standardSQL
SELECT name, genre, data.data_one, data.data_two, data.data_three, data.data_four 
FROM (
  SELECT t.name, t.genre, 
    ARRAY_AGG(STRUCT(data_one, data_two, data_three, data_four) LIMIT 20) data
  FROM `project.dataset.table` t JOIN search s 
  ON LOWER(s.name) = LOWER(t.name) 
  AND LOWER(s.genre) = LOWER(t.genre) 
  WHERE RAND() < 0.5
  GROUP BY t.name, t.genre
), UNNEST(data) data
ORDER BY name, genre

That is exactly what I mentioned in the comments on your very first related question in another post: you can just use org.data_one, org.data_two in your SELECT statement.
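For experimenting without access to the real table, a self-contained variant of the STRUCT approach might look like the sketch below. The inline rows are adapted from the example table in the question, and the names books, sampled, and book are placeholders chosen here, not names from the original queries.

```sql
#standardSQL
-- Hypothetical inline rows adapted from the example table in the question.
WITH books AS (
  SELECT 'Alex' AS name, 'Horror' AS genre, 'Stephen King' AS data_one,
         'IT' AS data_two, 3 AS data_three,
         TIMESTAMP '2018-01-02 00:00:00' AS data_four UNION ALL
  SELECT 'Alex', 'Sci-fi', 'Andy Weir', 'The Martian', 5,
         TIMESTAMP '2018-01-02 00:00:00' UNION ALL
  SELECT 'James', 'Horror', 'Bram Stoker', 'Dracula', 2,
         TIMESTAMP '2018-01-02 00:00:00' UNION ALL
  SELECT 'James', 'Horror', 'Stephen King', 'Pet Sematary', 2,
         TIMESTAMP '2018-01-02 00:00:00'
)
SELECT name, genre,
       book.data_one, book.data_two, book.data_three, book.data_four
FROM (
  SELECT name, genre,
    -- A single array of STRUCTs keeps all fields of a source row together,
    -- so only one UNNEST is needed and no cross join occurs.
    ARRAY_AGG(STRUCT(data_one, data_two, data_three, data_four) LIMIT 20) AS sampled
  FROM books
  GROUP BY name, genre
), UNNEST(sampled) AS book
ORDER BY name, genre
```

Because each array element is a whole row's worth of fields, adding further columns only means adding them to the STRUCT and to the outer SELECT list.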
