使用嵌套和重复字段获取最新的列值 [英] Get Most Recent Column Value With Nested And Repeated Fields

查看:101
本文介绍了使用嵌套和重复字段获取最新的列值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一张具有以下结构的表格:



以及以下数据其中:

  [
{
addresses:[
{
城市:纽约
},
{
城市:旧金山
}
],
年龄 :26.0,
name:Foo Bar,
createdAt:2016-02-01 15:54:25 UTC
},
{
addresses:[
{
city:纽约
},
{
城市:旧金山
}
],
age:26.0,
name:Foo Bar,
createdAt:2016-02-01 15 :54:16 UTC

]

我想要什么要做的是重新创建相同的表(相同的结构),但只有最新版本的行。在这个例子中,我们假设我想按名称对所有内容进行分组,并将最新的createdAt作为该行。
我尝试这样做:



并且您希望获得基于createdAt列的最新记录,因此结果将显示例如:



下面的代码是这样的:

  SELECT name,age,createdAt,addresses.city 
FROM JS(
(// input table
SELECT name,age,createdAt,NEST(city)AS地址
FROM(
SELECT name,age,createdAt,addresses.city
FROM(
SELECT
姓名,年龄,createdAt,地址.city,
MAX(createdAt)OVER(PARTITION BY name,age)AS lastAt $ b $ FROM yourTable

WHERE createdAt = lastAt

GROUP BY name,age,createdAt
),
name,age,createdAt,addresses,// input columns
[// output schema
{'name':'name','type':'STRING '',
{'name':'age','type':'INTEGER'},
{'name':'createdAt','type':'INTEGER'},
{'name':'addresses','type':'RECORD',
'mode':'REPEATED',
'fields':[
{'name':'city ','type':'STRING'}
]
}
],
function(row,emit){//函数
var c = [ ]。 (var i = 0; i< row.addresses.length; i ++){
c.push({city:row.addresses [i]})的
;
};
emit({name:row.name,age:row.age,createdAt:row.createdAt,addresses:c});
}

上面的代码工作方式是:它隐含地变平原始记录;查找属于最近记录的行(按名称和年龄划分);将这些行组合到相应的记录中;最后一步是使用 JS UDF 构建适当的模式,实际上可以将它们写回到BigQuery表中,作为嵌套/重复与拼合



最后一步是这个解决方法中最烦人的部分,因为它需要每次为特定模式定制

只是地址记录中的一个嵌套字段,因此 NEST()功能起作用。在内部不只有一个
字段的情况下 - 以上方法仍然有效,但您需要涉及这些fi的并置把它们放在nest()里面,而不是在js里面做额外的分割,等等。

你可以在下面的答案中看到例子:

创建一个表与记录类型列

用列类型RECORD创建表格 >
如何在不改变表格模式的情况下将查询结果存储在当前表格中?



我希望这是很好的基础你可以尝试并让你的案例工作!


I have a table with the following structure:

and the following data in it:

[
  {
    "addresses": [
      {
        "city": "New York"
      },
      {
        "city": "San Francisco"
      }
    ],
    "age": "26.0",
    "name": "Foo Bar",
    "createdAt": "2016-02-01 15:54:25 UTC"
  },
  {
    "addresses": [
      {
        "city": "New York"
      },
      {
        "city": "San Francisco"
      }
    ],
    "age": "26.0",
    "name": "Foo Bar",
    "createdAt": "2016-02-01 15:54:16 UTC"
  }
]

What I'd like to do is recreate the same table (same structure) but with only the latest version of a row. In this example let's say that I'd like to group by everything by name and take the row with the most recent createdAt. I tried to do something like this: Google Big Query SQL - Get Most Recent Column Value but I couldn't get it to work with record and repeated fields.

解决方案

I really hoped someone from Google Team will provide answer on this question as it is very frequent topic/problem asked here on SO. BigQuery definitelly not friendly enough with writing Nested / Repeated stuff back to BQ off of BQ query.

So, I will provide the workaround I found relatively long time ago. I DO NOT like it, but (and that is why I hoped for the answer from Google Team) it works. I hope you will be able to adopt it for you particular scenario

So, based on your example, assume you have table as below

and you expect to get most recent records based on createdAt column, so result will look like:

Below code does this:

SELECT name, age, createdAt, addresses.city
FROM JS( 
  ( // input table 
    SELECT name, age, createdAt, NEST(city) AS addresses 
    FROM (
      SELECT name, age, createdAt, addresses.city 
      FROM (
        SELECT 
          name, age, createdAt, addresses.city, 
          MAX(createdAt) OVER(PARTITION BY name, age) AS lastAt
        FROM yourTable
      )
      WHERE createdAt = lastAt
    )
    GROUP BY name, age, createdAt
  ), 
  name, age, createdAt, addresses, // input columns 
  "[ // output schema 
    {'name': 'name', 'type': 'STRING'},
    {'name': 'age', 'type': 'INTEGER'},
    {'name': 'createdAt', 'type': 'INTEGER'},
    {'name': 'addresses', 'type': 'RECORD',
     'mode': 'REPEATED',
     'fields': [
       {'name': 'city', 'type': 'STRING'}
       ]    
     }
  ]", 
  "function(row, emit) { // function 
    var c = []; 
    for (var i = 0; i < row.addresses.length; i++) { 
      c.push({city:row.addresses[i]});
    }; 
    emit({name: row.name, age: row.age, createdAt: row.createdAt, addresses: c}); 
  }"
) 

the way above code works is: it implicitely flattens original records; find rows that belong to most recent records (partitioned by name and age); assembles those rows back into respective records. final step is processing with JS UDF to build proper schema that can be actually written back to BigQuery Table as nested/repeated vs flatten

The last step is the most annoying part of this workaround as it needs to be customized each time for specific schema(s)

Please note, in this example - it is only one nested field inside addresses record, so NEST() fuction worked. In scenarious when you have more than just one field inside - above approach still works, but you need to involve concatenation of those fields to put them inside nest() and than inside js function to do extra splitting those fields, etc.
You can see examples in below answers:
Create a table with Record type column
create a table with a column type RECORD
How to store the result of query on the current table without changing the table schema?

I hope this is good foundation for you to experiment with and make your case work!

这篇关于使用嵌套和重复字段获取最新的列值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆