如何在BigQuery Standard SQL中进行可重复采样? [英] How to do repeatable sampling in BigQuery Standard SQL?

查看:154
本文介绍了如何在BigQuery Standard SQL中进行可重复采样?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

此博客一位Google Cloud员工介绍了如何对BigQuery中的机器学习进行数据集的可重复采样.这对于创建(和复制)数据的训练/验证/测试分区非常重要.

In this blog a Google Cloud employee explains how to do repeatable sampling of data sets for machine learning in BigQuery. This is very important for creating (and replicating) train/validation/test partitions of your data.

但是该博客使用旧版SQL,该旧版SQL Google现在已弃用,转而支持标准SQL.

However the blog uses Legacy SQL, which Google has now deprecated in favor of Standard SQL.

如何使用标准SQL重新编写如下所示的博客示例代码?

How would you re-write the blog's sampling code shown below, but using Standard SQL?

#legacySQL
SELECT
  date,
  airline,
  departure_airport,
  departure_schedule,
  arrival_airport,
  arrival_delay
FROM
  [bigquery-samples:airline_ontime_data.flights]
WHERE
  ABS(HASH(date)) % 10 < 8

推荐答案

标准SQL会这样重写查询:

Standard SQL would re-write the query thus:

#standardSQL
SELECT
  date,
  airline,
  departure_airport,
  departure_schedule,
  arrival_airport,
  arrival_delay
FROM
  `bigquery-samples.airline_ontime_data.flights`
WHERE
  ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8

具体是这些更改:

  • 句点(不是冒号),用于将Google Cloud项目与表名分开.
  • 反引号(而不是方括号)以转义表名称中的连字符.
  • MOD函数(不是%).
  • FARM_FINGERPRINT(不是HASH).实际上,这是与旧版SQL HASH不同的哈希函数,该函数实际上随着时间的推移并不一致就像博客所暗示的那样.
  • a period (not colon) to separate the Google Cloud project from table name.
  • backticks (not square brackets) to escape hyphen in the table name.
  • MOD function (not %).
  • FARM_FINGERPRINT (not HASH). This is actually a different hashing function than Legacy SQL's HASH, which wasn't in fact consistent over time as the blog had implied.

这篇关于如何在BigQuery Standard SQL中进行可重复采样?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆