如何在 BigQuery Standard SQL 中进行可重复采样? [英] How to do repeatable sampling in BigQuery Standard SQL?

查看:31
本文介绍了如何在 BigQuery Standard SQL 中进行可重复采样?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

本博客 一位 Google Cloud 员工解释了如何在 BigQuery 中对用于机器学习的数据集进行可重复采样.这对于创建(和复制)数据的训练/验证/测试分区非常重要.

In this blog a Google Cloud employee explains how to do repeatable sampling of data sets for machine learning in BigQuery. This is very important for creating (and replicating) train/validation/test partitions of your data.

然而,该博客使用了旧版 SQL,它Google 现在已弃用而支持标准 SQL.

However the blog uses Legacy SQL, which Google has now deprecated in favor of Standard SQL.

您将如何使用标准 SQL 重新编写如下所示的博客示例代码?

How would you re-write the blog's sampling code shown below, but using Standard SQL?

#legacySQL
SELECT
  date,
  airline,
  departure_airport,
  departure_schedule,
  arrival_airport,
  arrival_delay
FROM
  [bigquery-samples:airline_ontime_data.flights]
WHERE
  ABS(HASH(date)) % 10 < 8

推荐答案

标准 SQL 会这样重写查询:

Standard SQL would re-write the query thus:

#standardSQL
SELECT
  date,
  airline,
  departure_airport,
  departure_schedule,
  arrival_airport,
  arrival_delay
FROM
  `bigquery-samples.airline_ontime_data.flights`
WHERE
  ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8

特别是这里的变化:

  • 用句点(不是冒号)将 Google Cloud 项目与表名分开.
  • 反引号(不是方括号)以转义表名中的连字符.
  • MOD 函数(不是 %).
  • FARM_FINGERPRINT(不是 HASH).这实际上是与 Legacy SQL 的 HASH 不同的散列函数,后者 不在事实随着时间的推移保持一致,正如博客所暗示的那样.
  • a period (not colon) to separate the Google Cloud project from table name.
  • backticks (not square brackets) to escape hyphen in the table name.
  • MOD function (not %).
  • FARM_FINGERPRINT (not HASH). This is actually a different hashing function than Legacy SQL's HASH, which wasn't in fact consistent over time as the blog had implied.

这篇关于如何在 BigQuery Standard SQL 中进行可重复采样?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆