BigQuery JavaScript UDF process - per row or per processing node?


Question

I'm thinking of using BigQuery's JavaScript UDF as a critical component in a new data architecture. It would be used to logically process each row loaded into the main table, and also to process each row during periodic and ad-hoc aggregation queries.

Using a SQL UDF for the same purpose seems infeasible because each row represents a complex object, and implementing the business logic in SQL, including things such as parsing complex text fields, gets ugly very fast.

I just came across this on the Optimizing query computation documentation page:

Best practice: Avoid using JavaScript user-defined functions. Use native UDFs instead.

Calling a JavaScript UDF requires the instantiation of a subprocess. Spinning up this process and running the UDF directly impacts query performance. If possible, use a native (SQL) UDF instead.

I understand why a new process is needed for each processing node, and I know that JS tends to be deployed in a single-thread-per-process manner (even though V8 does support multithreading these days). But it's not clear to me whether, once a JS runtime process is up, it can be expected to be reused between calls to the same function (e.g. for processing different rows on the same processing node). The amount of reuse will probably affect the cost significantly. My table is not that large (tens to hundreds of millions of rows), but I still need a better understanding here.

I could not find any authoritative source on this. Has anybody analyzed the actual impact of using a JavaScript UDF on each processed row, in terms of execution time and cost?

Recommended answer

If it's not documented, then it's an implementation detail that could change. But let's test it:

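CREATE TEMP FUNCTION randomThis(views INT64)
RETURNS FLOAT64
LANGUAGE js AS """
  // variable is deliberately left undeclared, so it is created as a global
  // and can outlive a single call if the JS runtime is reused.
  if (typeof variable === 'undefined') {
     variable = Math.random()
  }
  return variable
""";

-- Each distinct result corresponds to one JS runtime that initialized variable.
SELECT randomThis(views), COUNT(*) c
FROM (
  SELECT views
  FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
  LIMIT 10000000
)
GROUP BY 1
ORDER BY 2 DESC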

I was expecting ten million different numbers, or a handful, but I only got one: the same process was reused ten million times, and variables were kept around between calls.
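To see why a single distinct value points to reuse, here is a minimal local sketch in plain JavaScript (Node.js), not BigQuery itself. It runs the same UDF body through the Function constructor and simulates a "fresh runtime" by deleting the global. The helper countDistinctResults and its parameters are hypothetical names introduced only for this illustration, and nothing here says how BigQuery actually manages its runtimes.

```javascript
// The UDF body from the answer, unchanged. Code passed to the Function
// constructor runs in sloppy mode, so assigning to the undeclared name
// `variable` creates a property on the global object.
const udfBody = `
  if (typeof variable === 'undefined') {
     variable = Math.random()
  }
  return variable
`;

// Hypothetical stand-in for the engine invoking the UDF once per row.
const randomThis = new Function('views', udfBody);

// Calls the UDF for `rows` rows and returns how many distinct values came back.
// Every `rowsPerRuntime` rows the global is deleted, which simulates the
// function starting in a brand-new JS runtime.
function countDistinctResults(rows, rowsPerRuntime) {
  const seen = new Set();
  for (let i = 0; i < rows; i++) {
    if (i % rowsPerRuntime === 0) {
      delete globalThis.variable; // simulated fresh runtime
    }
    seen.add(randomThis(i));
  }
  delete globalThis.variable; // leave no state behind
  return seen.size;
}

const reused = countDistinctResults(100000, 100000); // one runtime for all rows
const perRow = countDistinctResults(1000, 1);        // new runtime for every row

console.log('distinct values with one reused runtime:', reused);
console.log('distinct values with a runtime per row:', perRow);
```

With one simulated runtime the count is 1, which matches what the query returned. With a simulated runtime per row, the count is roughly the number of rows.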

This even happened when I went up to 100 million rows, signaling that parallelism is bounded by one JS VM.

Again, these are implementation details that could change. But while it stays that way, you can make the most of it.
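One way to make use of this, sketched here in plain JavaScript under the assumption that state survives between calls, is to do expensive setup once and cache it. The names buildParser, parseRow, and initCount, and the key=value;key=value input format, are hypothetical and chosen only for illustration. Because reuse is an undocumented implementation detail, the cache is a pure optimization: the function returns the same result even if the runtime is recreated for every call.

```javascript
// State outside the function body survives between calls for as long as the
// runtime lives. (In a BigQuery UDF body this would be an undeclared global,
// as in the test above.)
let compiledParser = null;
let initCount = 0; // counts how often the expensive setup actually ran

// Hypothetical expensive setup, e.g. compiling patterns or building tables.
function buildParser() {
  initCount += 1;
  const fieldPattern = /(\w+)=([^;]*)/g;
  return function parse(text) {
    const result = {};
    // matchAll works on a copy of the regex, so no lastIndex state leaks
    // from one row to the next.
    for (const [, key, value] of text.matchAll(fieldPattern)) {
      result[key] = value;
    }
    return result;
  };
}

// Stand-in for the UDF entry point: initializes lazily, then reuses the parser.
function parseRow(text) {
  if (compiledParser === null) {
    compiledParser = buildParser();
  }
  return compiledParser(text);
}

const rows = ['id=1;name=a', 'id=2;name=b', 'id=3;name=c'];
const parsed = rows.map(parseRow);

console.log(parsed[1].name, initCount); // prints "b 1"
```

Correctness never depends on the cache here. If the runtime is not reused, buildParser simply runs more often and the query gets slower, not wrong.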
