Row Aggregation after Cross Join in BigQuery

Views: 170 | Published: 2018/5/7 17:29:27 | Tags: sql, google-bigquery, aggregation, data-analysis, cross-join

Problem description

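Say you have the following table in BigQuery:

    A =
    user1 | 0 0 |
    user2 | 0 3 |
    user3 | 4 0 |

After a cross join, you have:

    dist =
    | user1 user2  0 0 , 0 3 |   # the comma only marks the boundary between the two users' values
    | user1 user3  0 0 , 4 0 |
    | user2 user3  0 3 , 4 0 |

How can you perform row aggregation in BigQuery to compute a pairwise aggregate across rows? A typical use case is computing the Euclidean distance between two users. I want to compute the following metric between two users:

    sum(min(user1_row[i], user2_row[i]) / abs(user1_row[i] - user2_row[i]))

summed over all i for each pair of users.

For example, in Python you would simply write:

    dist.append([user1, user2,
                 np.sum([min(r1[i], r2[i]) / abs(r1[i] - r2[i])
                         for i in np.arange(row_length // 2)])])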

Solution

To start with the ugly way: you could flatten out the math into the query. That is, turn for i in ... sum(min(...)/abs(...)) into SQL that operates over each of the fields explicitly. Note that MIN and SUM are aggregate functions, which you don't want to use here. Instead, use + for SUM and IF(a < b, a, b) for MIN. ABS(a - b) looks like IF(a < b, b-a, a-b).

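The flattened-out form described above can be sketched in Python for a single pair of rows. This is only an illustration: the field names x, y, z and the sample values are hypothetical, and the helper names pair_min and abs_diff are invented for this sketch. The values are chosen so that no field is equal on both sides, because equal values would make the denominator zero.

```python
# Hypothetical rows with three numeric fields each (x, y, z); the field
# names and values are illustrative and not taken from the question.
left = {"user": "user1", "x": 1.0, "y": 5.0, "z": 2.0}
right = {"user": "user2", "x": 3.0, "y": 4.0, "z": 6.0}


def pair_min(a, b):
    # Scalar stand-in for IF(a < b, a, b)
    return a if a < b else b


def abs_diff(a, b):
    # Scalar stand-in for IF(a < b, b - a, a - b)
    return b - a if a < b else a - b


# One term per field, joined with "+" instead of an aggregate SUM.
metric = (
    pair_min(left["x"], right["x"]) / abs_diff(left["x"], right["x"])
    + pair_min(left["y"], right["y"]) / abs_diff(left["y"], right["y"])
    + pair_min(left["z"], right["z"]) / abs_diff(left["z"], right["z"])
)
print(left["user"], right["user"], metric)
```

The expression grows by one term per field, which is why this approach gets unwieldy for wide rows.
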
If you were just computing the Euclidean distance, you could do:

    SELECT left.user, right.user,
      SQRT((left.x-right.x)*(left.x-right.x)
         + (left.y-right.y)*(left.y-right.y)
         + (left.z-right.z)*(left.z-right.z)) AS dist
    FROM (
      SELECT * FROM dataset.table1 AS left
      CROSS JOIN dataset.table1 AS right)

The nicer way is to use User-Defined Functions and to store the vectors as repeated values. You can then write a DISTANCE() function that performs your computation over the two arrays from the left and right sides of the cross join. If you're not in the UDF beta program and would like to join, please contact Google Cloud support.

Finally, if you change your schema from

    {user:string, field1:float, field2:float, field3:float,...}

to

    {user:string, fields:[field:float]}

you could then flatten the field with its position and do the cross join on that. As in:

    SELECT user, field, index
    FROM (FLATTEN((
      SELECT user, fields.field AS field, POSITION(fields.field) AS index
      FROM [dataset1.table1]
    ), fields))

If you save this as a view, call it "dataset1.flat_view". Then you can do your join:

    SELECT left.user AS user1, right.user AS user2,
      left.field AS l, right.field AS r
    FROM dataset1.flat_view left
    JOIN dataset1.flat_view right
    ON left.index = right.index
    WHERE left.user != right.user

This gives you one row for each pair of users and each matching field. You can save that as the view "dataset1.joined_view". Finally, you can do your aggregation. Since you want this:

    sum(min(user1_row[i], user2_row[i]) / abs(user1_row[i] - user2_row[i]))

it would look like:

    SELECT user1, user2,
      SUM((IF(l < r, l, r)) / (IF(l > r, l-r, r-l)))
    FROM [dataset1.joined_view]
    GROUP EACH BY user1, user2
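To make the flatten, join, and aggregate steps concrete, here is a rough Python emulation of the three views. It is a sketch under assumptions, not BigQuery code: the table contents are hypothetical, the variable names only mirror the view names used in the answer, and positions are numbered from 1 on the assumption that POSITION() is one-based. The sample values avoid equal entries at the same index, which would divide by zero.

```python
from collections import defaultdict

# Hypothetical table in the {user, fields: [...]} schema; values are
# illustrative and chosen so no two users share a value at the same index.
table = {
    "user1": [1.0, 5.0],
    "user2": [3.0, 4.0],
    "user3": [2.0, 6.0],
}

# Step 1 (flat_view): one (user, field, index) row per array element.
# Indexes start at 1 here; that numbering is an assumption of this sketch.
flat_view = [
    (user, value, index)
    for user, values in table.items()
    for index, value in enumerate(values, start=1)
]

# Step 2 (joined_view): self-join on index, dropping same-user pairs.
joined_view = [
    (left_user, right_user, left_value, right_value)
    for left_user, left_value, left_index in flat_view
    for right_user, right_value, right_index in flat_view
    if left_index == right_index and left_user != right_user
]

# Step 3: group by (user1, user2) and sum min(l, r) / |l - r|.
totals = defaultdict(float)
for user1, user2, l, r in joined_view:
    totals[(user1, user2)] += min(l, r) / abs(l - r)

for pair, total in sorted(totals.items()):
    print(pair, total)
```

Because the join condition only excludes identical users, each pair shows up in both orders, for example (user1, user2) and (user2, user1), with the same total. Filtering on left user < right user instead would keep one row per unordered pair.
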
