Cosine similarity between pairs of arrays in BigQuery


Question

I have created a table that holds pairs of IDs and the coordinates for each of them, so that I can calculate the pairwise cosine similarity between them.

The table has the columns id1, coord1, id2 and coord2.

The coordinates currently have 128 dimensions, but this can vary. Within a given table, however, both arrays of a pair always have the same number of dimensions.

coord1 and coord2 are repeated fields (arrays) of floating-point values.

Is there a way to calculate the cosine similarity between them?

My expected output would have three columns: id1, id2 and cosine_similarity.

Recommended answer

The following is for BigQuery Standard SQL:

#standardSQL
SELECT id1, id2, ( 
  SELECT 
    SUM(value1 * value2)/ 
    SQRT(SUM(value1 * value1))/ 
    SQRT(SUM(value2 * value2))
  FROM UNNEST(coord1) value1 WITH OFFSET pos1 
  JOIN UNNEST(coord2) value2 WITH OFFSET pos2 
  ON pos1 = pos2
  ) cosine_similarity
FROM `project.dataset.table`  
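
For reference, the subquery above evaluates the usual cosine similarity formula. The symbols here are only illustrative and do not appear in the query: a stands for coord1, b for coord2, and n for their shared number of dimensions.

$$
\text{cosine\_similarity}(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\;\sqrt{\sum_{i=1}^{n} b_i^2}}
$$

The JOIN on pos1 = pos2 pairs up the elements that sit at the same array position, and dividing by the two square roots one after the other is the same as dividing by their product. Note that if either array consists only of zeros, the denominator is zero and the division fails; wrapping the division in SAFE_DIVIDE is one way to get NULL instead.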

Below is a dummy example for you to play with:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id1, [1.0, 2.0, 3.0, 4.0] coord1, 2 id2, [1.0, 2.0, 3.0, 4.0] coord2 UNION ALL
  SELECT 3, [2.0, 0.0, 1.0, 1.0, 0.0, 2.0, 1.0, 1.0], 4, [2.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
)
SELECT id1, id2, ( 
  SELECT 
    SUM(value1 * value2)/
    SQRT(SUM(value1 * value1))/ 
    SQRT(SUM(value2 * value2))
  FROM UNNEST(coord1) value1 WITH OFFSET pos1 
  JOIN UNNEST(coord2) value2 WITH OFFSET pos2 
  ON pos1 = pos2
  ) cosine_similarity
FROM `project.dataset.table`  

This gives the result:

Row  id1  id2  cosine_similarity
1    1    2    1.0
2    3    4    0.8215838362577491
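
If you want to sanity-check these numbers outside BigQuery, the same arithmetic can be sketched in plain Python. This is only a cross-check under the assumption that the two dummy rows are copied by hand; the function name cosine_similarity and the rows list are illustrative and not part of the original answer.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Same three sums as the SQL subquery: dot product and both squared norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / norm_a / norm_b


# The two dummy rows from the example query: (id1, coord1, id2, coord2).
rows = [
    (1, [1.0, 2.0, 3.0, 4.0], 2, [1.0, 2.0, 3.0, 4.0]),
    (3, [2.0, 0.0, 1.0, 1.0, 0.0, 2.0, 1.0, 1.0],
     4, [2.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]),
]

results = [(id1, id2, cosine_similarity(c1, c2)) for id1, c1, id2, c2 in rows]

for id1, id2, similarity in results:
    print(id1, id2, similarity)
```

The printed similarities should agree with the BigQuery output above up to floating-point rounding in the last digits: roughly 1.0 for the identical pair and roughly 0.8216 for the second pair, which is 9 / sqrt(12 * 10).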
