Cosine similarity between pairs of arrays in BigQuery
Question
I have created a table that has a pair of IDs and the coordinates for each of them, so that I can calculate the pairwise cosine similarity between them.
The table has an ID column and a coordinate column for each member of the pair (id1, coord1, id2, coord2). The coordinates currently have 128 dimensions, but this can vary. Within a single table, however, both IDs of a pair always have the same number of dimensions.
coord1 and coord2 are repeated fields (arrays) of floating-point values.
Is there a way to calculate the cosine similarity between them?
My expected output would have three columns: id1, id2, and cosine_similarity.
Recommended answer
The following is for BigQuery Standard SQL:
#standardSQL
SELECT id1, id2, (
  -- Dot product divided by the norms of the two vectors
  SELECT
    SUM(value1 * value2) /
    SQRT(SUM(value1 * value1)) /
    SQRT(SUM(value2 * value2))
  -- Pair up the elements of both arrays by their position
  FROM UNNEST(coord1) value1 WITH OFFSET pos1
  JOIN UNNEST(coord2) value2 WITH OFFSET pos2
    ON pos1 = pos2
) cosine_similarity
FROM `project.dataset.table`
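For reference, the subquery above appears to be a direct transcription of the usual definition of cosine similarity. Here a_i and b_i are notation introduced for this note: they stand for value1 and value2, and the join on pos1 = pos2 is what lines up matching positions i:

```latex
\[
\text{cosine\_similarity}(a, b)
  = \frac{\sum_{i} a_i \, b_i}
         {\sqrt{\sum_{i} a_i^{2}} \; \sqrt{\sum_{i} b_i^{2}}}
\]
```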
Below is a dummy example for you to play with:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id1, [1.0, 2.0, 3.0, 4.0] coord1, 2 id2, [1.0, 2.0, 3.0, 4.0] coord2 UNION ALL
  SELECT 3, [2.0, 0.0, 1.0, 1.0, 0.0, 2.0, 1.0, 1.0], 4, [2.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
)
SELECT id1, id2, (
  SELECT
    SUM(value1 * value2) /
    SQRT(SUM(value1 * value1)) /
    SQRT(SUM(value2 * value2))
  FROM UNNEST(coord1) value1 WITH OFFSET pos1
  JOIN UNNEST(coord2) value2 WITH OFFSET pos2
    ON pos1 = pos2
) cosine_similarity
FROM `project.dataset.table`
This produces the following result:
Row  id1  id2  cosine_similarity
1    1    2    1.0
2    3    4    0.8215838362577491
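As a sanity check outside BigQuery, the same arithmetic can be reproduced in a few lines of plain Python. This is only a minimal sketch: the function name cosine_similarity, the rows list, and the choice to return None for a zero-magnitude vector are assumptions made for illustration and are not part of the original answer. For the second row, the dot product is 9 and the squared norms are 12 and 10, so the similarity is 9 / sqrt(120), roughly 0.8216.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of numbers.

    Returns None when either vector has zero magnitude (an assumption of
    this sketch; the SQL query does not special-case that input).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return None
    # Same order of operations as the SQL: dot / |a| / |b|
    return dot / norm_a / norm_b


# The two rows from the dummy example: (id1, coord1, id2, coord2)
rows = [
    (1, [1.0, 2.0, 3.0, 4.0], 2, [1.0, 2.0, 3.0, 4.0]),
    (3, [2.0, 0.0, 1.0, 1.0, 0.0, 2.0, 1.0, 1.0],
     4, [2.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]),
]

results = [(id1, id2, cosine_similarity(c1, c2)) for id1, c1, id2, c2 in rows]
for id1, id2, sim in results:
    print(id1, id2, sim)
```

The printed values should agree with the cosine_similarity column above up to floating-point rounding.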