Calculating pairwise cosine similarity between quite a large number of vectors in BigQuery


Problem description


I have a table id_vectors that contains an id and its corresponding coordinates. Each coordinates value is a repeated field with 512 elements.

I am looking for the pairwise cosine similarity between all those vectors. E.g. if I have three ids 1, 2, and 3, then I am looking for a table with the cosine similarity between each pair (computed over the 512 coordinates), like below:

id1   id2   similarity
 1     2      0.5
 1     3      0.1
 2     3      0.99

Now in my table I have 424,970 unique IDs and their corresponding 512-dimension coordinates. This means that I basically need to create around (424970 * 424969 / 2) unique pairs of IDs and calculate their similarity.
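A quick back-of-the-envelope sketch in Python of why this blows past the quotas (the byte figure is my own rough assumption, ignoring encoding overhead and compression):

```python
# Back-of-the-envelope cost of the naive self-join, using the numbers above.
n = 424_970            # unique ids in the table
dims = 512             # elements in each coordinates array
pairs = n * (n - 1) // 2
print(pairs)           # 90_299_537_965 -- about 9.03e10 unique (id_1, id_2) pairs

# If every pair row also carries both 512-element FLOAT64 arrays (8 bytes each),
# the joined intermediate is on the order of:
terabytes = pairs * 2 * dims * 8 / 1e12
print(round(terabytes))  # ~740 TB to shuffle before any similarity math happens
```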

I first tried with the following query using reference from here:

#standardSQL
with pairwise as
(SELECT t1.id as id_1, t1.coords as coord1, t2.id as id_2, t2.coords as coord2
FROM `project.dataset.id_vectors` t1
inner join `project.dataset.id_vectors` t2
on t1.id < t2.id)

SELECT id_1, id_2, ( 
  SELECT 
    SUM(value1 * value2)/ 
    SQRT(SUM(value1 * value1))/ 
    SQRT(SUM(value2 * value2))
  FROM UNNEST(coord1) value1 WITH OFFSET pos1 
  JOIN UNNEST(coord2) value2 WITH OFFSET pos2 
  ON pos1 = pos2
  ) cosine_similarity
FROM pairwise
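For reference, the correlated subquery above computes plain cosine similarity; a minimal Python equivalent (illustrative only, with two short made-up vectors in place of the 512-element ones):

```python
import math

def cosine_similarity(v1, v2):
    # Same arithmetic as the subquery:
    # SUM(v1*v2) / SQRT(SUM(v1*v1)) / SQRT(SUM(v2*v2))
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / math.sqrt(sum(a * a for a in v1)) / math.sqrt(sum(b * b for b in v2))

print(cosine_similarity([1.0, 2.0, 3.0, 4.0], [2.0, 0.0, 1.0, 1.0]))  # ≈ 0.6708
```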

But after running for 6 hrs I encountered the following error message: Query exceeded resource limits. 2.2127481953201417E7 CPU seconds were used, and this query must use less than 428000.0 CPU seconds.

Then I thought rather than using an intermediate table pairwise, why don't I try to create that table first then do the cosine similarity calculation.

So I tried the following query:

SELECT t1.id as id_1, t1.coords as coord1, t2.id as id_2, t2.coords as coord2
FROM `project.dataset.id_vectors` t1
inner join `project.dataset.id_vectors` t2
on t1.id < t2.id

But this time the query could not be completed and I encountered the following message: Error: Quota exceeded: Your project exceeded quota for total shuffle size limit. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors.

Then I tried to create an even smaller table, by just creating the combination pairs of the ids and stripping off the coordinates, using the following query:

SELECT t1.id as id_1, t2.id as id_2
FROM `project.dataset.id_vectors` t1
inner join `project.dataset.id_vectors` t2
on t1.id < t2.id

Again my query ended up with the error message: Query exceeded resource limits. 610104.3843576935 CPU seconds were used, and this query must use less than 3000.0 CPU seconds. (error code: billingTierLimitExceeded)

I totally understand that this is a huge query and my stopping point is my billing quota.

What I am asking is: is there a way to execute the query in a smarter way, so that I do not exceed the resourceLimit, shuffleSizeLimit, or billingTierLimit?

Solution

Quick idea is - instead of joining the table on itself with the redundant coordinates - you should rather just create a simple table of pairs (id1, id2); you can then "dress" the respective ids with their coordinate vectors by adding two extra joins to dataset.table.id_vectors

Below is a quick example of what this could look like:

#standardSQL
WITH pairwise AS (
  SELECT t1.id AS id_1, t2.id AS id_2
  FROM `project.dataset.id_vectors` t1
  INNER JOIN `project.dataset.id_vectors` t2
  ON t1.id < t2.id
)
SELECT id_1, id_2, ( 
  SELECT 
    SUM(value1 * value2)/ 
    SQRT(SUM(value1 * value1))/ 
    SQRT(SUM(value2 * value2))
  FROM UNNEST(a.coords) value1 WITH OFFSET pos1 
  JOIN UNNEST(b.coords) value2 WITH OFFSET pos2 
  ON pos1 = pos2
  ) cosine_similarity
FROM pairwise t
JOIN `project.dataset.id_vectors` a ON a.id = id_1
JOIN `project.dataset.id_vectors` b ON b.id = id_2
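The shape of this rewrite can be mimicked outside SQL to see what changed: the self-join now shuffles only two small id columns, and the wide coordinate arrays are attached afterwards by key. A hypothetical Python sketch (toy data and names made up here):

```python
from itertools import combinations
import math

# Toy stand-in for `project.dataset.id_vectors`: id -> coords
id_vectors = {
    1: [1.0, 2.0, 3.0, 4.0],
    2: [1.0, 2.0, 3.0, 4.0],
    3: [2.0, 0.0, 1.0, 1.0],
}

# Step 1: the pairwise "table" carries only the two id columns ...
pairs = list(combinations(sorted(id_vectors), 2))

# Step 2: ... and each id is "dressed" with its vector by key lookup,
# mirroring the two extra joins back to id_vectors.
def cosine(v1, v2):
    dot = sum(x * y for x, y in zip(v1, v2))
    return dot / math.sqrt(sum(x * x for x in v1)) / math.sqrt(sum(y * y for y in v2))

for id_1, id_2 in pairs:
    print(id_1, id_2, cosine(id_vectors[id_1], id_vectors[id_2]))
```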

Obviously it works on a small dummy set, as you can see below:

#standardSQL
WITH `project.dataset.id_vectors` AS (
  SELECT 1 id, [1.0, 2.0, 3.0, 4.0] coords UNION ALL
  SELECT 2, [1.0, 2.0, 3.0, 4.0] UNION ALL
  SELECT 3, [2.0, 0.0, 1.0, 1.0] UNION ALL
  SELECT 4, [0, 2.0, 1.0, 1.0] UNION ALL 
  SELECT 5, [2.0, 1.0, 1.0, 0.0] UNION ALL
  SELECT 6, [1.0, 1.0, 1.0, 1.0]
), pairwise AS (
  SELECT t1.id AS id_1, t2.id AS id_2
  FROM `project.dataset.id_vectors` t1
  INNER JOIN `project.dataset.id_vectors` t2
  ON t1.id < t2.id
)
SELECT id_1, id_2, ( 
  SELECT 
    SUM(value1 * value2)/ 
    SQRT(SUM(value1 * value1))/ 
    SQRT(SUM(value2 * value2))
  FROM UNNEST(a.coords) value1 WITH OFFSET pos1 
  JOIN UNNEST(b.coords) value2 WITH OFFSET pos2 
  ON pos1 = pos2
  ) cosine_similarity
FROM pairwise t
JOIN `project.dataset.id_vectors` a ON a.id = id_1
JOIN `project.dataset.id_vectors` b ON b.id = id_2

with the result:

Row id_1    id_2    cosine_similarity    
1   1       2       1.0  
2   1       3       0.6708203932499369   
3   1       4       0.819891591749923    
4   1       5       0.521749194749951    
5   1       6       0.9128709291752769   
6   2       3       0.6708203932499369   
7   2       4       0.819891591749923    
8   2       5       0.521749194749951    
9   2       6       0.9128709291752769   
10  3       4       0.3333333333333334   
11  3       5       0.8333333333333335   
12  3       6       0.8164965809277261   
13  4       5       0.5000000000000001   
14  4       6       0.8164965809277261   
15  5       6       0.8164965809277261     

So, try it on your real data and let's see how it works for you :o)

And ... obviously you should pre-create / materialize the pairwise table

Another optimization idea is to have pre-calculated values of SQRT(SUM(value1 * value1)) in your project.dataset.id_vectors - this can save quite a lot of CPU. It should be a simple adjustment, so I leave it to you :o)
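A Python sketch of that last idea (the stored-norm field is hypothetical, not a column the question mentions): compute SQRT(SUM(value * value)) once per id, so each pair costs only a dot product and one division:

```python
import math

vectors = {1: [1.0, 2.0, 3.0, 4.0], 3: [2.0, 0.0, 1.0, 1.0]}

# Precompute each vector's norm once, instead of once per pair.
norms = {i: math.sqrt(sum(x * x for x in v)) for i, v in vectors.items()}

def cosine_with_precomputed_norms(a, b):
    dot = sum(x * y for x, y in zip(vectors[a], vectors[b]))
    return dot / (norms[a] * norms[b])  # no per-pair SQRT(SUM(...)) left

print(cosine_with_precomputed_norms(1, 3))  # same value as the direct formula
```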
