How to use bigquery correlation based on many columns?


Problem description

Given a dataset of 100k rows and 100 columns, how is it possible to use bigquery CORR() to find the correlation between the rows?

The schema is:

id:integer, feature1:float, feature2:float, ..., feature100:float

Edit: This is not a rolling window time series correlation problem. Each row is an observation of 100 features, and I'd like to use bigquery to find the top N similar observations for each row.

Solution

You want to find the correlation between each column and the other columns?

That would be something like this:

SELECT CORR(col1, col2), CORR(col1, col3), CORR(col1, col4),..., CORR(col99, col100)
FROM [mytable]

That might take a long time to write (unless you automate it). As an alternative, consider a different schema where everything lives in 3 columns. The transformation would run like this:

SELECT colname, value, rowid FROM
(SELECT 'col1' AS colname, col1 AS value, rowid FROM [mytable]),
(SELECT 'col2' AS colname, col2 AS value, rowid FROM [mytable]),
(SELECT 'col3' AS colname, col3 AS value, rowid FROM [mytable]),
...
(SELECT 'col100' AS colname, col100 AS value, rowid FROM [mytable])
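The elided middle of that transformation does not have to be typed by hand. As a minimal sketch, the Python below builds the query text for all 100 columns. The function name `build_unpivot_query` and its parameters are made up for illustration; the table name `[mytable]` and the `rowid` column are taken from the queries above. It only produces a string and does not talk to BigQuery.

```python
def build_unpivot_query(columns, table="[mytable]", id_col="rowid"):
    """Return legacy-SQL text that reshapes wide columns into
    (colname, value, rowid) rows. Hypothetical helper, not a BigQuery API."""
    subselects = [
        f"  (SELECT '{col}' AS colname, {col} AS value, {id_col} FROM {table})"
        for col in columns
    ]
    # In BigQuery legacy SQL, comma-separated subqueries in FROM
    # are combined like UNION ALL.
    return f"SELECT colname, value, {id_col} FROM\n" + ",\n".join(subselects)


# Column names col1..col100, matching the naming used in the answer.
columns = [f"col{i}" for i in range(1, 101)]
query = build_unpivot_query(columns)
print(query)
```

The printed text can then be pasted into the query editor and saved to a destination table such as `[my_new_table]`.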

With this schema you can run all the combined column correlations with a simpler query:

SELECT CORR(a.value, b.value) corr, a.colname, b.colname
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.rowid=b.rowid
WHERE a.colname>b.colname
GROUP BY a.colname, b.colname

(That's what I did on the article linked by @Tjorriemorrie - http://googlecloudplatform.blogspot.mx/2013/09/introducing-corr-to-google-bigquery.html)

Note that the first query might be more complex than this last one, but I suspect it will take less time to run, as no shuffling will be required.
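If you prefer the first, join-free form, its long select list can be generated as well. This is a sketch under the same assumptions as the queries above (columns named `col1` to `col100`, table `[mytable]`); the helper name `build_corr_query` and the `corr_<a>_<b>` aliases are invented here for readability.

```python
from itertools import combinations


def build_corr_query(columns, table="[mytable]"):
    """Return legacy-SQL text with one CORR() per unordered column pair.
    Hypothetical helper; the output aliases are an illustrative choice."""
    exprs = [
        f"  CORR({a}, {b}) AS corr_{a}_{b}"
        for a, b in combinations(columns, 2)
    ]
    return "SELECT\n" + ",\n".join(exprs) + f"\nFROM {table}"


columns = [f"col{i}" for i in range(1, 101)]
corr_query = build_corr_query(columns)
# 100 columns give 100 * 99 / 2 = 4950 pairs.
print(corr_query.count("CORR("))
```

The result is a single wide row with one correlation per column pair, which is why writing it manually is impractical.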

Since this question asks about rows, the initial transformation would be similar, but slightly different:

SELECT column, value, rowid FROM
  (SELECT 'c1' column, c1 AS value, rowid FROM [mytable]),
  (SELECT 'c2' column, c2 AS value, rowid FROM [mytable]),
  (SELECT 'c3' column, c3 AS value, rowid FROM [mytable]) 

Then the correlation between rows would be computed as in:

SELECT CORR(a.value, b.value), a.rowid, b.rowid
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.column=b.column
WHERE a.rowid < b.rowid
GROUP BY a.rowid, b.rowid
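To make clear what that row-wise query computes, here is a small in-memory illustration in plain Python, not something to run on 100k rows. It treats each row's features as one vector, computes the Pearson correlation for every pair of rows, and keeps the top N per row, which is what the question ultimately asks for. The names `pearson` and `top_n_similar` and the three sample rows are invented for the example. Keep in mind that with 100k rows there are 100000 * 99999 / 2, roughly 5 billion, row pairs, so the full result is large.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists.
    Assumes neither list is constant (otherwise this divides by zero)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def top_n_similar(rows, n):
    """rows maps rowid -> list of feature values.
    Returns rowid -> up to n (other_rowid, correlation) pairs, best first."""
    result = {}
    for rid, feats in rows.items():
        scored = [
            (other, pearson(feats, other_feats))
            for other, other_feats in rows.items()
            if other != rid
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[rid] = scored[:n]
    return result


# Toy data: row 2 is a scaled copy of row 1, row 3 runs the opposite way.
rows = {
    1: [1.0, 2.0, 3.0],
    2: [2.0, 4.0, 6.0],
    3: [3.0, 2.0, 1.0],
}
top = top_n_similar(rows, 1)
print(top[1])  # the row most correlated with row 1
```

In this toy data, rows 1 and 2 are each other's nearest match because one is a scaled copy of the other.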
