How to use bigquery correlation based on many columns?
Given a dataset of 100k rows and 100 columns, how is it possible to use bigquery CORR() to find the correlation between the rows?
The schema is:
id:integer, feature1:float, feature2:float, ..., feature100:float
Edit: This is not a rolling window time series correlation problem. Each row is an observation of 100 features, and I'd like to use bigquery to find the top N similar observations for each row.
You want to find the correlation between each column and the other columns?
That would be something like this:
SELECT CORR(col1, col2), CORR(col1, col3), CORR(col1, col4),..., CORR(col99, col100)
FROM [mytable]
That might take a long time to write (unless you automate it). As an alternative, consider a different schema where everything lives in 3 columns. The transformation would run like this:
SELECT colname, value, rowid FROM
(SELECT 'col1' AS colname, col1 AS value, rowid FROM [mytable]),
(SELECT 'col2' AS colname, col2 AS value, rowid FROM [mytable]),
(SELECT 'col3' AS colname, col3 AS value, rowid FROM [mytable]),
...
(SELECT 'col100' AS colname, col100 AS value, rowid FROM [mytable])
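Writing one subquery per column is repetitive, so the transformation is a natural candidate for generation. The following is a minimal Python sketch that builds the query text. The helper name `build_unpivot_query` is hypothetical, and the column names `col1` to `col100`, the table `[mytable]`, and the `rowid` column are assumptions carried over from the example above.

```python
def build_unpivot_query(columns, table="[mytable]", rowid="rowid"):
    """Build a legacy SQL query that reshapes wide columns into
    (colname, value, rowid) rows. Hypothetical helper for illustration."""
    subqueries = [
        f"  (SELECT '{col}' AS colname, {col} AS value, {rowid} FROM {table})"
        for col in columns
    ]
    # In BigQuery legacy SQL, a comma between subqueries in FROM acts as UNION ALL.
    return f"SELECT colname, value, {rowid} FROM\n" + ",\n".join(subqueries)


# Assumed column names: col1 .. col100.
columns = [f"col{i}" for i in range(1, 101)]
query = build_unpivot_query(columns)

# Show only the first few lines; the full text has one line per column.
print("\n".join(query.splitlines()[:3]))
```

The generated text can be pasted into the query editor and its result saved as `[my_new_table]`.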
With this schema, you can compute the correlation for every pair of columns with a simpler query:
SELECT CORR(a.value, b.value) corr, a.colname, b.colname
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.rowid=b.rowid
WHERE a.colname>b.colname
GROUP BY a.colname, b.colname
(That's what I did on the article linked by @Tjorriemorrie - http://googlecloudplatform.blogspot.mx/2013/09/introducing-corr-to-google-bigquery.html)
Note that the first query might be more complex than this last one, but I suspect it will take less time to run, as no shuffling will be required.
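If the wide query is preferred for speed, its length is the main obstacle: 100 columns give 100 * 99 / 2 = 4,950 CORR terms. A short Python sketch can generate them. The helper name `build_wide_corr_query` and the `corr_<a>_<b>` aliases are hypothetical, the column names are assumed to be `col1` to `col100`, and whether a query of that length stays within BigQuery's limits would need to be checked.

```python
from itertools import combinations


def build_wide_corr_query(columns, table="[mytable]"):
    """Build one SELECT with a CORR term for every unordered pair of columns.
    Hypothetical helper for illustration."""
    terms = [
        f"  CORR({a}, {b}) AS corr_{a}_{b}"
        for a, b in combinations(columns, 2)
    ]
    return "SELECT\n" + ",\n".join(terms) + f"\nFROM {table}"


# Assumed column names: col1 .. col100.
columns = [f"col{i}" for i in range(1, 101)]
query = build_wide_corr_query(columns)

# Count the generated CORR terms (one per pair of columns).
print(query.count("CORR("))
```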
Since this question asks about rows, the initial transformation would be similar, but slightly different:
SELECT column, value, rowid FROM
(SELECT 'c1' column, c1 AS value, rowid FROM [mytable]),
(SELECT 'c2' column, c2 AS value, rowid FROM [mytable]),
(SELECT 'c3' column, c3 AS value, rowid FROM [mytable])
Then the correlation between rows would be computed like this:
SELECT CORR(a.value, b.value), a.rowid, b.rowid
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.column=b.column
WHERE a.rowid < b.rowid
GROUP BY a.rowid, b.rowid
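To make clear what this query computes, and how its output relates to the "top N similar observations" in the question, here is a small pure-Python sketch on made-up data. The function names and the four sample rows are hypothetical. Note that the query emits each pair once (`a.rowid < b.rowid`), so ranking neighbours per row means reading every pair in both directions, and that 100k rows produce roughly 5 billion pairs.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.
    Assumes neither sequence is constant (non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def top_n_similar(rows, n):
    """rows maps rowid -> list of feature values.
    Returns rowid -> the n most correlated other rows as (rowid, corr)."""
    result = {}
    for rid, features in rows.items():
        scored = [
            (other, pearson(features, other_features))
            for other, other_features in rows.items()
            if other != rid
        ]
        # Highest correlation first.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[rid] = scored[:n]
    return result


# Made-up sample: 4 observations with 4 features each.
rows = {
    1: [1.0, 2.0, 3.0, 4.0],
    2: [2.0, 4.0, 6.0, 8.5],
    3: [4.0, 3.0, 2.0, 1.0],
    4: [1.0, 3.0, 2.0, 4.0],
}

for rid, neighbours in top_n_similar(rows, 2).items():
    print(rid, [(other, round(c, 3)) for other, c in neighbours])
```

This brute-force loop only illustrates the logic; at 100k rows the pairwise work belongs in BigQuery, with the ranking applied to the query's output.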