How to use BigQuery correlation based on many columns?

Views: 106 | Published: 2018/5/7 17:34:07 | Tag: google-bigquery


Problem description


Given a dataset of 100k rows and 100 columns, how can BigQuery's CORR() be used to find the correlation between the rows?

The schema is:

  id:integer, feature1:float, feature2:float, ..., feature100:float

Edit: This is not a rolling-window time series correlation problem. Each row is an observation of 100 features, and I'd like to use BigQuery to find the top N most similar observations for each row.
Solution
You want to find the correlation between each column and the other columns?

That would be something like this:
SELECT CORR(col1, col2), CORR(col1, col3), CORR(col1, col4),..., CORR(col99, col100) FROM [mytable]
That might take a long time to write (unless you automate it). As an alternative, consider a different schema where everything lives in 3 columns. The transformation would run like this:
SELECT colname, value, rowid FROM
(SELECT 'col1' AS colname, col1 AS value, rowid FROM [mytable]),
(SELECT 'col2' AS colname, col2 AS value, rowid FROM [mytable]),
(SELECT 'col3' AS colname, col3 AS value, rowid FROM [mytable]),
...
(SELECT 'col100' AS colname, col100 AS value, rowid FROM [mytable])
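Writing 100 near-identical subselects by hand is error-prone, so the transformation can be generated. The following is a minimal Python sketch, not part of the original answer. The helper name `build_unpivot_query` is hypothetical, and it assumes each subselect should expose the column's value as `value` alongside `rowid`. It relies on the legacy SQL rule that a comma between subselects in FROM acts as UNION ALL.

```python
def build_unpivot_query(table, columns, rowid="rowid"):
    """Build a legacy SQL query that reshapes a wide table into
    (colname, value, rowid) rows, one subselect per column.

    `table`, `columns` and `rowid` are plain strings; nothing is
    escaped, so only pass trusted identifiers.
    """
    subselects = [
        "(SELECT '{c}' AS colname, {c} AS value, {r} FROM [{t}])".format(
            c=col, r=rowid, t=table
        )
        for col in columns
    ]
    # In BigQuery legacy SQL, commas between subselects mean UNION ALL.
    header = "SELECT colname, value, {r} FROM\n  ".format(r=rowid)
    return header + ",\n  ".join(subselects)


# Column names col1..col100 follow the naming used in the answer.
columns = ["col{}".format(i) for i in range(1, 101)]
query = build_unpivot_query("mytable", columns)
```

Printing `query` gives the text to paste into BigQuery; saving its result as `my_new_table` matches the table name used in the next step.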
With this schema you can run all the combined column correlations with a simpler query:
SELECT CORR(a.value, b.value) corr, a.colname, b.colname FROM [my_new_table] a JOIN EACH [my_new_table] b ON a.rowid=b.rowid WHERE a.colname>b.colname GROUP BY a.colname, b.colname
(That's what I did on the article linked by @Tjorriemorrie - http://googlecloudplatform.blogspot.mx/2013/09/introducing-corr-to-google-bigquery.html)

Note that the first query might be more complex than this last one, but I suspect it will take less time to run, as no shuffling will be required.
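If the first, wide query is preferred for its run time, its text can also be generated instead of typed. This is a hedged sketch rather than the answer's own method: the helper name `build_pairwise_corr_query` and the `corr_<a>_<b>` output aliases are assumptions added for illustration, and the resulting query has not been checked against BigQuery's query-size limits.

```python
from itertools import combinations


def build_pairwise_corr_query(table, columns):
    """Build one legacy SQL SELECT with a CORR() term for every
    unordered pair of columns."""
    terms = [
        # The alias is an illustrative choice so each result is labeled.
        "CORR({a}, {b}) AS corr_{a}_{b}".format(a=a, b=b)
        for a, b in combinations(columns, 2)
    ]
    return "SELECT\n  " + ",\n  ".join(terms) + "\nFROM [{t}]".format(t=table)


columns = ["col{}".format(i) for i in range(1, 101)]
query = build_pairwise_corr_query("mytable", columns)
pair_count = len(list(combinations(columns, 2)))  # 100 * 99 / 2 = 4950
```

With 100 columns the query contains 4950 CORR terms, which is why the long-format schema with a self-join is so much shorter to write.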

Since this question asks about rows, the initial transformation would be similar, but slightly different:
SELECT column, value, rowid FROM (SELECT 'c1' column, c1 AS value, rowid FROM [mytable]), (SELECT 'c2' column, c2 AS value, rowid FROM [mytable]), (SELECT 'c3' column, c3 AS value, rowid FROM [mytable])
Then the correlation between rows would be computed as in:
SELECT CORR(a.value, b.value), a.rowid, b.rowid FROM [my_new_table] a JOIN EACH [my_new_table] b ON a.column=b.column WHERE a.rowid < b.rowid GROUP BY a.rowid, b.rowid
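To make the row-wise query concrete, here is a small pure-Python sketch of what it computes on the long-format rows: pair up rows with `a.rowid < b.rowid`, match their values on `column`, and take the Pearson correlation (which is what CORR returns). The function names and the three-row sample are hypothetical, and `top_n_similar` is an assumed way to get the "top N similar observations" the question asks for; the original answer stops at the pairwise correlations.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists.
    Raises ZeroDivisionError if either list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def row_correlations(long_rows):
    """long_rows: iterable of (column, value, rowid) tuples.
    Returns {(rowid_a, rowid_b): correlation} with rowid_a < rowid_b."""
    by_row = defaultdict(dict)
    for column, value, rowid in long_rows:
        by_row[rowid][column] = value
    result = {}
    for a, b in combinations(sorted(by_row), 2):  # enforces a < b
        shared = sorted(set(by_row[a]) & set(by_row[b]))  # join on column
        result[(a, b)] = pearson(
            [by_row[a][c] for c in shared], [by_row[b][c] for c in shared]
        )
    return result


def top_n_similar(corrs, rowid, n):
    """The n rows most correlated with `rowid`, best first."""
    scored = [
        (b if a == rowid else a, r)
        for (a, b), r in corrs.items()
        if rowid in (a, b)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up sample: row 2 is a scaled copy of row 1, row 3 is reversed.
sample = [
    ("c1", 1.0, 1), ("c2", 2.0, 1), ("c3", 3.0, 1),
    ("c1", 2.0, 2), ("c2", 4.0, 2), ("c3", 6.0, 2),
    ("c1", 3.0, 3), ("c2", 2.0, 3), ("c3", 1.0, 3),
]
corrs = row_correlations(sample)
nearest = top_n_similar(corrs, rowid=1, n=1)
```

In the sample, row 2 comes out as the closest match to row 1. The sketch also shows the cost of the approach: the number of row pairs grows quadratically, so 100k rows produce roughly 5 billion pairs in the self-join.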
