Structuring BigQuery with large array of data as input

Views: 96 | Published: 2018/5/7 17:40:25 | Tag: google-bigquery


Question


I am interested in finding the words most frequently associated with a particular word, using BigQuery's trigram data. For example, in Google's Ngram Viewer I can enter great *, which returns the words that most frequently follow "great", such as "great deal", then "great and" and "great many". My goal is to do this for a large list of words, so that I can query word1 * all the way to word10000 *.

Following the discussion in this SO answer, I was led to BigQuery's publicly available trigram data. What I can't figure out at this point is how to use this service with an array of words as input, either as a file input or by pasting them in. Any assistance is much appreciated. Thanks.

Solution

Here is how you would find the 10 most frequent words that follow "great" (the bracketed table name is BigQuery legacy SQL syntax):

SELECT second, SUM(cell.page_count) total
FROM [publicdata:samples.trigrams]
WHERE first = "great"
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10

This results in

second     total     
------------------
deal       3048832   
and        1689911   
,          1576341   
a          1019511   
number     984993    
many       875974    
importance 805215    
part       739409    
.          700694    
as         628978

If you want to limit the results to specific years, say between 1820 and 1840, you can also filter on cell.value (which is the year of publication):

SELECT second, SUM(cell.page_count) total
FROM [publicdata:samples.trigrams]
WHERE first = "great" AND cell.value BETWEEN '1820' AND '1840'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
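
The queries above handle a single word, while the question asks about a list of up to 10,000. One possible way to extend them is sketched below in Python: generate a legacy SQL query that filters with IN over a batch of words and groups by both first and second, then trim each word's followers to the top 10 on the client side. None of this comes from the original answer. The helper names (quote_word, build_query, chunked, top_followers) are hypothetical, the generated SQL has not been run against BigQuery here, and submitting it (through the web UI or a client library) is left out.

```python
from collections import defaultdict


def quote_word(word):
    """Wrap a word in double quotes for use as a SQL string literal."""
    # Assumption: input words are plain tokens. Anything that would need
    # escaping is rejected instead of guessing at escape rules.
    if '"' in word or "\\" in word:
        raise ValueError(f"unsupported character in word: {word!r}")
    return f'"{word}"'


def build_query(words):
    """Build one legacy SQL query covering every word in `words`."""
    word_list = ", ".join(quote_word(w) for w in words)
    return (
        "SELECT first, second, SUM(cell.page_count) total\n"
        "FROM [publicdata:samples.trigrams]\n"
        f"WHERE first IN ({word_list})\n"
        "GROUP BY first, second\n"
        "ORDER BY first, total DESC"
    )


def chunked(words, size):
    """Yield successive slices of `words` with at most `size` items each."""
    for start in range(0, len(words), size):
        yield words[start:start + size]


def top_followers(rows, limit=10):
    """Keep the `limit` highest-total followers for each first word.

    `rows` is an iterable of (first, second, total) tuples in any order,
    for example the rows returned by the query from build_query().
    """
    by_first = defaultdict(list)
    for first, second, total in rows:
        by_first[first].append((second, total))
    return {
        first: sorted(pairs, key=lambda p: p[1], reverse=True)[:limit]
        for first, pairs in by_first.items()
    }
```

For a long list, the words could be split with something like chunked(words, 500) and one query built per batch. The batch size of 500 is an arbitrary illustration, not a documented limit. Trimming to the top 10 in Python keeps the SQL close to the original answer; doing the ranking inside the query would also be possible but is not shown here.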
