Structuring BigQuery with large array of data as input


Problem Description

I am interested in obtaining the most frequent word associations with a particular word via BigQuery's ability to query the trigrams data. For example, when using Google's Ngram viewer, I could input great *, which will give me the most frequently associated words that follow "great", such as "great deal", then "great and" and "great many". My goal is to do this for a large list of words so that I could query with word1 * all the way to word10000 *.

Following the discussion on this SO answer, I was led to BigQuery's publicly available trigram data. What I can't seem to figure out at this point is how to use this service with an array of words as input, either as a file input or a way to paste them in. Any assistance is much appreciated - thanks.

Solution

Here is how you would find the 10 most frequent words that follow "great":

SELECT second, SUM(cell.page_count) total
FROM [publicdata:samples.trigrams]
WHERE first = "great"
group by 1
order by 2 desc
limit 10



This results in:

second     total
------------------
deal       3048832
and        1689911
,          1576341
a          1019511
number     984993
many       875974
importance 805215
part       739409
.          700694
as         628978
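The question's goal is to run this for a large list of words rather than just "great". One possibility (not part of the original answer, just a sketch against the same table) is to list several first words in an IN clause and group by both first and second; the three words below are placeholders for whatever list you actually have:

SELECT first, second, SUM(cell.page_count) total
FROM [publicdata:samples.trigrams]
WHERE first IN ("great", "small", "many")  -- example words only; substitute your own list
group by 1, 2
order by 1, 3 desc

This returns every follower for every listed word, with counts sorted within each word; keeping only the top N per word would need an extra ranking step or some post-processing outside BigQuery.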

If you want to limit to specific years - say between 1820 and 1840 - then you can also restrict on cell.value (which is the year of publication):

SELECT second, SUM(cell.page_count) total
FROM [publicdata:samples.trigrams]
WHERE first = "great" and cell.value between '1820' and '1840'
group by 1
order by 2 desc
limit 10
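For the full list of 10,000 words, instead of pasting them inline you could upload the word list (for example, a one-column CSV) as a table in your own dataset and use a legacy-SQL semi-join (IN with a subquery). The table [mydataset.wordlist] and its word column below are hypothetical names, just to show the shape of the query:

SELECT first, second, SUM(cell.page_count) total
FROM [publicdata:samples.trigrams]
WHERE first IN (SELECT word FROM [mydataset.wordlist])  -- hypothetical uploaded word-list table
group by 1, 2
order by 1, 3 desc

The same cell.value restriction shown above can be added to the WHERE clause if you only want particular publication years.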


