Structuring a BigQuery query with a large array of data as input
Problem description
I am interested in finding the words most frequently associated with a particular word, using BigQuery's trigram data. For example, in Google's Ngram Viewer I can enter great * and get the words that most frequently follow "great", such as "great deal", then "great and" and "great many". My goal is to do this for a large list of words, so that I can query word1 * all the way through word10000 *.
Following the discussion in this SO answer, I was led to BigQuery's publicly available trigram data. What I can't figure out at this point is how to use this service with an array of words as input, either as a file or by pasting them in. Any assistance is much appreciated. Thanks.
Here is how you would find the 10 most frequent words to follow "great":
SELECT second, SUM(cell.page_count) total
FROM [publicdata:samples.trigrams]
WHERE first = "great"
group by 1
order by 2 desc
limit 10
This results in
second total
------------------
deal 3048832
and 1689911
, 1576341
a 1019511
number 984993
many 875974
importance 805215
part 739409
. 700694
as 628978
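For a whole list of first words, one option is to generate the query text from the list and widen the filter to an IN clause. The following is a minimal sketch, not a tested production approach: the helper names (quote, build_query, batches) are made up for illustration, the escaping rule for string literals is an assumption, and the batch size is arbitrary. Batching is included because query text is subject to a size limit, so 10,000 words may not fit comfortably in one statement.

```python
def quote(word):
    """Wrap a word in double quotes for use as a SQL string literal.

    Assumption: backslash escaping is accepted inside the literal.
    """
    escaped = word.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_query(words):
    """Build one query text covering every first word in `words`."""
    word_list = ", ".join(quote(w) for w in words)
    return (
        "SELECT first, second, SUM(cell.page_count) total\n"
        "FROM [publicdata:samples.trigrams]\n"
        f"WHERE first IN ({word_list})\n"
        "GROUP BY 1, 2\n"
        "ORDER BY 1, 3 DESC"
    )


def batches(words, size):
    """Yield consecutive slices of `words`, each with at most `size` items."""
    for start in range(0, len(words), size):
        yield words[start:start + size]


if __name__ == "__main__":
    # Placeholder input; in practice the list could be read from a file,
    # one word per line.
    words = ["great", "small", "large"]
    for batch in batches(words, 2):
        print(build_query(batch))
        print()
```

Each generated statement can then be pasted into the query editor or submitted through whichever client is in use. Note that the limit 10 from the single-word query is dropped here, because it would cap the combined result, not the result per word.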
If you want to limit the results to specific years, say between 1820 and 1840, you can also restrict on cell.value (which is the year of publication):
SELECT second, SUM(cell.page_count) total
FROM [publicdata:samples.trigrams]
WHERE first = "great" and cell.value between '1820' and '1840'
group by 1
order by 2 desc
limit 10
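Both queries above rely on limit 10, which works because they filter on a single first word. If the filter is widened to several first words, as the question intends, limit 10 would cap the whole result instead of each word's result. One way around this, sketched here under the assumption that the result rows have been downloaded as (first, second, total) tuples, is to take the top entries per word on the client side. The function name top_n_per_word is hypothetical, and the rows for "word2" are invented placeholders, not real counts.

```python
from collections import defaultdict


def top_n_per_word(rows, n=10):
    """Group (first, second, total) rows by `first`, keep the n largest totals."""
    grouped = defaultdict(list)
    for first, second, total in rows:
        grouped[first].append((second, total))
    return {
        first: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]
        for first, pairs in grouped.items()
    }


if __name__ == "__main__":
    rows = [
        # Counts for "great" are taken from the result table above.
        ("great", "and", 1689911),
        ("great", "deal", 3048832),
        ("great", "many", 875974),
        # Invented placeholder rows for a second word.
        ("word2", "x", 1),
        ("word2", "y", 2),
    ]
    for first, followers in top_n_per_word(rows, n=2).items():
        print(first, followers)
```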