How to generate all n-grams in Hive
Question
I'd like to create a list of n-grams using HiveQL. My idea was to use a regex with a lookahead together with the split function, but this does not work:
select split('This is my sentence', '(\S+) +(?=(\S+))');
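A quick Python sketch, using the `re` module as a stand-in for Hive's (Java-based) regex engine, illustrates why this approach fails: `split()` consumes the matched delimiter text, so the second word of each overlapping pair is discarded, whereas a match-based call can keep it via the zero-width lookahead.

```python
import re

sentence = "This is my sentence"

# split() behaves like Hive's split(): the matched text ("This ",
# "is ", "my ") is consumed, so only empty strings and the final
# word remain -- the overlapping words are lost.
print(re.split(r'\S+ +(?=\S+)', sentence))   # ['', '', '', 'sentence']

# A match-based approach keeps both words of each pair: the first is
# consumed, the second is captured inside the zero-width lookahead.
pairs = re.findall(r'(\S+)(?= +(\S+))', sentence)
print([' '.join(p) for p in pairs])   # ['This is', 'is my', 'my sentence']
```

Hive's split() only returns the pieces between matches, which is why no regex fed to it can produce overlapping n-grams.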
The input is a column of a table:
|sentence |
|-------------------------|
|This is my sentence |
|This is another sentence |
The output should be:
["This is","is my","my sentence"]
["This is","is another","another sentence"]
There is an n-grams UDF in Hive, but that function directly calculates the frequency of the n-grams; I'd like a list of all the n-grams instead.
Thank you very much!
Answer
This is maybe not the most optimal, but quite a workable solution. Split each sentence by a delimiter (in this example one or more spaces or commas), then explode and self-join to get the n-grams, and finally assemble the array of n-grams using collect_set (if you need unique n-grams) or collect_list:
with src as
(
select source_data.sentence, words.pos, words.word
from
(--Replace this subquery (source_data) with your table
select stack (2,
'This is my sentence',
'This is another sentence'
) as sentence
) source_data
--split and explode words
lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)
select s1.sentence, collect_set(concat_ws(' ',s1.word, s2.word)) as ngrams
from src s1
inner join src s2 on s1.sentence=s2.sentence and s1.pos+1=s2.pos
group by s1.sentence;
Result:
OK
This is another sentence ["This is","is another","another sentence"]
This is my sentence ["This is","is my","my sentence"]
Time taken: 67.832 seconds, Fetched: 2 row(s)
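To make the query plan concrete, here is a small Python sketch of the same steps (explode each sentence into (sentence, pos, word) rows like posexplode, self-join each word with its successor on pos+1, then collect the bigrams per sentence), using the same two rows as the stack() subquery above:

```python
import re
from collections import defaultdict

rows = ["This is my sentence", "This is another sentence"]

# posexplode(split(sentence, '[ ,]+')): one (sentence, pos, word)
# row per word
exploded = [(s, i, w) for s in rows
            for i, w in enumerate(re.split(r'[ ,]+', s))]

# self-join on s1.sentence = s2.sentence and s1.pos + 1 = s2.pos,
# then collect_list(concat_ws(' ', s1.word, s2.word)) per sentence
ngrams = defaultdict(list)
for s1, p1, w1 in exploded:
    for s2, p2, w2 in exploded:
        if s1 == s2 and p1 + 1 == p2:
            ngrams[s1].append(w1 + " " + w2)

for s in rows:
    print(s, ngrams[s])
# This is my sentence ['This is', 'is my', 'my sentence']
# This is another sentence ['This is', 'is another', 'another sentence']
```

For trigrams or longer n-grams, the same idea extends by joining the exploded rows once more per additional word (pos+1, pos+2, and so on).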