How to generate all n-grams in Hive


Question

I'd like to create a list of n-grams using HiveQL. My idea was to use the split function with a regex that contains a lookahead, but this does not work:

select split('This is my sentence', '(\S+) +(?=(\S+))');

The input is one column of a table:

|sentence                 |
|-------------------------|
|This is my sentence      |
|This is another sentence |

The output should be:

["This is","is my","my sentence"]
["This is","is another","another sentence"]

Hive has a built-in n-grams UDF, but that function directly calculates the frequency of the n-grams. I'd like to have a list of all the n-grams instead.
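For context, a call to that built-in looks roughly like the sketch below. It assumes the ngrams and sentences functions as described in the Hive documentation; the inline stack() subquery and the column alias top_bigrams are placeholders, not part of the original question. The function aggregates over all rows and returns the top-k n-grams with estimated frequencies, which is why it does not give a per-row list.

```sql
-- Sketch only: the inline stack() subquery stands in for a real table.
-- ngrams(tokenized_sentences, n, k) returns the top-k n-grams with
-- estimated frequencies across ALL rows, not one list per sentence.
select ngrams(sentences(sentence), 2, 10) as top_bigrams
from (
      select stack(2,
                   'This is my sentence',
                   'This is another sentence'
                  ) as sentence
     ) source_data;
```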

Many thanks!

Recommended answer

This may not be the most efficient solution, but it works. Split the sentence on a delimiter (in my example, one or more spaces or commas), explode the words together with their positions, and self-join on adjacent positions to get the n-grams. Then assemble the array of n-grams using collect_set (if you need unique n-grams) or collect_list:

with src as 
(
select source_data.sentence, words.pos, words.word
  from
      (--Replace this subquery (source_data) with your table
       select stack (2,
                     'This is my sentence', 
                     'This is another sentence'
                     ) as sentence
      ) source_data 
        --split into words and explode, keeping each word's position (pos)
        lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)

--pair each word with the next word of the same sentence, then collect per sentence
select s1.sentence, collect_set(concat_ws(' ',s1.word, s2.word)) as ngrams 
      from src s1 
           inner join src s2 on s1.sentence=s2.sentence and s1.pos+1=s2.pos
group by s1.sentence;

Result:

OK
This is another sentence        ["This is","is another","another sentence"]
This is my sentence             ["This is","is my","my sentence"]
Time taken: 67.832 seconds, Fetched: 2 row(s)
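
The same pattern should extend to longer n-grams by adding one more self-join per extra word. Below is a minimal sketch for trigrams, assuming the same src layout as above. It uses collect_list instead of collect_set so that repeated n-grams are kept, and the alias trigrams is only illustrative.

```sql
with src as
(
select source_data.sentence, words.pos, words.word
  from
      (--Replace this subquery (source_data) with your table
       select stack (2,
                     'This is my sentence',
                     'This is another sentence'
                     ) as sentence
      ) source_data
        --split into words and explode, keeping each word's position (pos)
        lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)

--one extra self-join per extra word: s2 is the next word, s3 the one after
select s1.sentence,
       collect_list(concat_ws(' ', s1.word, s2.word, s3.word)) as trigrams
      from src s1
           inner join src s2 on s1.sentence=s2.sentence and s1.pos+1=s2.pos
           inner join src s3 on s1.sentence=s3.sentence and s1.pos+2=s3.pos
group by s1.sentence;
```

Two caveats apply to this approach. Because the joins are inner joins, a sentence with fewer than three words produces no output row at all. As far as I know, neither collect_set nor collect_list is documented to guarantee the order of the collected elements, so do not rely on the n-grams appearing in sentence order.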
