How to generate all n-grams in Hive


Question

I'd like to create a list of n-grams using HiveQL. My idea was to use a regex with a lookahead together with the split function, but this does not work:

select split('This is my sentence', '(\\S+) +(?=(\\S+))');

The input is a column of the following form:

|sentence                 |
|-------------------------|
|This is my sentence      |
|This is another sentence |

The output should be:

["This is","is my","my sentence"]
["This is","is another","another sentence"]

Hive has a built-in n-grams UDF, but that function directly calculates the frequency of the n-grams. I'd like to get a list of all the n-grams instead.

Thanks a lot!

Recommended answer

This may not be the most optimal solution, but it works. Split the sentence on a delimiter (in my example, one or more spaces or commas), explode the resulting words together with their positions, and self-join the words on adjacent positions to get the n-grams. Then assemble the array of n-grams using collect_set (if you need unique n-grams) or collect_list:

with src as 
(
select source_data.sentence, words.pos, words.word
  from
      (--Replace this subquery (source_data) with your table
       select stack (2,
                     'This is my sentence', 
                     'This is another sentence'
                     ) as sentence
      ) source_data 
        --split and explode words
        lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)

select s1.sentence, collect_set(concat_ws(' ',s1.word, s2.word)) as ngrams 
      from src s1 
           inner join src s2 on s1.sentence=s2.sentence and s1.pos+1=s2.pos              
group by s1.sentence;

Result:

OK
This is another sentence        ["This is","is another","another sentence"]
This is my sentence             ["This is","is my","my sentence"]
Time taken: 67.832 seconds, Fetched: 2 row(s)
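
As a possible variation on the query above, the self-join can be replaced with the lead() window function, which reads the next word of the same sentence directly. This is only a sketch under a few assumptions: it uses the same inline stack() sample data as above, the CTE names words and pairs and the column next_word are made up for illustration, and it needs a Hive version that supports windowing functions, posexplode and collect_list. It has not been benchmarked against the join version, and the order of elements inside the collected array is not guaranteed.

```sql
with words as
(
select t.sentence, w.pos, w.word
  from
      (--Replace this subquery (t) with your table
       select stack (2,
                     'This is my sentence',
                     'This is another sentence'
                     ) as sentence
      ) t
        --split and explode words, keeping each word's position
        lateral view posexplode(split(t.sentence, '[ ,]+')) w as pos, word
),
pairs as
(
--pair every word with the word that follows it in the same sentence
select sentence, pos, word,
       lead(word) over (partition by sentence order by pos) as next_word
  from words
)

select sentence, collect_list(concat_ws(' ', word, next_word)) as ngrams
  from pairs
 where next_word is not null --the last word of a sentence has no successor
 group by sentence;
```

For trigrams, the same idea should extend by adding a second column such as lead(word, 2) to pairs, concatenating all three words, and filtering out rows where that second column is null. Note that both this sketch and the join version group by the sentence text, so identical sentences occurring in several rows are treated as one; if that matters, partition and group by a row identifier instead.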
