Count the frequency of words in a text variable with Hive

Question

I have a variable in which every row is a sentence. Example:

 -Row1 "Hey, how are you?"
 -Row2 "Hey, Who is there?"

I want the output to be the count grouped by word.

Example:

Hey 2
How 1
are 1
...

I have been trying the split function, but I am a bit stuck. Any thoughts on this?

Thanks!

Recommended answer

This is possible in Hive. Split on non-alphabetic characters, use lateral view + explode to turn each word into its own row, then count the words:

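-- Sample data: one sentence per row
with your_data as (
  select stack(2,
    'Hey, how are you?',
    'Hey, Who is there?'
  ) as initial_string
)

select w.word, count(*) cnt
from
(
  -- Lower-case the text, then split on runs of non-letter characters
  select split(lower(initial_string),'[^a-zA-Z]+') words from your_data
) s lateral view explode(words) w as word
where w.word!=''   -- drop empty tokens left over from the split
group by w.word;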

Result:

word    cnt
are     1
hey     2
how     1
is      1
there   1
who     1
you     1
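
The split, explode, and group-by steps can be checked outside Hive with a small Python sketch. This only illustrates the same logic and is not Hive code; the function name word_counts is hypothetical.

```python
import re
from collections import Counter

def word_counts(rows):
    """Count words across rows, mirroring split + explode + group by."""
    counts = Counter()
    for row in rows:
        # Same pattern as the Hive query: split on runs of non-letters.
        words = re.split(r"[^a-zA-Z]+", row.lower())
        # re.split leaves an empty string after the trailing "?";
        # the Hive query guards against empty tokens with word != ''.
        counts.update(w for w in words if w != "")
    return counts

rows = ["Hey, how are you?", "Hey, Who is there?"]
for word, cnt in sorted(word_counts(rows).items()):
    print(word, cnt)
```

Sorting the items alphabetically makes the printed rows easy to compare with the Hive result listed above.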

Another method uses the sentences function, which returns an array of tokenized sentences (an array of arrays of words):

-- Sample data: one sentence per row
with your_data as (
  select stack(2,
    'Hey, how are you?',
    'Hey, Who is there?'
  ) as initial_string
)

select w.word, count(*) cnt
from
(
  -- sentences() returns an array of sentences, each an array of words
  select sentences(lower(initial_string)) sentences from your_data
) d lateral view explode(sentences) s as sentence   -- one row per sentence
    lateral view explode(s.sentence) w as word      -- one row per word
group by w.word;

Result:

word    cnt
are     1
hey     2
how     1
is      1
there   1
who     1
you     1

The sentences(string str, string lang, string locale) function tokenizes a string of natural-language text into words and sentences: each sentence is broken at the appropriate sentence boundary and returned as an array of words. The lang and locale arguments are optional. For example, sentences('Hello there! How are you?') returns [["Hello", "there"], ["How", "are", "you"]].
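
The two-level explode over an array of arrays can also be sketched in Python. The helper rough_sentences below is a hypothetical, much simplified stand-in for Hive's sentences function: it assumes sentences end at '.', '!' or '?' and that words consist of ASCII letters only, whereas the real function is language- and locale-aware.

```python
import re
from collections import Counter

def rough_sentences(text):
    """Rough stand-in: return a list of sentences, each a list of words."""
    # Assumption: sentences end at '.', '!' or '?'.
    parts = re.split(r"[.!?]+", text)
    tokenized = [re.findall(r"[A-Za-z]+", part) for part in parts]
    # Drop empty sentences such as the one after the final '?'.
    return [words for words in tokenized if words]

rows = ["Hey, how are you?", "Hey, Who is there?"]
counts = Counter(
    word
    for row in rows                                # one input string per row
    for sentence in rough_sentences(row.lower())   # first explode: sentences
    for word in sentence                           # second explode: words
)
for word, cnt in sorted(counts.items()):
    print(word, cnt)
```

Because the tokenizer already discards punctuation, no filter on empty strings is needed here, which matches the second Hive query having no where clause.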
