Count the frequency of words in a text variable with Hive
Question
I have a variable in which every row is a sentence. Example:
-Row1 "Hey, how are you?"
-Row2 "Hey, Who is there?"
I want the output to be a count grouped by word.
Example:
Hey 2
How 1
are 1
...
I am using the split function but I am a bit stuck. Any thoughts on this?
Thanks!
Recommended answer
This is possible in Hive. Split the text on non-alphabetic characters, use lateral view with explode to turn each word into its own row, then count the words:
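with your_data as (
  -- sample data: two rows, one sentence each
  select stack(2,
    'Hey, how are you?',
    'Hey, Who is there?'
  ) as initial_string
)
select w.word, count(*) cnt
from
(
  -- lowercase the text, then split on runs of non-alphabetic characters
  select split(lower(initial_string), '[^a-zA-Z]+') words from your_data
) s
lateral view explode(words) w as word  -- one row per array element
where w.word != ''  -- drop empty strings produced by the split
group by w.word;
Result: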
word cnt
are 1
hey 2
how 1
is 1
there 1
who 1
you 1
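To see why the query above filters on w.word!='', it can help to look at the array that split produces for a single sentence. The following is only a minimal sketch: the literal sentence and the column alias words are chosen for illustration, and, like the sample query above, it relies on select without a from clause.

```sql
-- Inspect the intermediate array for one sample sentence.
-- The sentence ends with '?', which matches the delimiter pattern
-- '[^a-zA-Z]+', so an empty string is expected as the last array element.
select split(lower('Hey, how are you?'), '[^a-zA-Z]+') as words;
```

If the returned array does end with an empty element, exploding it would add a blank "word" to the counts, which is what the where clause guards against.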
Another method uses the sentences function, which returns an array of tokenized sentences (an array of arrays of words):
with your_data as (
  -- sample data: two rows, one sentence each
  select stack(2,
    'Hey, how are you?',
    'Hey, Who is there?'
  ) as initial_string
)
select w.word, count(*) cnt
from
(
  -- sentences() returns one array of words per sentence
  select sentences(lower(initial_string)) sentences from your_data
) d
lateral view explode(sentences) s as sentence  -- one row per sentence
lateral view explode(s.sentence) w as word     -- one row per word
group by w.word;
Result:
word cnt
are 1
hey 2
how 1
is 1
there 1
who 1
you 1
The sentences(string str, string lang, string locale) function tokenizes a string of natural-language text into words and sentences. Each sentence is broken at the appropriate sentence boundary and returned as an array of words. The lang and locale arguments are optional. For example, sentences('Hello there! How are you?') returns [["Hello", "there"], ["How", "are", "you"]].
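The nested result described above can be checked and flattened with a short query. This is a sketch under assumptions rather than a definitive recipe: it uses only the one-argument form of sentences, the aliases tokenized, sentence, and word are illustrative names, and it again assumes select without a from clause is available.

```sql
-- Look at the nested array<array<string>> returned for the example text.
select sentences('Hello there! How are you?') as tokenized;

-- Flatten it: the first explode yields one row per sentence,
-- the second yields one row per word within that sentence.
select w.word
from (
  select sentences('Hello there! How are you?') as tokenized
) d
lateral view explode(d.tokenized) s as sentence
lateral view explode(s.sentence) w as word;
```

The second query should yield one row per word, which is the shape the group by in the answer operates on.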