Extracting and counting unique word frequency from a range

Problem description

I have a column where each row is a sentence. For example:

COLUMN1

R1: -Do you think they'll come, sir?

R2: -Oh they'll come, they'll come all right.

R3: Here. Stamp those and mail them.

R4: It's ringing.

R5: Would you walk Myron the other way?

From this range, I want to extract a list of unique words (COLUMN2), along with a count of how often each one appears in the range (COLUMN3).

The trick is to remove punctuation marks such as commas and periods.

So the desired result for the above would be:

COLUMN2    COLUMN3
Do          1
you         2
think       1
they'll     3
come        3
sir         1
Oh          1
all         1
right       1
Here        1
Stamp       1
those       1
and         1
mail        1
them        1
It's        1
ringing     1
Would       1
walk        1
Myron       1
the         1
other       1
way         1

I tried parsing each row with the SPLIT function, separating each word into its own cell, but I'm stuck on removing the punctuation and building the list of unique words (which I know will involve the UNIQUE function). I'm guessing the count will also involve the COUNTUNIQUE function.

Any guidance would be much appreciated!

Recommended answer

You can try something like this:

=query(ArrayFormula(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']" ,""),,50000)," "))), "Select Col1, Count(Col1) where Col1 <>'' group by Col1 label Count(Col1)''")

Change the range to suit.
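
For readers who want to follow the steps in that formula outside of Sheets, here is a rough Python sketch of the same pipeline: strip every character that is not a letter, whitespace, slash, or apostrophe, split the text into words, and count each word. It is only an illustration, not a replacement for the formula; the helper name `word_counts` and the `ROWS` list are assumptions made for this example.

```python
import re
from collections import Counter

# Sample sentences from the question (COLUMN1).
ROWS = [
    "-Do you think they'll come, sir?",
    "-Oh they'll come, they'll come all right.",
    "Here. Stamp those and mail them.",
    "It's ringing.",
    "Would you walk Myron the other way?",
]


def word_counts(rows):
    """Count words after removing punctuation (hypothetical helper)."""
    # Same character class as the formula: keep letters, whitespace, "/" and "'".
    cleaned = [re.sub(r"[^A-Za-z\s/']", "", row) for row in rows]
    # Join all rows into one string, then split on whitespace (empties are dropped).
    words = " ".join(cleaned).split()
    return Counter(words)


counts = word_counts(ROWS)
# Print the words alphabetically, one "word count" pair per line.
for word, n in sorted(counts.items()):
    print(word, n)
```

Like the formula, this sketch treats "Do" and "do" as different words, because nothing is lowercased before counting.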

If you want to exclude a list of words (for example, one stored in the range J1:J20), you can try

=ArrayFormula(query(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']" ,""),,50000)," ")), "Select Col1, Count(Col1) where not UPPER(Col1) matches '\b"&textjoin("|", 1, UPPER(J1:J20))&"\b'  group by Col1 order by Count(Col1) desc label Count(Col1)''"))
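
The exclusion step can be pictured with a similar Python sketch: words are compared in upper case against the exclusion list, so the match is case-insensitive, and blank entries in the list are ignored. The function name `counts_excluding` and the sample inputs are assumptions for illustration only; the sketch compares whole words, which is what the `\b...\b` pattern in the formula is aiming for.

```python
import re
from collections import Counter


def counts_excluding(rows, excluded):
    """Count words, skipping any word in `excluded`, case-insensitively (hypothetical helper)."""
    # Upper-case the exclusion list and drop blank entries (empty cells).
    skip = {w.upper() for w in excluded if w}
    words = []
    for row in rows:
        # Keep letters, whitespace, "/" and "'", as in the formula.
        cleaned = re.sub(r"[^A-Za-z\s/']", "", row)
        words.extend(w for w in cleaned.split() if w.upper() not in skip)
    return Counter(words)


rows = [
    "-Oh they'll come, they'll come all right.",
    "Here. Stamp those and mail them.",
]
result = counts_excluding(rows, ["oh", "AND", ""])
# most_common() lists the highest counts first, similar to "order by Count(Col1) desc".
print(result.most_common())
```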

Alternatively, you can add the list of exclusions to the regex pattern itself:

=query(ArrayFormula(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']|\b((?i)the|oh|or|and)\b" ,""),,50000)," "))), "Select Col1, Count(Col1) where Col1 <>''  group by Col1 order by Count(Col1) desc label Count(Col1)''")

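UPDATE (sentences in C11:C, exclusion list in G11:G; apostrophes are temporarily replaced with underscores and restored at the end):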

=ArrayFormula(substitute(query(transpose(split(query(regexreplace(substitute(C11:C, char(39), "_"), "[^A-Za-z\s_]" ,""),,50000)," ")), "Select Col1, Count(Col1) where not UPPER(Col1) matches '\b"&textjoin("|", 1, UPPER(substitute(G11:G,char(39),"_")))&"\b'  group by Col1 order by Count(Col1) desc label Count(Col1)''", 0), "_", char(39)))

Or, using a different approach:

=query(filter(regexreplace(transpose(split(query(regexreplace(C11:C, "[^A-Za-z\s'-]" ,""),,50000)," ")), "^-",), isna(match(upper(regexreplace(transpose(split(query(regexreplace(C11:C, "[^A-Za-z\s'-]" ,""),,50000)," ")), "^-",)), upper(filter(G11:G, len(G11:G))),0))), "Select Col1, count(Col1) group by Col1 order by count(Col1) desc label count(Col1)''", 0)
