Java: Counting the number of occurrences of a word in a string


Problem description

I have a large text file I am reading from, and I need to find out how many times certain words come up. For example, the word "the". I'm doing this line by line; each line is a string.

I need to make sure that I only count legitimate occurrences of "the"; the "the" in "other" should not count. This means I know I need to use regular expressions in some way. What I have tried so far is this:

numSpace += line.split("[^a-z]the[^a-z]").length;  

I realize the regular expression may not be correct at the moment, but I also tried without it, just looking for occurrences of the word "the", and I get wrong numbers that way too. I was under the impression that this would split the string into an array, and that the number of times the string was split would be the number of times the word occurs in it. I would be grateful for any ideas.

Update: Given some ideas, I've come up with this:

numThe += line.split("[^a-zA-Z][Tt]he[^a-zA-Z]", -1).length - 1;

I'm still getting some strange numbers, though. I was able to get an accurate general count (without the regular expression); now my issue is with the regexp.

Recommended answer

Using split to count isn't the most efficient approach, but if you insist on doing that, the proper way is this:

haystack.split(needle, -1).length - 1

If you don't set limit to -1, split defaults to a limit of 0, which removes trailing empty strings and therefore messes up your count.

From the API:

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. [...] If n is zero then [...] trailing empty strings will be discarded.

You also need to subtract 1 from the length of the array, because N occurrences of the delimiter split the string into N+1 parts.
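To make the effect of the limit argument concrete, here is a minimal sketch that counts commas in a short string with both limits. The class and method names (SplitCountDemo, countWithLimit) are made up for illustration and are not part of the original answer.

```java
class SplitCountDemo {
    // Hypothetical helper: counts matches of regex in haystack via split.
    // N matches yield N+1 parts, hence the "- 1".
    static int countWithLimit(String haystack, String regex, int limit) {
        return haystack.split(regex, limit).length - 1;
    }

    public static void main(String[] args) {
        String line = "a,b,,"; // three commas, two of them trailing

        // limit 0 (the default): trailing empty strings are discarded,
        // so the array is ["a", "b"] and the count comes out too low.
        System.out.println(countWithLimit(line, ",", 0));  // prints 1

        // limit -1: trailing empty strings are kept,
        // so the array is ["a", "b", "", ""].
        System.out.println(countWithLimit(line, ",", -1)); // prints 3
    }
}
```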

As for the regex itself (i.e. the needle), you can put \b word boundary anchors around the word. If you allow the word to contain metacharacters (e.g. counting occurrences of "$US"), you may want to Pattern.quote it.
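As a rough sketch of how the two pieces fit together, the helper below wraps a quoted word in \b anchors and counts with split. The names WordBoundaryDemo, wholeWordRegex and count are illustrative assumptions, not from the original answer.

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Hypothetical helper: builds a whole-word regex.
    // Pattern.quote makes any metacharacters in word match literally.
    static String wholeWordRegex(String word) {
        return "\\b" + Pattern.quote(word) + "\\b";
    }

    static int count(String line, String word) {
        return line.split(wholeWordRegex(word), -1).length - 1;
    }

    public static void main(String[] args) {
        // The "the" inside "other" has no word boundary before it.
        System.out.println(count("the other the", "the")); // prints 2

        // Quoted, the dot matches only a literal dot, so "axb" is skipped.
        System.out.println(count("a.b and axb", "a.b"));   // prints 1
    }
}
```

One caveat: \b only matches between a word character (letter, digit or underscore) and a non-word character, so for a word that begins or ends with a non-word character, such as "$US", the anchors may not behave as intended and lookarounds may be a better fit.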


Regarding the update to the question: "I've come up with this:"

numThe += line.split("[^a-zA-Z][Tt]he[^a-zA-Z]", -1).length - 1;

"I'm still getting some strange numbers, though. I was able to get an accurate general count (without the regular expression); now my issue is with the regexp."

Now the issue is that you're not counting a [Tt]he that appears as the first or last word, because the regex says it has to be preceded and followed by some character that matches [^a-zA-Z] (that is, your match must be of length 5!). You're not allowing for the case where there isn't a character at all!

You can try something like this instead:

"(^|[^a-zA-Z])[Tt]he([^a-zA-Z]|$)"

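This isn't the most concise solution, but it works for a word at the start or end of the line. Note that each match still consumes the neighboring characters, so two occurrences separated by a single character (as in "the the") are counted only once.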

Something like this (using negative lookarounds) also works:

"(?<![a-zA-Z])[Tt]he(?![a-zA-Z])"

This has the benefit of matching just [Tt]he, without any extra characters around it as in your previous solution. This is relevant if you actually want to process the tokens returned by split, because the delimiter in this case isn't "stealing" anything from the tokens.
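The difference between consuming the surrounding characters and merely looking at them shows up when two occurrences share a separator. The sketch below compares the two patterns on one sample line; it assumes the lookahead is written as (?![a-zA-Z]), and the class name LookaroundDemo and the sample text are invented for illustration.

```java
class LookaroundDemo {
    // Consumes one non-letter (or the line start/end) on each side of the word.
    static final String CONSUMING = "(^|[^a-zA-Z])[Tt]he([^a-zA-Z]|$)";
    // Only asserts that no letter is adjacent; matches exactly three characters.
    static final String LOOKAROUND = "(?<![a-zA-Z])[Tt]he(?![a-zA-Z])";

    static int count(String line, String regex) {
        return line.split(regex, -1).length - 1;
    }

    public static void main(String[] args) {
        String line = "The other the the";

        // The space between the last two words is consumed by the match
        // before it, so the final "the" has nothing left to match against.
        System.out.println(count(line, CONSUMING));  // prints 2

        // Lookarounds consume nothing, so every whole-word match is found.
        System.out.println(count(line, LOOKAROUND)); // prints 3
    }
}
```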

Though using split to count is rather convenient, it isn't the most efficient approach (e.g. it does all kinds of work to return strings that you then discard). And since, as you said, you're counting line by line, the pattern would also have to be recompiled and thrown away for every line.

A more efficient way would be to use the same regex as before and do the usual Pattern.compile followed by while (matcher.find()) count++;
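A minimal sketch of that approach follows, assuming the lookaround regex with (?![a-zA-Z]) as the lookahead. The names MatcherCountDemo and countInLine, and the in-memory array standing in for the lines of the file, are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherCountDemo {
    // Compiled once and reused for every line, instead of once per split call.
    private static final Pattern THE =
            Pattern.compile("(?<![a-zA-Z])[Tt]he(?![a-zA-Z])");

    // Hypothetical helper: counts whole-word "the"/"The" in one line.
    static int countInLine(String line) {
        Matcher matcher = THE.matcher(line);
        int count = 0;
        while (matcher.find()) { // each find() advances to the next match
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for lines read from the file.
        String[] lines = { "The cat and the other cat", "the the" };
        int numThe = 0;
        for (String line : lines) {
            numThe += countInLine(line);
        }
        System.out.println(numThe); // prints 4
    }
}
```

No intermediate array of substrings is built here; the matcher only walks the line and reports where each match is.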
