What's the fastest way to count the number of words in a string in Perl?

Question

I have a few functions that I run more than a million times on various texts, so small improvements in them translate into big gains overall. I've noticed that every function involving word counts takes drastically longer to run than everything else, so I want to try counting words a different way.

Basically, my function grabs a number of objects that have text associated with them, verifies that the text doesn't match certain patterns, and then counts the words in that text. A basic version of the function is:

my $num_words = 0;
for (my $i=$begin_pos; $i<=$end_pos; $i++) {
   my $text = $self->_getTextFromNode($i);
   #If it looks like a node full of bogus text, or just a number, remove it.
   if ($text =~ /^\s*\<.*\>\s*$/ && $begin_pos == $end_pos) { return 0; }
   if ($text =~ /^\s*(?:Page\s*\d+)|http/i && $begin_pos == $end_pos) { return 0; }
   if ($text =~ /^\s*\d+\s*$/ && $begin_pos == $end_pos) { return 0; }
   my @text_words = split(/\s+/, $text);
   $num_words += scalar(@text_words);
   if ($num_words > 30) { return 30; }
}
return $num_words;
}

I do plenty of text comparisons like these elsewhere in my code, so I'm guessing the problem must be the word counting. Is there a faster way to do it than splitting on \s+? If so, what is it, and why is it faster (so I can understand what I'm doing wrong and apply that knowledge to similar problems later on)?

Recommended answer

Using a while loop with a regex is the fastest way I have found to count words:

use strict;
use warnings;
use feature 'say';   # needed for say() below

my $text = 'asdf asdf asdf asdf asdf';

sub count_array {
   my @text_words = split(/\s+/, $text);
   scalar(@text_words);
}

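sub count_list {
    # Assigning to the empty list forces list context; the count lands in $x.
    my $x = () = $text =~ /\S+/g;
}

sub count_while {
    my $num = 0;                      # start at 0 so an empty string yields 0, not undef
    $num++ while $text =~ /\S+/g;     # scalar-context /g steps through one match at a time
    $num
}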

say count_array; # 5
say count_list;  # 5
say count_while; # 5

use Benchmark 'cmpthese';

cmpthese -2 => {
    array => \&count_array,
    list  => \&count_list,
    while => \&count_while,
}

#          Rate  list array while
# list  303674/s    --  -22%  -55%
# array 389212/s   28%    --  -42%
# while 675295/s  122%   74%    --
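
One caveat when comparing these approaches: they agree on the sample string above, but not on every input. Splitting on /\s+/ produces a leading empty field when the text starts with whitespace, so it counts one extra "word", while matching /\S+/ does not. The sketch below illustrates this; the function names are made up for the example, and the variant using split ' ' is an addition that is not part of the benchmark above.

```perl
use strict;
use warnings;

# Count words by splitting on runs of whitespace (same idea as count_array).
sub count_split_regex {
    my ($text) = @_;
    my @words = split /\s+/, $text;
    return scalar @words;
}

# Count words with the special ' ' pattern, which skips leading whitespace.
sub count_split_space {
    my ($text) = @_;
    my @words = split ' ', $text;
    return scalar @words;
}

# Count words by matching runs of non-whitespace (same idea as count_while).
sub count_matches {
    my ($text) = @_;
    my $num = 0;
    $num++ while $text =~ /\S+/g;
    return $num;
}

my $padded = '  asdf asdf asdf';
printf "%d %d %d\n",
    count_split_regex($padded),   # 4: the leading empty field is counted
    count_split_space($padded),   # 3
    count_matches($padded);       # 3
```

If the texts can begin with whitespace, switching from split(/\s+/, ...) to the while loop may therefore change the counts slightly as well as the speed.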

The while loop is faster because no memory has to be allocated for each word found. The regex is also evaluated in boolean (scalar) context, so it does not need to extract the actual matches from the string.
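
Because the while loop visits one match at a time, it can also stop early, which may help with the 30-word cap in the question. The following is only a sketch under assumptions: count_words_capped and its $limit parameter are hypothetical names, and it handles a single string, not the loop over nodes in the original function.

```perl
use strict;
use warnings;

# Hypothetical helper: count the words in $text, but stop scanning
# as soon as $limit words have been seen.
sub count_words_capped {
    my ($text, $limit) = @_;
    my $num = 0;
    while ($text =~ /\S+/g) {
        # Return early instead of scanning the rest of the string.
        return $limit if ++$num >= $limit;
    }
    return $num;
}

print count_words_capped('a b c d e', 3), "\n";   # 3
print count_words_capped('a b', 30), "\n";        # 2
```

With split, the whole string is broken into fields before the count can be compared against the cap; here the scan ends at the word that reaches the limit. Whether that is a measurable win depends on how long the texts are, so it is worth benchmarking on real data.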
