String search vs regexp search


Question

To search a word in a group of words, say a paragraph or a web page,
would a string search or a regexp search be faster?

The string search would of course be,

if str.find(substr) != -1:
    domything()

And the regexp search assuming no case restriction would be,

strre = re.compile(substr, re.IGNORECASE)

m = strre.search(str)
if m:
    domything()

I was about to do a test, then I thought someone here might have
some data on this already.

Thanks folks!

-Anan

Recommended answer

py*******@Hotpop.com (Anand Pillai) wrote in
news:84*************************@posting.google.com:
> To search a word in a group of words, say a paragraph or a web page,
> would a string search or a regexp search be faster?
>
> The string search would of course be,
>
> if str.find(substr) != -1:
>     domything()
>
> And the regexp search assuming no case restriction would be,
>
> strre = re.compile(substr, re.IGNORECASE)
>
> m = strre.search(str)
> if m:
>     domything()
>
> I was about to do a test, then I thought someone here might have
> some data on this already.



Yes. The answer is "it all depends".

Things it depends on include:

Your two bits of code do different things, one is case sensitive, one
ignores case. Which did you need?

How long is the string you are searching? How long is the substring?

Is the substring the same every time, or are you always searching for
different strings? Can the substring contain characters with special
meanings for regular expressions?
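
The case-sensitivity and special-character questions can be made concrete with a short sketch. It assumes Python 3, and the helper names `contains_plain` and `contains_regex` are hypothetical, introduced only for illustration. It shows a like-for-like pair of searches and why an unescaped substring can silently fail as a pattern.

```python
import re

def contains_plain(text, word, ignore_case=False):
    """Substring test using str operations only (hypothetical helper)."""
    if ignore_case:
        # casefold() is a stronger lower() intended for caseless comparison
        return word.casefold() in text.casefold()
    return word in text

def contains_regex(text, word, ignore_case=False):
    """Same test with re; re.escape makes the word match literally."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(re.escape(word), text, flags) is not None

text = "Price: $5.00 (approx.)"
print(contains_plain(text, "$5.00"))                    # True
print(contains_regex(text, "$5.00"))                    # True
# Unescaped, "$" means end-of-string, so the pattern cannot match:
print(re.search("$5.00", text) is not None)             # False
print(contains_plain(text, "PRICE", ignore_case=True))  # True
print(contains_regex(text, "PRICE", ignore_case=True))  # True
```

With the substring escaped and the case handling matched on both sides, the two approaches answer the same question, which is a precondition for comparing their speed.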

The regular expression code has a startup penalty since it has to compile
the regular expression at least once; however, the actual searching may be
faster than the naive str.find. If the time spent doing the search is
sufficiently long compared with the time doing the compile, the regular
expression may win out.

Bottom line: write the code so it is as clean and maintainable as possible.
Only worry about optimising this if you have timed it and know that your
searches are a bottleneck.
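
One way to do that timing is sketched below with the standard `timeit` module, assuming Python 3. The synthetic `text`, the function names, and the repeat counts are all assumptions chosen for illustration; the numbers it prints depend on the machine and Python version, so none are quoted here. It measures the compile step separately from the search so the startup penalty can be seen on its own.

```python
import re
import timeit

# Synthetic haystack (an assumption): filler text with the word appearing rarely.
text = ("lorem ipsum dolor sit amet " * 2000 + "Miriam ") * 5
word = "Miriam"

def find_plain():
    return text.find(word) != -1

# Compile once, outside the timed function.
pattern = re.compile(re.escape(word))

def find_precompiled():
    return pattern.search(text) is not None

def compile_only():
    re.purge()  # clear re's internal cache so the compile is really redone
    return re.compile(re.escape(word))

CALLS = 200
for name, fn in [("str.find", find_plain),
                 ("compiled search", find_precompiled),
                 ("compile only", compile_only)]:
    # Take the best of several repeats to reduce noise from other processes.
    best = min(timeit.repeat(fn, number=CALLS, repeat=5))
    print(f"{name:16s} {best / CALLS * 1e6:10.2f} microseconds per call")
```

Whether the compile cost matters depends on how often the pattern changes and how much text each compiled pattern is run against.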

--
Duncan Booth du****@rcp.co.uk
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?


Sorry for being too brief!

I was talking about a function which counts the number
of occurrences using string and regexp methods.

I wrote the code for the regexp search as well as the string
search and tested it on a rather large file (800 KB) for
occurrences of a certain word. I find that the string search
is about 2 times faster than the one with regexp, excluding
the time for the re.compile() call. This is particularly
noticeable when the file becomes quite large and the word is
spread out.

I also thought the regexp would beat string search hands down, and I
am surprised at the result that it is the other way around.

Here is the code. Note that I am using the 'count' methods that
count the number of occurrences rather than the 'find' methods.

# Test to find out whether string search in some data
# is faster than regexp search.
# (Python 2 syntax, as run on Python 2.3.)

# Results: String search is much faster when it comes
# to many occurrences of the sub string.

import re
import time

def strsearch1(s, substr):
    # Count occurrences with the plain string method (case sensitive).
    t1 = time.time()
    print 'Count 1 =>', s.count(substr)
    t2 = time.time()
    print 'Searching using string, Time taken => ', t2 - t1

def strsearch2(s, substr):
    # Count occurrences with a regexp (case insensitive, unlike strsearch1).
    # The compile step is kept outside the timed region.
    r = re.compile(substr, re.IGNORECASE)
    t1 = time.time()
    print 'Count 2 =>', len(r.findall(s))
    t2 = time.time()
    print 'Searching using regexp, Time taken => ', t2 - t1

data = open("test.html", "r").read()
strsearch1(data, "Miriam")
strsearch2(data, "Miriam")

# Output here...

D:\Programming\python>python strsearch.py
Count 1 => 45
Searching using string, Time taken => 0.0599999427795
Count 2 => 45
Searching using regexp, Time taken => 0.110000014305

The test was done on a Windows 98 machine using Python 2.3, running
on 248 MB RAM, Intel 1.7 GHz chipset.
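
For readers who want to repeat the test today, here is a sketch of the same count comparison for Python 3. It is not the original script: the synthetic `data` string stands in for test.html (which is not available), the function names are hypothetical, and the regexp is compared against the string method like for like, once case sensitive and once case insensitive. The counts are deterministic; the timings are not, so none are quoted.

```python
import re
import time

def count_plain(s, substr):
    """Case-sensitive count with the str method."""
    return s.count(substr)

def count_plain_ignore_case(s, substr):
    """Case-insensitive count: lower both sides, then use the str method."""
    return s.lower().count(substr.lower())

def count_regex(s, substr, flags=0):
    """Count literal matches with re; re.escape guards special characters."""
    return len(re.compile(re.escape(substr), flags).findall(s))

def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Synthetic stand-in for the 800 KB file (an assumption): 45 blocks of filler,
# each ending with one capitalised and one lower-case occurrence of the word.
data = ("some filler text without the name " * 500 + "Miriam and miriam ") * 45

for label, fn, args in [
    ("str.count, case sensitive", count_plain, (data, "Miriam")),
    ("regexp, case sensitive", count_regex, (data, "Miriam")),
    ("str.count, ignore case", count_plain_ignore_case, (data, "Miriam")),
    ("regexp, ignore case", count_regex, (data, "Miriam", re.IGNORECASE)),
]:
    count, elapsed = timed(fn, *args)
    print(f"{label:28s} count={count:3d}  time={elapsed:.6f} s")
```

On this data the two case-sensitive variants count 45 and the two case-insensitive variants count 90, so each pair is doing equivalent work before its timings are compared.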

I was thinking of using regexp searches in my code, but this convinces
me to stick with the good old string search.

Thanks for the replies.

-Anand

Duncan Booth <du****@NOSPAMrcp.co.uk> wrote in message news:<Xn***************************@127.0.0.1>...
> The regular expression code has a startup penalty since it has to compile
> the regular expression at least once, however the actual searching may be
> faster than the naive str.find. If the time spent doing the search is
> sufficiently long compared with the time doing the compile, the regular
> expression may win out.





Both regular expression searching and string.find will do searching
one character at a time; given that, it seems impossible to me that
the hand-coded-in-C "naive" string.find could be slower than the
machine-translated-coded-in-Python regular expression search.
Compilation time only serves to further increase string.find's
advantage.

Jeremy

