Tokenizing a large buffer


Question




Hi,
I've got a large text string and a bunch of regular expression patterns
I need to find within it; in other words, I need to find all the
'tokens' inside it. Of course I could use the regexp engine. The thing
is, if I've got 500 different patterns, that means 500 searches on the
buffer. Does anybody have any general idea on how this could be sped
up? In other words, I am looking to do what the old 'LEX' tool in Unix
used to do: one pass over the string, finding all patterns and making
them tokens. What I need is not THAT complex, so I am wondering
whether .NET regexp could search for many patterns in a single pass.

And if the answer is no, is there any lex implementation in .NET? :)

Thanks
Jonathan

Recommended answer

On Tue, 18 Dec 2007 08:04:41 -0800, Jonathan Sion <yo**@nobhillsoft.com>
wrote:

> I've got a large text string and a bunch of regular expression patterns
> I need to find within it; in other words, I need to find all the
> 'tokens' inside it. Of course I could use the regexp engine. The thing
> is, if I've got 500 different patterns, that means 500 searches on the
> buffer. Does anybody have any general idea on how this could be sped
> up? In other words, I am looking to do what the old 'LEX' tool in Unix
> used to do: one pass over the string, finding all patterns and making
> them tokens. What I need is not THAT complex, so I am wondering
> whether .NET regexp could search for many patterns in a single pass.
>
> And if the answer is no, is there any lex implementation in .NET? :)




I don't know how lex was implemented. And I don't know whether a state
machine is the best way to solve the problem. But I do know that it's a
reasonable way to solve the problem, and that I wrote a simple
implementation a while ago and posted it here. You can see it in this post:
http://groups.google.com/group/micro...06f696d4500b77

I see some things in the implementation that I'd probably do differently
if I were doing it again today, but it ought to work. Or at least
something like it should.

For what it's worth, just how large is this "large text string"? And how
frequently do you need to do this? If this is something that your code
needs to do over and over on a frequent basis, optimizing the
implementation would be useful. But I'd guess that 500 searches on even a
100K string or so wouldn't take that long, just to do it once.

There's some value in using the brute-force method, as it will keep the
code a _lot_ simpler. I wouldn't worry about the performance unless you
have a good reason to believe it will be a problem.

Hope that helps.

Pete
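
For illustration only, the brute-force method Pete mentions might look
roughly like the sketch below. It is written in Python rather than .NET
purely to keep it short; the pattern names, the patterns themselves and
the function name find_tokens_brute_force are all made up for this
example, since the thread never lists the real patterns.

```python
import re

# Hypothetical token patterns; the thread does not say what the real ones are.
PATTERNS = {
    "NUMBER": r"\d+",
    "IDENT": r"[A-Za-z_]\w*",
    "OP": r"[+\-*/=]",
}

def find_tokens_brute_force(text, patterns=PATTERNS):
    """Search the whole buffer once per pattern, then order the hits by offset."""
    hits = []
    for name, pattern in patterns.items():
        for m in re.finditer(pattern, text):
            hits.append((m.start(), name, m.group()))
    hits.sort()  # sorts by start offset first
    return [(name, value) for _, name, value in hits]

print(find_tokens_brute_force("total = 42 + 7"))
```

With N patterns this makes N passes over the text, which is exactly the
cost the question worries about. Note also that this sketch does not
resolve overlapping hits from different patterns; it simply reports all
of them in offset order.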





If you generate the regexes as compiled (a constructor parameter) and
cache them in memory, and make sure that each regex is anchored to the
beginning of the search string, then I don't think looping through all
the regexes as you move through the text will perform badly at all.
Since the regexes have to match at the start of the text, they can
all fail very fast if they don't match.

In short, I would first do some testing to make sure you're not trying
to solve a problem that doesn't exist.

HTH,

Sam


On Tue, 18 Dec 2007 08:04:41 -0800 (PST), Jonathan Sion
<yo**@nobhillsoft.com> wrote:

> Hi,
> I've got a large text string and a bunch of regular expression patterns
> I need to find within it; in other words, I need to find all the
> 'tokens' inside it. Of course I could use the regexp engine. The thing
> is, if I've got 500 different patterns, that means 500 searches on the
> buffer. Does anybody have any general idea on how this could be sped
> up? In other words, I am looking to do what the old 'LEX' tool in Unix
> used to do: one pass over the string, finding all patterns and making
> them tokens. What I need is not THAT complex, so I am wondering
> whether .NET regexp could search for many patterns in a single pass.
>
> And if the answer is no, is there any lex implementation in .NET? :)
>
> Thanks
> Jonathan
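
Sam's suggestion (precompiled, cached regexes, each one anchored at the
current position and tried in turn as you move through the text) can be
sketched as follows. This is a minimal Python illustration, not .NET
code: in Python, a compiled pattern's match(text, pos) only matches
starting exactly at pos, which plays the role of the anchoring Sam
describes. The token names, patterns and the tokenize_anchored function
are assumptions invented for the example.

```python
import re

# Hypothetical patterns, compiled once and kept in memory.
# List order is the priority order when several could match.
TOKEN_REGEXES = [
    ("NUMBER", re.compile(r"\d+")),
    ("IDENT", re.compile(r"[A-Za-z_]\w*")),
    ("OP", re.compile(r"[+\-*/=]")),
    ("SPACE", re.compile(r"\s+")),
]

def tokenize_anchored(text):
    """Walk the text once, trying each cached regex at the current offset."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, regex in TOKEN_REGEXES:
            m = regex.match(text, pos)  # anchored: must match exactly at pos
            if m:
                if name != "SPACE":  # whitespace is consumed but not reported
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no token matches at offset {pos}")
    return tokens

print(tokenize_anchored("x1 = 42"))
```

Each failed attempt usually gives up after looking at a character or
two, which is why the loop over all patterns tends to be cheaper than it
sounds. Every pattern here matches at least one character, so the
position always advances.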


On Tue, 18 Dec 2007 10:38:24 -0800, Samuel R. Neff <sa********@nomail.com>
wrote:

> If you generate the regexes as compiled (a constructor parameter) and
> cache them in memory, and make sure that each regex is anchored to the
> beginning of the search string, then I don't think looping through all
> the regexes as you move through the text will perform badly at all.
> Since the regexes have to match at the start of the text, they can
> all fail very fast if they don't match.
>
> In short, I would first do some testing to make sure you're not trying
> to solve a problem that doesn't exist.




In addition, how long can a Regex match string be? Assuming there's no
practical limit -- that is, you can put any arbitrary string there -- then
since Regex supports boolean "or" in the search pattern string, you could
just have a single search pattern string with all of the tokens in it.

So rather than looping with multiple searches using Regex, just loop on
the tokens when creating the search string. Then let Regex do all the hard
work.

Would this be as fast as or faster than the state graph? I don't know... it
depends on whether the Regex authors put some effort into optimizing that
case. I don't know enough about Regex (implementation _or_ API :) ) to
have an answer to that. But even if they didn't, obviously Sam and I
agree that the simpler code is better as long as there's no direct
evidence that performance is actually going to be an issue.

Pete
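
The single-pattern idea Pete describes (loop over the tokens once to
build one big "or" pattern, then let the regex engine do the scanning)
could be sketched like this. Again this is Python rather than .NET and
only an illustration: wrapping each alternative in a named group, so that
you can tell which token matched, is an addition of this sketch and not
something stated in the thread, and the token names and patterns are
hypothetical.

```python
import re

# Hypothetical token definitions; earlier entries win when several could match.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=]"),
]

# Loop over the tokens once, joining them into a single alternation.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize_single_pass(text):
    """One scan of the buffer; lastgroup names the alternative that matched."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]

print(tokenize_single_pass("x1 = 42 + y"))
```

Text that matches none of the alternatives (the spaces in this example)
is simply skipped by finditer. Whether one combined pattern is actually
faster than many small ones depends on the regex engine, so it is worth
measuring, as both replies suggest.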

