快速确定字符串是否包含字符串列表成员的方法 [英] Fast way to determine if a string contains a member of a list of strings

查看:90
本文介绍了快速确定字符串是否包含字符串列表成员的方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要找到存储和搜索方面最快的方法

确定给定字符串是否包含字符串列表中的一个成员。

所以从这个角度考虑它:我有一个如下字符串:


史密斯说这只狗属于(或确实属于)秘书,一名员工

公司。


然后该列表包含以下内容(此列表在

实际情况中更大) :


亚当斯

艾伦

琼斯

雅各布斯

史密斯

Thaxton

Vela

Zulu


我需要停止处理并返回(true )一旦史密斯找到了b $ b。另一方面,如果字符串被更改为以下内容,则

将不匹配,我将返回(false):


" Smitherson说狗是属于(或确实属于)秘书,公司的一个

员工。


在给定的字符串中,确实知道匹配应该从给定的点开始

(零位),但我需要继续处理,直到我用尽了列表中的

候选字符串 - 如上所示 - 防止错误匹配。


我玩了一些不同的数据结构,比如前缀和

后缀树,这些工作就是你的情况有一个字符串,你想要在列表中匹配
,反之亦然。这种方法需要非常高效,因为它将被评估数百万次。我也是

还可以使用不安全的代码方法。我只需要尽快终止评估

,而不是循环遍历列表中的每一项

。即使是IndexOf太慢了。

I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of strings.
So, think of it in terms of this: I have a string such as the following:

"Smith said the dog belongs (or did belong) to the secretary, an employee of
the company."

Then the list contains the following (and this list is MUCH larger in the
real situation):

Adams
Alan
Jones
Jacobs
Smith
Thaxton
Vela
Zulu

I would need to stop the processing and return (true) as soon as Smith was
found. On the other hand, if the string was changed to the following, there
would be no match and I would return (false):

"Smitherson said the dog belongs (or did belong) to the secretary, an
employee of the company."

In the given string, do know that the matches should begin at a given point
(zero position), but I need to keep processing until I have exhausted the
candidate string in the list - as shown above - to prevent a false match.

I have played around with some different data structures, such as prefix and
suffix trees, an these work in the case that you have a string that you are
trying to match in a list, not vice versa. The approach is required to be
very performant because it will be evaluated millions of times. I am also
okay with an unsafe code approach that works. I just need the evaluations to
terminate as soon as possible rather than looping through every single item
in the list. Even an IndexOf is too slow.

推荐答案

我认为RegEx会帮助你。至于没有遍历

整个列表,我不知道你如何有选择地不会因为

唯一匹配列表是最后一项在列表中或当没有

匹配时。

" Karch" < news.microsoft.com写信息

新闻:e9 ************** @ TK2MSFTNGP06.phx.gbl ...
I think RegEx will help you out here. As far as not iterating over the
entire list, I''ve no clue how you can selectively not fall through given the
only match in the list is the last item in the list or when there is no
match at all.
"Karch" <news.microsoft.comwrote in message
news:e9**************@TK2MSFTNGP06.phx.gbl...

>我需要在存储和搜索方面找到最快的方法来确定给定的字符串是否包含字符串列表中的一个。
所以从这个角度考虑一下:我有一个字符串如下:


史密斯说这只狗属于(或确实属于)秘书,一名员工

该公司。


然后该列表包含以下内容(此列表在

实际情况中更大) :


亚当斯

艾伦

琼斯

雅各布斯

史密斯

Thaxton

Vela

Zulu


我需要停止处理并返回(true )一旦史密斯找到了b $ b。另一方面,如果字符串被更改为以下,

将没有匹配,我将返回(false):


" Smitherson说狗是属于(或确实属于)秘书,公司的一个

员工。


在给定的字符串中,确实知道匹配应该从给定的

点(零位置)开始,但是我需要继续处理,直到我有

用尽了列表中的候选字符串 - 如上所示 - 防止

错误匹配。


我玩了一些不同的数据结构,比如前缀

和后缀树,如果你有一个字符串

,你试图在列表中匹配,而不是相反,这些工作。这种方法需要非常高效,因为它将被评估数百万美元的b $ b次。我也可以使用不安全的代码方法。我只需要

评估尽快终止,而不是通过列表中的每一项循环

。甚至IndexOf太慢了。
>I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of strings.
So, think of it in terms of this: I have a string such as the following:

"Smith said the dog belongs (or did belong) to the secretary, an employee
of the company."

Then the list contains the following (and this list is MUCH larger in the
real situation):

Adams
Alan
Jones
Jacobs
Smith
Thaxton
Vela
Zulu

I would need to stop the processing and return (true) as soon as Smith was
found. On the other hand, if the string was changed to the following,
there would be no match and I would return (false):

"Smitherson said the dog belongs (or did belong) to the secretary, an
employee of the company."

In the given string, do know that the matches should begin at a given
point (zero position), but I need to keep processing until I have
exhausted the candidate string in the list - as shown above - to prevent a
false match.

I have played around with some different data structures, such as prefix
and suffix trees, an these work in the case that you have a string that
you are trying to match in a list, not vice versa. The approach is
required to be very performant because it will be evaluated millions of
times. I am also okay with an unsafe code approach that works. I just need
the evaluations to terminate as soon as possible rather than looping
through every single item in the list. Even an IndexOf is too slow.



Karch写道:
Karch wrote:

I需要在存储和搜索方面找到最快的方式

确定给定的字符串是否包含字符串列表中的一个成员。

所以,想一想在这方面:我有一个如下字符串:


史密斯说这只狗属于(或确实属于)秘书,一名员工

公司。


然后该列表包含以下内容(此列表在

实际情况中更大):


亚当斯

艾伦

琼斯

雅各布斯

史密斯

Thaxton

Vela

Zulu
I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of strings.
So, think of it in terms of this: I have a string such as the following:

"Smith said the dog belongs (or did belong) to the secretary, an employee of
the company."

Then the list contains the following (and this list is MUCH larger in the
real situation):

Adams
Alan
Jones
Jacobs
Smith
Thaxton
Vela
Zulu



你能告诉我们一些可能的长度吗?列表?并且

无论是静态的还是动态的?


jd

Could you give us some idea of the likely length of the list? And
whether it''s static or dynamic ?

jd


05月3日2008年3月09:05:22 -0800,Karch< news.microsoft.com写道:
On Wed, 05 Mar 2008 09:05:22 -0800, Karch <news.microsoft.comwrote:

我需要找到最快的存储和搜索方式

确定给定字符串是否包含

字符串列表中的一个成员。
I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of
strings.



我不知道RegEx是否有优化处理这种

的事情。它可以,但考虑到RegEx的搜索字符串可以是多久,如果你连接所有可能的匹配,也许它不会。


那将是我的第一次尝试。看起来像实际的RegEx搜索

字符串会很简单(只是或所有可能的匹配在一起)。

如果它表现得足够好,那就去吧。


如果没有,我猜这里有很多关于算法的研究,比如

这个,但是一个相当基本的方法是一个州图形。它的记忆力很强,但奖励你的表现很好。这出现了一段时间之前

我发布了一些示例代码以启动其他人。您可以在此处找到


http://groups.google.com/group/micro...06f696d4500b77


我已提到这个帖子有几次,虽然我从来没有说过b $ b有人说它实际上证明它有用,但他们从来没有说过它

wasn不是。 :)


我想如果我要继续推荐人们,也许我应该将b $ b清理一下。但是那里有什么用,应该足够让你指向正确的方向。


Pete

I don''t know if RegEx has optimizations for dealing with this sort of
thing. It could, but given how long a search string for RegEx could be if
you concatenate all of your possible matches, maybe it doesn''t.

That''d be my first try though. It seems like the actual RegEx search
string would be simple (just "or" all of your possible matches together).
If it performs well enough, there you go.

If not, I''d guess there''s a fair amount of research into algorithms like
this, but a fairly basic approach is a state graph. It''s memory
intensive, but rewards you with good performance. This came up awhile ago
and I posted some sample code to get someone else started. You can find
that post here:
http://groups.google.com/group/micro...06f696d4500b77

I''ve referred to this post a couple of other times, and while I''ve never
had anyone say it actually turned out to be useful, they''ve never said it
wasn''t either. :)

I suppose if I''m going to keep referring people to it, maybe I ought to
clean it up a little bit more. But what''s there does work and should be
enough to get you pointed in the right direction.

Pete


这篇关于快速确定字符串是否包含字符串列表成员的方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆