Checking if a list of names appears in a body of text.


Problem description

I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

I thought about looping through the array of names, and doing an
IndexOf or Regex match, but this method is slow. Then I thought about
an array intersection, but this is problematic for two-word company
names (you can't just create the second array based on a split on
spaces).

Any hints would be much appreciated!

--Brent
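
For reference, the looping approach described in the question can be sketched roughly as follows. This is only an illustrative baseline: the names `NaiveMatcher` and `FindNames` are made up for this sketch and do not come from the thread. It runs one `IndexOf` scan of the text per name, so the work grows with the number of names times the length of the text.

```csharp
using System;
using System.Collections.Generic;

public static class NaiveMatcher
{
    // Returns every name that occurs somewhere in the text
    // (ordinal, case-sensitive, substring match).
    public static List<string> FindNames(IEnumerable<string> names, string text)
    {
        var found = new List<string>();
        foreach (string name in names)
        {
            // One full scan of the text for each name in the list.
            if (text.IndexOf(name, StringComparison.Ordinal) >= 0)
            {
                found.Add(name);
            }
        }
        return found;
    }
}
```

Because this is a plain substring search, a short name such as "IBM" would also be reported when it appears inside a longer word.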

Recommended answer

On Fri, 02 May 2008 17:23:21 -0700, Brent <wr********@gmail.com> wrote:

> I have a list of company names (say, IBM, Corning, General Motors, and
> another 5,000 of them).
>
> If I take a body of text, a news article, for instance, and I want to
> see which company names appear in that text, is there an efficient way
> to do this?



This comes up here surprisingly often. :)

For the general case, so far I've yet to see someone suggest a better
solution than the one I've proposed in the past: using a state graph.
This approach assumes that you've got a static list that you can use to
initialize your state graph once, and then use the same graph to process
multiple inputs over and over. The cost of creating the graph would be
prohibitive if you had to create it for each iteration of input.

In an earlier thread, I posted some code that provided a generic
implementation of a state graph. I'm not claiming it's the best
implementation, but it does work. :) You can find that message here:
http://groups.google.com/group/micro...06f696d4500b77

VERY IMPORTANT! That code had a performance bug in it that severely
limited its usefulness. The original person I'd posted the code for never
bothered to comment on it, so I never even noticed until much later, when
I recommended the same code to someone else and they complained that it
wasn't as fast as I'd said it should be. You can find my follow-up post
in which I included the bug-fix for the earlier code here:
http://groups.google.com/group/micro...50505b568a75fd

If you care about performance (and obviously you do :) ), do NOT use the
code I originally posted without also applying the later bug-fix.

You may want to review both threads for comments from other people. A
number of other folks contributed to the threads, and they had insightful
and useful comments, especially pertaining to special cases where you can
get good performance by knowing something particular about the input. The
state graph performs well as a general-purpose solution, but sometimes you
can get equal or better performance with a different solution that takes
advantage of some particular known characteristic of the input.

Pete
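
Pete's own code is only available through the linked threads and is not reproduced here. As a rough illustration of the general idea, the following is an independent sketch of one well-known kind of state graph for multi-string search, an Aho-Corasick automaton. It is not claimed to match Pete's implementation, and the names `NameGraph`, `Node`, and `Find` are hypothetical. The graph is built once from the static list, and each text is then scanned in a single pass.

```csharp
using System;
using System.Collections.Generic;

public sealed class NameGraph
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;   // state for the longest proper suffix that is also a prefix of some name
        public readonly List<string> Matches = new List<string>(); // names ending at this state
    }

    private readonly Node root = new Node();

    // Build the graph once from the static list of names.
    public NameGraph(IEnumerable<string> names)
    {
        // Step 1: insert every name into a character trie.
        foreach (string name in names)
        {
            if (string.IsNullOrEmpty(name)) continue;
            Node node = root;
            foreach (char c in name)
            {
                Node child;
                if (!node.Next.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Next[c] = child;
                }
                node = child;
            }
            node.Matches.Add(name);
        }

        // Step 2: breadth-first pass that sets the failure links.
        var queue = new Queue<Node>();
        foreach (Node child in root.Next.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in node.Next)
            {
                Node fail = node.Fail;
                while (fail != root && !fail.Next.ContainsKey(edge.Key)) fail = fail.Fail;
                Node target;
                edge.Value.Fail = fail.Next.TryGetValue(edge.Key, out target) ? target : root;
                // Names that end at the fallback state also end here.
                edge.Value.Matches.AddRange(edge.Value.Fail.Matches);
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Scan the text once and return the distinct names found in it.
    public HashSet<string> Find(string text)
    {
        var found = new HashSet<string>();
        Node node = root;
        foreach (char c in text)
        {
            // Follow failure links until some state accepts this character.
            while (node != root && !node.Next.ContainsKey(c)) node = node.Fail;
            Node next;
            if (node.Next.TryGetValue(c, out next)) node = next;
            foreach (string name in node.Matches) found.Add(name);
        }
        return found;
    }
}
```

A caller would construct `new NameGraph(companyNames)` once and call `Find(article)` for each article. Like `IndexOf`, this sketch matches substrings, so a word-boundary check would still be needed to avoid reporting a name that appears inside a longer word.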


Very interesting concept, Peter. I'll be playing with this over the weekend.
Peter

"Peter Duniho" <Np*********@nnowslpianmk.com> wrote in message
news:op***************@petes-computer.local...


Brent wrote:

> I have a list of company names (say, IBM, Corning, General Motors, and
> another 5,000 of them).
>
> If I take a body of text, a news article, for instance, and I want to
> see which company names appear in that text, is there an efficient way
> to do this?
>
> I thought about looping through the array of names, and doing an
> IndexOf or Regex match, but this method is slow. Then I thought about
> an array intersection, but this is problematic for two-word company
> names (you can't just create the second array based on a split on
> spaces).




Try something like:

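    using System;
    using System.Collections.Generic;

    public static string[] Find2(string[] lst, string txt)
    {
        // Split the text into tokens and put them in a set for fast lookups.
        HashSet<string> hs = new HashSet<string>(txt.Split(' ', ',', '.'));
        // Keep only the names that appear as a single whole token in the text.
        return Array.FindAll<string>(lst, (string s) => hs.Contains(s));
    }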

assuming you are on .NET 3.5!

Arne
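
Because `Find2` looks up single tokens only, a two-word name such as "General Motors" is never found, which is the problem raised in the question. One possible way to extend the same hash-set idea is sketched below, under the assumption that the words of a name are separated by single spaces: hash the names instead of the tokens, then probe every run of consecutive tokens up to the length of the longest name. The names `PhraseMatcher`, `Separators`, and `maxWords` are hypothetical and not from Arne's post.

```csharp
using System;
using System.Collections.Generic;

public static class PhraseMatcher
{
    private static readonly char[] Separators = { ' ', ',', '.', '\n', '\r', '\t' };

    // Returns the distinct names found in the text, in order of first appearance.
    public static List<string> Find(string[] names, string text)
    {
        var nameSet = new HashSet<string>(names);

        // Length, in words, of the longest name in the list.
        int maxWords = 1;
        foreach (string name in names)
        {
            maxWords = Math.Max(maxWords, name.Split(' ').Length);
        }

        string[] tokens = text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
        var found = new List<string>();
        var seen = new HashSet<string>();
        for (int i = 0; i < tokens.Length; i++)
        {
            // Grow a candidate phrase one token at a time, starting at token i.
            string candidate = tokens[i];
            for (int n = 1; ; n++)
            {
                if (nameSet.Contains(candidate) && seen.Add(candidate)) found.Add(candidate);
                if (n == maxWords || i + n >= tokens.Length) break;
                candidate = candidate + " " + tokens[i + n];
            }
        }
        return found;
    }
}
```

Since punctuation is discarded during the split, this sketch would also report a name whose words straddle a comma or a sentence boundary, so it trades some precision for simplicity.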

