在C#中匹配两个大集合的字符串 [英] Matching Two Large Sets of Strings in C#

查看:636
本文介绍了在C#中匹配两个大集合的字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是这样的情况:

我有一个我作为一个字符串被刮的网页。

I have a webpage that I have scraped as a string.

我在MSSQL数据库中有几个字段。例如,汽车模型,它有一个ID和一个名称,如野马或公民。它是预先填充大多数型号的汽车。

I have several fields in a MSSQL database. For example, car model, it has an ID and a Name, such as Mustang or Civic. It is pre-filled with most models of car.

我想找到我的模型表中任何行的任何匹配。所以如果我在我的模型表中有公民,野马和E350,我想要找到我已经刮下的页面中的三个中的任何一个。

I want to find any match for any row in my models table. So if I have Civic, Mustang and E350 in my Model Table I want to find any occurance of any of the three on the page I have scraped.

什么是有效的在C#中做到这一点。我正在使用LINQ to SQL与db进行接口。

What is an efficient way to do this in C#. I am using LINQ to SQL to interface with the db.

创建所有模型的字典,标记页面和迭代令牌是有意义的吗?或者我应该循环使用令牌并使用WHERE子句,并询问数据库是否有匹配?

Does creating a dictionary of all models, tokenizing the page and iterating through the tokens make sense? Or should I just iterate through the tokens and use a WHERE clause and ask the database if there is a match?

    //Dictionary dic contains all models from the DB, with the name being the key and the id being the value...
    foreach(string pageToken in pageTokens)
    {
         if(dic.ContainsKey(pageToken)) 
         {
              //Do what I need to do
         }
    }

这两种方法对我来说都是可怕的。有什么建议我应该做什么有一些设置交叉点我想象可能会很好吗?

Both of these methods seem terrible to me. Any suggestions on what I should do? Something with set intersection I would imagine might be nice?

这两种方法都不能解决当一个模型名称不止一个字时会发生什么,如F150扩展驾驶室 。

Neither of these methods address what happens when a Model name is more than one word..like "F150 Extended Cab". Thoughts on that?

推荐答案

在较大的文本中搜索多个字符串是一个很好理解的问题,并且已经做出了显着的研究使其快速。这两种最流行和最有效的方法是 Aho-Corasick Algorithm (I' d推荐这个)和 Rabin-Karp算法。他们使用了一些预处理,但是数量级不及复杂。比naieve方法更快(naieve方法是最坏情况O(m * n ^ 2 * p)其中m是长串[你所刮的网页]的长度,n是针的平均长度,p是针数)。 Aho-Corsaik是线性的。 AC#的实现可以在CodeProject中免费找到。

Searching for multiple strings in a larger text is a well-understood problem, and signifigant research has been made into making it fast. The two most popular and effective methods for this are the Aho-Corasick Algorithm (I'd rcommend this one) and the Rabin-Karp Algorithm. They use a little preprocessing, but are orders of magnitude less complex & faster than the naieve method (the naieve method is worst-case O(m*n^2*p) where m is the length of the long string [the webpage you scraped] and n is the average length of the needles and p is the number of needles). Aho-Corsaik is linear. A C# implementation of it can be found at CodeProject for free.

编辑:哎呀,我对Aho-Corasick的复杂性错了 - 在数字&输入字符串的长度+正在分析的字符串的大小[被删除的文本]加上匹配的数量。但是它仍然是线性的,线性比立方更好: - )。

Oops, I was wrong about the complexity of Aho-Corasick -- it's linear in the number & length of input strings + the size of the string being analyzed [the scraped text] plus the number of matches. But it's still linear and linear is a lot better than cubic :-).

这篇关于在C#中匹配两个大集合的字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆