字符串匹配的两个大集在C# [英] Matching Two Large Sets of Strings in C#

查看:200
本文介绍了字符串匹配的两个大集在C#的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这里的情况是:

我有我已经刮作为一个字符串网页

I have a webpage that I have scraped as a string.

我有一个MSSQL数据库多个领域。例如,汽车模型,它有一个ID和一个名称,如野马或公民。它是预填充汽车的大部分车型。

I have several fields in a MSSQL database. For example, car model, it has an ID and a Name, such as Mustang or Civic. It is pre-filled with most models of car.

我想找到任何匹配在我的模型表中的任何一行。所以,如果我有思域,野马和E350在我的模型表我想找到的任何三个我都刮了页面上的任何次数。

I want to find any match for any row in my models table. So if I have Civic, Mustang and E350 in my Model Table I want to find any occurance of any of the three on the page I have scraped.

什么是有效的办法做到这一点在C#。我使用LINQ到SQL与数据库交互。

What is an efficient way to do this in C#. I am using LINQ to SQL to interface with the db.

是否创建所有车型的字典,标记化的页面,通过迭代的标记有意义吗?或者我应该只是通过令牌迭代,并使用WHERE子句和查询数据库,如果有一个匹配?

Does creating a dictionary of all models, tokenizing the page and iterating through the tokens make sense? Or should I just iterate through the tokens and use a WHERE clause and ask the database if there is a match?

    //Dictionary dic contains all models from the DB, with the name being the key and the id being the value...
    foreach(string pageToken in pageTokens)
    {
         if(dic.ContainsKey(pageToken)) 
         {
              //Do what I need to do
         }
    }

这两种方法似乎可怕的我。对我应该做的有什么建议?有交集的东西我会想象可能是好的?

Both of these methods seem terrible to me. Any suggestions on what I should do? Something with set intersection I would imagine might be nice?

这些方法都不满足时,型号名称是多个word..likeF150加长车厢发生了什么。上思考?

Neither of these methods address what happens when a Model name is more than one word..like "F150 Extended Cab". Thoughts on that?

推荐答案

搜索在一个较大的文本多个字符串是一个很好理解的问题,发生了重大的研究已经取得了为使其快速。这两个最流行和有效的方法是阿霍Corasick算法(我会rcommend这一个)和拉宾,卡普算法。他们用一个小预处理,但幅度不太复杂和放大器的订单;比naieve方法更快(在naieve方法是最坏情况下的O(M * N ^ 2 *ρ)其中m是长串[你刮下网页]和长度n是针的平均长度,p是针的数量)。阿霍Corsaik是线性的。 C#实现它可以在CodeProject上可以找到免费的。

Searching for multiple strings in a larger text is a well-understood problem, and signifigant research has been made into making it fast. The two most popular and effective methods for this are the Aho-Corasick Algorithm (I'd rcommend this one) and the Rabin-Karp Algorithm. They use a little preprocessing, but are orders of magnitude less complex & faster than the naieve method (the naieve method is worst-case O(m*n^2*p) where m is the length of the long string [the webpage you scraped] and n is the average length of the needles and p is the number of needles). Aho-Corsaik is linear. A C# implementation of it can be found at CodeProject for free.

编辑:哎呀,我错了阿霍Corasick的复杂性 - 它在数量和放线; +被分析的字符串的大小的输入字符串[刮下文]加匹配的数目的长度。但它仍然是直线和直线比立方好了很多: - )

Oops, I was wrong about the complexity of Aho-Corasick -- it's linear in the number & length of input strings + the size of the string being analyzed [the scraped text] plus the number of matches. But it's still linear and linear is a lot better than cubic :-).

这篇关于字符串匹配的两个大集在C#的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆