字符串匹配的两个大集在C＃ [英] Matching Two Large Sets of Strings in C#

查看：200 发布时间：2016/9/26 13:11:29 c# sql search dictionary

本文介绍了字符串匹配的两个大集在C＃的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

这里的情况是：

我有我已经刮作为一个字符串网页

I have a webpage that I have scraped as a string.

我有一个MSSQL数据库多个领域。例如，汽车模型，它有一个ID和一个名称，如野马或公民。它是预填充汽车的大部分车型。

I have several fields in a MSSQL database. For example, car model, it has an ID and a Name, such as Mustang or Civic. It is pre-filled with most models of car.

我想找到任何匹配在我的模型表中的任何一行。所以，如果我有思域，野马和E350在我的模型表我想找到的任何三个我都刮了页面上的任何次数。

I want to find any match for any row in my models table. So if I have Civic, Mustang and E350 in my Model Table I want to find any occurance of any of the three on the page I have scraped.

什么是有效的办法做到这一点在C＃。我使用LINQ到SQL与数据库交互。

What is an efficient way to do this in C#. I am using LINQ to SQL to interface with the db.

是否创建所有车型的字典，标记化的页面，通过迭代的标记有意义吗？或者我应该只是通过令牌迭代，并使用WHERE子句和查询数据库，如果有一个匹配？

Does creating a dictionary of all models, tokenizing the page and iterating through the tokens make sense? Or should I just iterate through the tokens and use a WHERE clause and ask the database if there is a match?

    //Dictionary dic contains all models from the DB, with the name being the key and the id being the value...
    foreach(string pageToken in pageTokens)
    {
         if(dic.ContainsKey(pageToken)) 
         {
              //Do what I need to do
         }
    }

这两种方法似乎可怕的我。对我应该做的有什么建议？有交集的东西我会想象可能是好的？

Both of these methods seem terrible to me. Any suggestions on what I should do? Something with set intersection I would imagine might be nice?

这些方法都不满足时，型号名称是多个word..likeF150加长车厢发生了什么。上思考？

Neither of these methods address what happens when a Model name is more than one word..like "F150 Extended Cab". Thoughts on that?

推荐答案

搜索在一个较大的文本多个字符串是一个很好理解的问题，发生了重大的研究已经取得了为使其快速。这两个最流行和有效的方法是阿霍Corasick算法（我会rcommend这一个）和拉宾，卡普算法。他们用一个小预处理，但幅度不太复杂和放大器的订单;比naieve方法更快（在naieve方法是最坏情况下的O（M * N ^ 2 *ρ）其中m是长串[你刮下网页]和长度n是针的平均长度，p是针的数量）。阿霍Corsaik是线性的。 C＃实现它可以在CodeProject上可以找到免费的。

Searching for multiple strings in a larger text is a well-understood problem, and signifigant research has been made into making it fast. The two most popular and effective methods for this are the Aho-Corasick Algorithm (I'd rcommend this one) and the Rabin-Karp Algorithm. They use a little preprocessing, but are orders of magnitude less complex & faster than the naieve method (the naieve method is worst-case O(m*n^2*p) where m is the length of the long string [the webpage you scraped] and n is the average length of the needles and p is the number of needles). Aho-Corsaik is linear. A C# implementation of it can be found at CodeProject for free.

编辑：哎呀，我错了阿霍Corasick的复杂性 - 它在数量和放线; +被分析的字符串的大小的输入字符串[刮下文]加匹配的数目的长度。但它仍然是直线和直线比立方好了很多： - ）

Oops, I was wrong about the complexity of Aho-Corasick -- it's linear in the number & length of input strings + the size of the string being analyzed [the scraped text] plus the number of matches. But it's still linear and linear is a lot better than cubic :-).

这篇关于字符串匹配的两个大集在C＃的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

字符串匹配的两个大集在C＃ [英] Matching Two Large Sets of Strings in C#

问题描述

推荐答案

相关文章

C#/.NET最新文章

热门教程

热门工具

登录关闭

字符串匹配的两个大集在C＃ [英] Matching Two Large Sets of Strings in C#

问题描述

推荐答案

相关文章

C#/.NET最新文章

热门教程

热门工具

登录 关闭

登录关闭