在C＃中匹配两个大集合的字符串 [英] Matching Two Large Sets of Strings in C#

查看：636 发布时间：2017/5/21 20:25:01 c# sql search dictionary

本文介绍了在C＃中匹配两个大集合的字符串的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

这是这样的情况：

我有一个我作为一个字符串被刮的网页。

I have a webpage that I have scraped as a string.

我在MSSQL数据库中有几个字段。例如，汽车模型，它有一个ID和一个名称，如野马或公民。它是预先填充大多数型号的汽车。

I have several fields in a MSSQL database. For example, car model, it has an ID and a Name, such as Mustang or Civic. It is pre-filled with most models of car.

我想找到我的模型表中任何行的任何匹配。所以如果我在我的模型表中有公民，野马和E350，我想要找到我已经刮下的页面中的三个中的任何一个。

I want to find any match for any row in my models table. So if I have Civic, Mustang and E350 in my Model Table I want to find any occurance of any of the three on the page I have scraped.

什么是有效的在C＃中做到这一点。我正在使用LINQ to SQL与db进行接口。

What is an efficient way to do this in C#. I am using LINQ to SQL to interface with the db.

创建所有模型的字典，标记页面和迭代令牌是有意义的吗？或者我应该循环使用令牌并使用WHERE子句，并询问数据库是否有匹配？

Does creating a dictionary of all models, tokenizing the page and iterating through the tokens make sense? Or should I just iterate through the tokens and use a WHERE clause and ask the database if there is a match?

    //Dictionary dic contains all models from the DB, with the name being the key and the id being the value...
    foreach(string pageToken in pageTokens)
    {
         if(dic.ContainsKey(pageToken)) 
         {
              //Do what I need to do
         }
    }

这两种方法对我来说都是可怕的。有什么建议我应该做什么有一些设置交叉点我想象可能会很好吗？

Both of these methods seem terrible to me. Any suggestions on what I should do? Something with set intersection I would imagine might be nice?

这两种方法都不能解决当一个模型名称不止一个字时会发生什么，如F150扩展驾驶室。

Neither of these methods address what happens when a Model name is more than one word..like "F150 Extended Cab". Thoughts on that?

推荐答案

在较大的文本中搜索多个字符串是一个很好理解的问题，并且已经做出了显着的研究使其快速。这两种最流行和最有效的方法是 Aho-Corasick Algorithm （I' d推荐这个）和 Rabin-Karp算法。他们使用了一些预处理，但是数量级不及复杂。比naieve方法更快（naieve方法是最坏情况O（m * n ^ 2 * p）其中m是长串[你所刮的网页]的长度，n是针的平均长度，p是针数）。 Aho-Corsaik是线性的。 AC＃的实现可以在CodeProject中免费找到。

Searching for multiple strings in a larger text is a well-understood problem, and signifigant research has been made into making it fast. The two most popular and effective methods for this are the Aho-Corasick Algorithm (I'd rcommend this one) and the Rabin-Karp Algorithm. They use a little preprocessing, but are orders of magnitude less complex & faster than the naieve method (the naieve method is worst-case O(m*n^2*p) where m is the length of the long string [the webpage you scraped] and n is the average length of the needles and p is the number of needles). Aho-Corsaik is linear. A C# implementation of it can be found at CodeProject for free.

编辑：哎呀，我对Aho-Corasick的复杂性错了 - 在数字&输入字符串的长度+正在分析的字符串的大小[被删除的文本]加上匹配的数量。但是它仍然是线性的，线性比立方更好： - ）。

Oops, I was wrong about the complexity of Aho-Corasick -- it's linear in the number & length of input strings + the size of the string being analyzed [the scraped text] plus the number of matches. But it's still linear and linear is a lot better than cubic :-).

这篇关于在C＃中匹配两个大集合的字符串的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

在C＃中匹配两个大集合的字符串 [英] Matching Two Large Sets of Strings in C#

问题描述

推荐答案

相关文章

C#/.NET最新文章

热门教程

热门工具

登录关闭

在C＃中匹配两个大集合的字符串 [英] Matching Two Large Sets of Strings in C#

问题描述

推荐答案

相关文章

C#/.NET最新文章

热门教程

热门工具

登录 关闭

登录关闭