Fast Dynamic Fuzzy Search over 100k+ Strings in C#


Question

Let's say the strings are pre-loaded stock symbols and the query is typed into a text box. I am looking for code that I can copy, not a library to install.

This was inspired by this question:

Are there any Fuzzy Search or String Similarity Functions libraries written for C#? (http://stackoverflow.com/questions/83777/are-there-any-fuzzy-search-or-string-similarity-functions-libraries-written-for-c)

The Levenshtein distance algorithm seems to work well, but it takes time to compute. Are there any optimizations that take advantage of the fact that the query has to be re-run each time the user types an extra letter? I am interested in showing at most the top 10 matches for each input.

Recommended answer

You need to determine the matching rules for your strings. What determines a 'similar string'?

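  • Number of matching characters
  • Number of non-matching characters
  • Similar length
  • Typos or phonetic errors
  • Business-specific abbreviations
  • Must start with the same substring
  • Must end with the same substring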
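
As a rough illustration of how a few of these rules could be combined into one score, here is a minimal sketch. The class name `SymbolScorer`, the method `Score`, and all of the weights are hypothetical and chosen only for the example; they are not taken from the answer and would need tuning against real data.

```csharp
using System;

public static class SymbolScorer
{
    // Hypothetical weights; adjust them to reflect your own matching rules.
    public static int Score(string query, string candidate)
    {
        if (string.IsNullOrEmpty(query) || string.IsNullOrEmpty(candidate)) return 0;

        string q = query.ToUpperInvariant();
        string c = candidate.ToUpperInvariant();

        int score = 0;
        if (c.StartsWith(q, StringComparison.Ordinal)) score += 100; // starts with the same substring
        if (c.EndsWith(q, StringComparison.Ordinal)) score += 20;    // ends with the same substring

        // Count query characters that can be found in the candidate, in order.
        int matched = 0, pos = 0;
        foreach (char ch in q)
        {
            int found = c.IndexOf(ch, pos);
            if (found < 0) continue;
            matched++;
            pos = found + 1;
        }

        score += 10 * matched;                  // matching characters
        score -= 5 * (q.Length - matched);      // non-matching characters
        score -= Math.Abs(c.Length - q.Length); // penalize length difference
        return score;
    }
}
```

Calling `SymbolScorer.Score("MSF", "MSFT")` gives a higher score than `SymbolScorer.Score("MSF", "AAPL")`, because the first candidate shares a prefix and all three characters with the query. Typos, phonetic errors and business abbreviations are not handled here and would need their own rules.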

I've done quite a lot of work with string matching algorithms, and have yet to find any existing library or code that meets my specific requirements. Review them, borrow ideas from them, but you will invariably have to customize and write your own code.

The Levenshtein algorithm is good but a bit slow. I've had some success with both the Smith-Waterman and Jaro-Winkler algorithms, but the best I found for my purpose was Monge (from memory). However, it pays to read the original research to understand why the authors designed each algorithm and what dataset they were targeting.
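
One common way to reduce the cost of Levenshtein when only close matches matter is to stop computing as soon as the distance is known to exceed a threshold. The sketch below is an assumption about how that could look, not the answerer's code; `FuzzySearch`, `BoundedLevenshtein`, `TopMatches` and the default `maxDistance` of 2 are hypothetical names and values picked for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FuzzySearch
{
    // Two-row Levenshtein distance. Returns maxDistance + 1 as soon as the
    // true distance is known to be greater than maxDistance.
    public static int BoundedLevenshtein(string a, string b, int maxDistance)
    {
        // The distance is at least the difference in length.
        if (Math.Abs(a.Length - b.Length) > maxDistance) return maxDistance + 1;

        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
                if (curr[j] < rowMin) rowMin = curr[j];
            }
            // The final distance can never be smaller than the minimum of a row.
            if (rowMin > maxDistance) return maxDistance + 1;
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return Math.Min(prev[b.Length], maxDistance + 1);
    }

    // Returns up to `take` symbols within maxDistance of the query, closest first.
    public static List<string> TopMatches(IEnumerable<string> symbols, string query,
                                          int maxDistance = 2, int take = 10)
    {
        return symbols
            .Select(s => new { Symbol = s, Distance = BoundedLevenshtein(query, s, maxDistance) })
            .Where(x => x.Distance <= maxDistance)
            .OrderBy(x => x.Distance)
            .ThenBy(x => x.Symbol, StringComparer.Ordinal)
            .Take(take)
            .Select(x => x.Symbol)
            .ToList();
    }
}
```

For example, `FuzzySearch.TopMatches(new[] { "MSFT", "MSF", "AAPL", "MFST" }, "MSFT")` returns `MSFT`, `MSF` and `MFST` in that order, and drops `AAPL` because it is more than two edits away. This still scans every symbol on each keystroke; it only makes each comparison cheaper, and whether that is fast enough for 100k+ strings would have to be measured.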

If you don't properly define what you want to match and measure, then you'll find high scores on unexpected matches and low scores on expected matches. String matching is very domain specific. If you don't properly define your domain, then you are like a fisherman without a clue, throwing hooks around and hoping for the best.
