莱文斯坦DFA在.NET [英] Levenshtein DFA in .NET

查看:290
本文介绍了莱文斯坦DFA在.NET的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

下午好,

有谁知道的即开即用的实施莱文斯坦的DFA(确定有限自动机的)的.NET(或容易翻译吧)?我有一个非常大辞典超过16万不同的话,我想,给予inicial字的是W 的,发现在Levenshtein距离所有已知的话顶多2的是W 的以有效的方式。

Does anyone know of an "out-of-the-box" implementation of Levenshtein DFA (deterministic finite automata) in .NET (or easily translatable to it)? I have a very big dictionary with more than 160000 different words, and I want to, given an inicial word w, find all known words at Levenshtein distance at most 2 of w in an efficient way.

当然,具有功能,计算所有可能的编辑在一个给定单词的编辑距离之一,并再次将其应用到每个这些编辑的解决问题(和在一个pretty的straightforwad方式)。现在的问题是效率刍议---给予7字母的单词,这已经可以拿1秒以上即可完成,而我需要的东西的的更有效的---如果可能的话,因为它与莱文斯坦的DFA,一个解决方案,需要O( | W | 的)。步

Of course, having a function which computes all possible edits at edit distance one of a given word and applying it again to each of these edits solves the problem (and in a pretty straightforwad way). The problem is effiency --- given a 7 letter word, this can already take over 1 second to complete, and I need something much more efficient --- if possible, as it is with Levenshtein DFAs, a solution that takes O(|w|) steps.

编辑:我知道,我可以构建自己的解决问题的方法与学习一点点,但此刻我不能读舒尔茨和米霍夫的60页长的文章

I know I can construct my own approach to the problem with a little bit of studying, but at the moment I can't afford reading Schulz and Mihov's 60-page-long articles.

非常感谢你。

推荐答案

我们实现了这个为Apache Lucene的java的,也许你可以将其转换为C#和节省自己的时间。

We implemented this for apache lucene java, perhaps you could convert it to C# and save yourself time.

主类是在这里:它只是一个制造商获得莱文斯坦的DFA从字符串,用舒尔茨和米霍夫算法

the main class is here: its just a builder to get Levenshtein DFAs from a string, using the Schulz and Mihov algorithm.

<一个href="http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/LevenshteinAutomata.java" rel="nofollow">http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/LevenshteinAutomata.java

的参数描述(在precomputed表)的LEV1和LEV2在这里: <一href="http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev1ParametricDescription.java" rel="nofollow">http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev1ParametricDescription.java

the parametric descriptions (the precomputed tables) for Lev1 and Lev2 are here: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev1ParametricDescription.java

<一个href="http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev2ParametricDescription.java" rel="nofollow">http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev2ParametricDescription.java

您可能会注意到,这些都与计算机生成,我们产生他们这个脚本,用让 - 菲利普发夹的伟大摩曼实现(蟒蛇) <一href="http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/createLevAutomata.py" rel="nofollow">http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/createLevAutomata.py

you might notice these are generated with a computer, we generated them with this script, using Jean-Phillipe Barrette's great moman implementation (python) http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/createLevAutomata.py

我们生成参数描述为包装的长[]数组,这样它不会让我们的jar文件过大。

we generate the parametric descriptions as packed long[] arrays so that it won't make our jar file too large.

只需修改toAutomaton(INT N),以满足您的需求/ DFA包。在我们的例子中,我们使用的是金砖四国的自动包装,其中转换被重新psented为UNI code $ C $连接点范围$ P $的改进形式。

just modify the toAutomaton(int n) to fit your needs/DFA package. in our case we are using a modified form of the brics automaton package, where transitions are represented as unicode codepoint ranges.

有效的单元测试是困难的这样的事情,但这里是我们想出了...这似乎是彻底的,甚至发现了摩曼实现中的错误(这是由作者立即修复!)。

efficient unit tests are difficult for this sort of thing, but here is what we came up with... it seems to be thorough and even found a bug (which was fixed immediately by the author!) in the moman implementation.

<一个href="http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/TestLevenshteinAutomata.java" rel="nofollow">http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/TestLevenshteinAutomata.java

这篇关于莱文斯坦DFA在.NET的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆