独立于符号的字符串的模式匹配 [英] Pattern matching for strings independent from symbols

查看:74
本文介绍了独立于符号的字符串的模式匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要一种算法,该算法可以在数据中找到预定义的模式(以字符串形式出现),而与数据和模式的实际符号/字符无关.我只关心符号之间的关系,而不关心符号本身.对数据中的同一符号使用不同的图案符号也是合法的.模式匹配算法必须强制执行的唯一操作是,保留模式中同一符号的多次出现.举个例子:

I have need for an algorithm which can find pre-defined patterns in data (which is present in the form of strings) independent from the actual symbols/characters of the data and the pattern. I only care about the relations between the symbols, not the symbols themselves. It is also legal to have different pattern symbols for the same symbol in the data. The only thing the pattern matching algorithm has to enforce is that multiple occurences of the same symbol in the pattern are preserved. To give you an example:

此格式为 abca ,因此第一个字母和最后一个字母相同.对于我的应用程序,等效的写法是 1 2 3 1 ,其中数字只是变量.我拥有的数据是 thistextisatest .生成的算法应在这里为我提供两个正确的匹配项,即文本测试.因为只有在这两种情况下,第一个字母和第四个字母才相同,就像在模式中一样.

The pattern is abca, so the first and the last letter are the same. For my application, an equivalent way to write this would be 1 2 3 1, where the digits are just variables. The data I have is thistextisatest. The resulting algorithm should give me two correct matches here, text and test. Because only in these two cases, the first and the fourth letter are the same, as in the pattern.

作为第二个示例,模式 abcd 应该返回12个匹配项(此textisat中的每个位置一个).由于模式中的变量没有重复,因此它在任何地方都是微不足道的.即使在文本 test 的情况下,因为合法的是模式的变量 a d 映射到相同的符号.

As a second example, the pattern abcd should return 12 matches (one for each position in thistextisat). Since no variable in the pattern is repeated, it is trivially matched everywhere. Even in the case of text and test, because it is legal that the variables a and d of the pattern map to the same symbol.

该算法的目标应该是检测书面语言中的相似性.想象一下,拥有一本英语词典,并以 unseen 或等效的 1 2 3 4 4 2 模式对其进行解析.然后,您会看到例如 belittle 包含相同的字母模式.

The goal of this algorithm should be to detect similarities in written language. Imagine having a dictionary of the English language and parsing it with the pattern unseen or equivalently 1 2 3 4 4 2. You would then see that, for example, the word belittle contains the same pattern of letters.

因此,现在我希望明确说明了我的需求,我有一些问题:

So, now that I hopefully made clear what I need, I have some questions:

  • 此算法称为什么?是一个众所周知的问题已经解决了吗?

  • What is this algorithm called? Is it a well-known problem that has been solved?

是否有关于此事的出版物?当您不知道将正确的搜索条件与常规模式匹配区分开来的正确搜索词时,很难找到有用的东西.

Are there publications on the matter? It is really hard to find anything useful when you don't know the correct search terms to separate this problem from regular pattern matching.

我没有将Regex用于过于复杂的事情,所以我不知道在Regex中是否甚至可以进行这样的事情,因为您基本上不关心符号本身,而只是考虑它们的出现方式

I have not used Regex for anything too complicated, so I don't know if anything like this would even be possible in Regex, when you basically do not care about the symbols as such, but only consider the pattern of their occurences.

非常感谢您的帮助!

推荐答案

我认为您不需要在这里使用正则表达式.您的搜索词:

I don't think you need regular expressions here. Your search term:

unseen
123442

它有六个字符,因此将文本中的每个单词编入6个字符

This has six characters, so index each word of your text into 6-mers

脆弱

12,12,12,12,11,12,12 2-mers
123,123,123,122,112,123 3-mers
1234,1234,1233,1223,1123 4-mers
12345,12344,12334,12234 5-mers
123455,123442,123321 6-mers

因此,仅看6人,您就有了比赛.小于您的搜索词的任何6位数字也将是一个匹配项,以允许abcd(1234)大小写与abca(1231)单词匹配.

So just looking at the 6-mers, you've got a match. Any 6 digit number less than your search term would also be a match, to allow for the abcd (1234) case matching an abca (1231) word.

因此,给定一个n个字符的搜索词,只需将每个单词拆分成其组成的n-mers,然后检查数字是否等于或小于.

So given a search term of n characters, just split each word into its constituent n-mers and check for numeric equal or less than.

这篇关于独立于符号的字符串的模式匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆