Automatically building regular expressions that fit a set of strings


Question

We have written a system to analyse log messages from a large network. The system takes log messages from lots of different network elements and analyses them with regular expressions. For example, a user may have written two rules:

^cron/script\.sh.*
.*script\.sh [0-9]+$

In this case, only logs that match the given patterns will be selected. The reason for the filtering is that there may be a great many log messages, up to 1 GB per day.
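As a minimal sketch of the filtering described above, assuming the rules are applied line by line and a line is kept when any rule matches: the function name `select_logs` and the sample messages are hypothetical, invented only for illustration.

```python
import re

# The two example rules from the question, compiled once up front.
RULES = [
    re.compile(r"^cron/script\.sh.*"),
    re.compile(r".*script\.sh [0-9]+$"),
]

def select_logs(lines, rules=RULES):
    """Yield only the log lines that match at least one rule."""
    for line in lines:
        if any(rule.search(line) for rule in rules):
            yield line

# Hypothetical sample messages, not taken from a real log.
sample = [
    "cron/script.sh started",
    "bin/script.sh 42",
    "kernel: link up",
]
selected = list(select_logs(sample))
```

Compiling the patterns once and streaming the lines through a generator keeps memory use flat, which matters at the volumes mentioned above.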

Now the main part of my question. Since there are lots of network elements, of several types, and every one of them has different parameters in its path... is there any way to automatically generate a set of regexes that will somehow group the logs? The system can learn from historical data, e.g. from the last week. The generated regexes need not be very accurate; they are meant as hints for the user to add such new rules to the system.

I was thinking about using unsupervised machine learning to divide the input into groups and then finding a proper regex for each group. Is there any other way, maybe faster or better? And, last but not least, how do I find a regex that matches all strings in an obtained group? (Non-trivial, so .* is not the answer.)

Edit: After some thinking, I'll try to simplify the problem. Suppose I have already grouped the logs. I'd like to find the (at most) three largest substrings (at least one) common to all the strings in a set. For example:

Set of strings:
cron/script1.sh -abc 1243 all
cron/script2.sh 1
bin/script1.sh -asdf 15

Obtained groups:
/script
.sh 

Now I can build a simple regex by joining these groups with .*?. In this example it would be .*?(/script).*?(\.sh ).*?. This seems to be a simpler solution.
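The concatenation described above can be sketched in Python as follows. The helper name `build_regex` is hypothetical, and the sketch assumes the common substrings occur in the same order in every string. It uses `re.escape` so that characters such as `.` are matched literally; note that `re.escape` also escapes the space, so the generated text differs slightly from the hand-written pattern while matching the same strings.

```python
import re

def build_regex(common_parts):
    """Join literal substrings with lazy wildcards, one capture group each."""
    groups = ["(" + re.escape(part) + ")" for part in common_parts]
    return ".*?" + ".*?".join(groups) + ".*?"

# The set of strings and the obtained groups from the example above.
strings = [
    "cron/script1.sh -abc 1243 all",
    "cron/script2.sh 1",
    "bin/script1.sh -asdf 15",
]
pattern = build_regex(["/script", ".sh "])

# Every string in the set should match the generated pattern.
matches_all = all(re.search(pattern, s) for s in strings)
```

The capture groups are not needed for matching; they only make it easy to see which literal part matched where when the suggested rule is shown to the user.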

Recommended answer

OK, we'll try to break this down into manageable steps.

  1. For each substring w in s1, in order of non-increasing length,
  2.  assume w is a substring of every other string sN
  3.  for each other string sN,
  4.   if w is not a substring of sN, disprove the assumption and break
  5.  if the assumption held, save w
  6.  if you've found three w that work, break
  7. You have recorded between 0 and 3 w that work.
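The steps above can be sketched in Python roughly as follows. The function name `common_substrings` and its parameters are hypothetical. The `skip_nested` option is an addition of this sketch, not part of the steps: it skips candidates that are contained in an already saved substring.

```python
def common_substrings(strings, limit=3, skip_nested=False):
    """Return up to `limit` substrings of strings[0] found in every string.

    Candidates are tried longest first, as in the steps above.
    skip_nested is an assumption of this sketch: when True, a candidate
    contained in an already saved substring is ignored.
    """
    if not strings:
        return []
    first, others = strings[0], strings[1:]
    found = []
    n = len(first)
    for length in range(n, 0, -1):            # non-increasing length
        for start in range(n - length + 1):   # sliding window over s1
            w = first[start:start + length]
            if w in found:
                continue  # same text seen at another position
            if skip_nested and any(w in f for f in found):
                continue
            if all(w in s for s in others):   # assumption held for every sN
                found.append(w)
                if len(found) == limit:
                    return found
    return found

logs = [
    "cron/script1.sh -abc 1243 all",
    "cron/script2.sh 1",
    "bin/script1.sh -asdf 15",
]
plain = common_substrings(logs)
distinct = common_substrings(logs, skip_nested=True)
```

On the question's example, the plain steps return nested results ("n/script", "n/scrip", "/script"), because every shorter piece of a common substring is itself common. With `skip_nested=True` the result is "n/script", ".sh " and " 1", which is closer to what the question asks for, although the last entry is coincidental and shows why the output is only a hint.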

Note that not all sets of strings are guaranteed to have a common substring (other than the empty string). For the worst case, assume s1 is the longest string, with |s1| = n. s1 has O(n^2) substrings, and it takes O(n) to compare one of them against each of the m other strings, so the asymptotic complexity is, I believe, O(n^2 * nm) = O(n^3 * m). Even though the algorithm is naive, this should be pretty manageable: it is polynomial, after all, cubic in n and linear in m.

The transformation to e.g. C code should be straightforward: use a sliding window inside a loop of decreasing length to get the substrings of s1, and then use a linear-time search to find matches in the other strings.

I'm sure there are smarter / asymptotically better ways of doing this, but any algorithm will have to look at all characters in all strings, so at least on the order of nm work is needed, though I may not be completely right here.
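One such alternative, sketched here under stated assumptions and not taken from the answer itself, is to binary-search the length of the longest common substring: if a common substring of length k exists, one of length k - 1 exists too, so the feasible lengths form a prefix. For each candidate length, the sets of substrings of that length are intersected across all strings. The function names are hypothetical, and this finds only the single longest substring rather than three.

```python
def common_of_length(strings, k):
    """Return the set of length-k substrings shared by every string."""
    shared = None
    for s in strings:
        grams = {s[i:i + k] for i in range(len(s) - k + 1)}
        shared = grams if shared is None else shared & grams
        if not shared:
            return set()  # nothing of this length survives
    return shared or set()

def longest_common_substring(strings):
    """Binary search on the length; returns "" if nothing is shared."""
    if not strings:
        return ""
    lo, hi = 0, min(len(s) for s in strings)
    best = ""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        shared = common_of_length(strings, mid)
        if shared:
            best = min(shared)  # deterministic pick among ties
            lo = mid
        else:
            hi = mid - 1
    return best

longest = longest_common_substring([
    "cron/script1.sh -abc 1243 all",
    "cron/script2.sh 1",
    "bin/script1.sh -asdf 15",
])
```

Each length check touches every window of every string once, and only about log n lengths are tried, so this does far fewer substring searches than enumerating all O(n^2) windows of s1.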
