如何检测字符串列表中的常见子字符串 [英] How can I detect common substrings in a list of strings

查看:42
本文介绍了如何检测字符串列表中的常见子字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给定一组字符串,例如:

Given a set of strings, for example:

EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green

我希望能够检测到这是三组文件:

I want to be able to detect that these are three sets of files:

  • 整个[1,2]
  • J27[红、绿]P[1,2]
  • JournalP[1,2][红、绿、蓝]

是否有任何已知的方法可以解决这个问题——我可以阅读有关此问题的任何已发表的论文?

Are there any known ways of approaching this problem - any published papers I can read on this?

我正在考虑的方法是对每个字符串查看所有其他字符串并找到共同字符和不同字符的位置,试图找到具有最多共同点的字符串集,但我担心这不是很有效并且可能会产生误报.

The approach I am considering is for each string look at all other strings and find the common characters and where differing characters are, trying to find sets of strings that have the most in common, but I fear that this is not very efficient and may give false positives.

请注意,这与 ' 不同我如何检测文件名中的常见字符串组,因为它假设一个字符串后面总是有一系列数字.

Note that this is not the same as 'How do I detect groups of common strings in filenames' because that assumes that a string will always have a series of digits following it.

推荐答案

我会从这里开始:http://en.wikipedia.org/wiki/Longest_common_substring_problem

外部链接中有补充信息的链接,包括文章中解释的两种算法的 Perl 实现.

There are links to supplemental information in the external links, including Perl implementations of the two algorithms explained in the article.

编辑添加:

基于讨论,我仍然认为最长公共子串可能是这个问题的核心.即使在您在评论中引用的 Journal 示例中,该集合的定义特征也是子字符串Journal".

Based on the discussion, I still think Longest Common Substring could be at the heart of this problem. Even in the Journal example you reference in your comment, the defining characteristic of that set is the substring 'Journal'.

我会首先考虑是什么将集合定义为与其他集合分开.这为您提供了划分数据的分区,然后问题在于测量集合中存在多少共性.如果定义特征是一个公共子串,那么最长公共子串将是一个逻辑起点.

I would first consider what defines a set as separate from the other sets. That gives you your partition to divide up the data, and then the problem is in measuring how much commonality exists within a set. If the defining characteristic is a common substring, then Longest Common Substring would be a logical starting point.

为了使集合检测过程自动化,通常,您需要一个成对的共性度量,您可以使用它来衡量所有可能对之间的差异".然后您需要一种算法来计算导致总体最低总差异的分区.如果差异度量不是最长公共子串,那很好,但是您需要确定它将是什么.显然,它需要是您可以衡量的具体内容.

To automate the process of set detection, in general, you will need a pairwise measure of commonality which you can use to measure the 'difference' between all possible pairs. Then you need an algorithm to compute the partition that results in the overall lowest total difference. If the difference measure is not Longest Common Substring, that's fine, but then you need to determine what it will be. Obviously it needs to be something concrete that you can measure.

还要记住,差异测量的属性将影响可用于进行分区的算法.例如,假设 diff(X,Y) 给出了 X 和 Y 之间差异的度量.那么如果你的距离度量使得 diff(A,C) <= diff(A,B) +差异(B,C).显然 diff(A,C) 应该与 diff(C,A) 相同.

Bear in mind also that the properties of your difference measurement will bear on the algorithms that can be used to make the partition. For example, assume diff(X,Y) gives the measure of difference between X and Y. Then it would probably be useful if your measure of distance was such that diff(A,C) <= diff(A,B) + diff(B,C). And obviously diff(A,C) should be the same as diff(C,A).

考虑到这一点,我也开始怀疑我们是否可以将差异"视为任意两个字符串之间的距离,并且在对距离进行严格定义的情况下,我们是否可以尝试某种对输入字符串进行聚类分析.只是一个想法.

In thinking about this, I also begin to wonder whether we could conceive of the 'difference' as a distance between any two strings, and, with a rigorous definition of the distance, could we then attempt some kind of cluster analysis on the input strings. Just a thought.

这篇关于如何检测字符串列表中的常见子字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆