在列表中查找模式 [英] Finding patterns in list

查看:169
本文介绍了在列表中查找模式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在寻找一种在整数列表中查找模式的方法,但是我要使用的方法当然适用于字符串和其他具有不同元素的列表。现在,让我解释一下我要寻找的内容。

I am currently searching for a way to find patterns in a list of integers, but the method I am going to use would be applicable to strings and other lists with different elements of course. Now let me explain what I am looking for.

我想在整数列表中找到最长的重复模式。例如,

I want to find the longest repeating pattern in a list of integers. For example,

[1, 2, 3, 4, 1, 2, 3]
# This list would give 1, 2, 3

应该丢弃重叠模式。 (不确定)

Overlapping patterns should be discarded. ( Not certain )

[1, 1, 1, 1, 1]
# Should give 1, 1  Not 1, 1, 1, 1

以下是对我没有帮助的内容。

在列表中查找模式(不理解第一个答案的逻辑,很少解释。第二个答案只有在解决之前知道模式才解决问题。)

Finding patterns in a list (Did not understand the logic behind the first answer, very little explanation. And second answer solves the problem only if the pattern is known before solving.)

< a href = https://stackoverflow.com/questions/33928775/finding-integer-pattern-from-a-list>从列表中查找整数模式(给出了模式并希望出现次数。与我的问题不同。)

Finding integer pattern from a list (Pattern is given and number of occurrence is wanted. Different than my question.)

最长的普通子序列问题(大多数人都处理过这个问题,但是这不是我的问题。在寻找模式时,我需要连续的元素。但是

Longest common subsequence problem (Most people dealed with this problem however it is not close to mine. I need consecutive elements while searching for a pattern. However in this, seperate elements also counted as subsequences.)

这是我尝试过的内容。

def pattern(seq):
    n = len(seq)
    c = defaultdict(int) # Counts of each subsequence
    for i in xrange(n):
        for j in xrange(i + 1, min(n, n / 2 + i)): 
            # Used n / 2 because I figured if a pattern is being searched
            # It cant be longer that the half of the list.
            c[tuple(seq[i:j])] += 1      
    return c

如您所见,它将找到所有子列表并检查重复。我发现这种方法有些幼稚(效率低下),我需要一种更好的方法。请帮我。

As you see, It finds all the sublists and check for repeats. I found this approach a bit naive(and inefficient) and I am in need of a better way. Please help me. Thanks in advance.

注意1:该列表是预先确定的,但是由于我的算法失败,我只能在冻结之前检查列表的某些部分电脑。因此,我尝试查找的模式可能会比搜索列表的一半长得多,甚至会比搜索列表本身更长,因为我只搜索原始列表的一部分。我正在使用,我可以搜索原始列表的较大部分,因此可以更好地找到模式。 (如果有的话。)

Note1: The list is predetermined but because of my algorithms failure, I can only check some parts of the list before freezing the computer. So the pattern I am trying to find can very well be longer than the half of the search list, It can even be longer than the search list itself because I am searching only a part of the original list.If you present a better method than I am using, I can search a larger part of the original list so I will have a better chance at finding the pattern. (If there is one.)

注意2::如果您想自己进行测试,则这是列表的一部分。似乎确实存在一种模式,但是在使用可靠的代码测试它之前我无法确定。
示例列表

Note2: Here is a part of the list if you want to test it yourself. It really seems like that there is a pattern, but I cannot be sure before I test it with a reliable code. Sample List

注意3:我认为这是一个严重的数据挖掘问题。并会尝试学习如果您进行详细说明。感觉这比LCS更为重要,但是LCS更为流行:D我正在尝试寻找这种方法,就像科学家用来发现DNA模式的方法一样。

Note3: I approach this as a serious problem of data mining. And will try to learn if you make a long explanation. This feels like a much more important problem than LCS, however LCS is much more popular :D This method I am trying to find feels like the methods scientists use to find DNA patterns.

推荐答案

代码



忽略不重叠的要求,这是我使用的代码:

The Code

Ignoring the "no overlapping" requirement, here's the code I used:

def pattern(seq):
        storage = {}
        for length in range(1,len(seq)/2+1):
                valid_strings = {}
                for start in range(0,len(seq)-length+1):
                        valid_strings[start] = tuple(seq[start:start+length])
                candidates = set(valid_strings.values())
                if len(candidates) != len(valid_strings.values()):
                        print "Pattern found for " + str(length)
                        storage = valid_strings
                else:
                        print "No pattern found for " + str(length)
                        return set(filter(lambda x: storage.values().count(x) > 1, storage.values()))
        return storage

使用它,我发现8数据集中长度303的不同模式。该程序也运行得非常快。

Using that, I found 8 distinct patterns of length 303 in your dataset. The program ran pretty fast, too.

define patterns(sequence):
    list_of_substrings = {}
    for each valid length:  ### i.e. lengths from 1 to half the list's length
        generate a dictionary my_dict of all sub-lists of size length
        if there are repeats:
            list_of_substrings = my_dict
        else:
            return all repeated values in list_of_substrings
    return list_of_substrings  #### returns {} when there are no patterns

这篇关于在列表中查找模式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆