Find Repeating Sublist Within Large List


Problem description

I have a large list of sub-lists (approx. 16000), and I want to find where a repeating pattern in it starts and ends. I am not 100% sure that there is a repeat, but I have strong reason to believe so because of the diagonals that appear within the sub-list sequence. I would prefer to keep the list-of-sub-lists structure, since it is used that way for other things in this script. The data looks like this:

data = ['1100100100000010',
        '1001001000000110',
        '0010010000001100',
        '0100100000011011', etc

I do not have any time constraints, though the fastest method would not be frowned upon. The code should be able to return the starting/ending sequence and its location within the list, so that it can be called upon in the future. If there is an arrangement of the data that would be more useful, I can try to reformat it if necessary. I have only been learning Python for the past few months, so I am not yet able to create my own algorithms from scratch. Thank you!

Recommended answer

Here's some fairly simple code that scans a string for adjacent repeating subsequences. The sub-lists are first joined into one long string, so the indices it reports are character positions in that joined string. Set minrun to the length of the smallest subsequence that you want to check. For each match, the code prints the starting index of the first subsequence, the length of the subsequence, and the subsequence itself.

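data = [
    '1100100100000010',
    '1001001000000110',
    '0010010000001100',
    '0100100000011011',
]
# Join the rows into one long string; printed indices refer to this string.
data = ''.join(data)

minrun = 3  # length of the shortest subsequence to check
lendata = len(data)
# A repeat needs two adjacent copies, so a run can be at most half the data.
for runlen in range(minrun, lendata // 2 + 1):
    i = 0
    # Use <= so the pair that ends exactly at the end of the data is checked too.
    while i <= lendata - runlen * 2:
        s1 = data[i:i + runlen]
        s2 = data[i + runlen:i + runlen * 2]
        if s1 == s2:
            print(i, runlen, s1)
            # Jump to the second copy so longer chains of copies are reported in turn.
            i += runlen
        else:
            i += 1

Output

1 3 100
4 3 100
8 3 000
15 3 010
18 3 010
23 3 000
32 3 001
38 3 000
47 3 001
53 3 000
58 3 011
17 15 001001000000110
32 15 001001000000110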

Note that the same length-3 sequence, 010, is reported at index 15 and again at index 18 = 15 + 3; that indicates that there are 3 adjacent copies of 010 (starting at indices 15, 18, and 21). Similarly, the length-15 sequence is reported at index 17 and again at index 32 = 17 + 15, so there are 3 adjacent copies of it starting at index 17.
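Since the question asks for something that returns the repeats and their location within the list, here is a minimal sketch of one way to do that. It reuses the same brute-force scan, but counts how many adjacent copies follow each match and returns one tuple per chain instead of printing every pair. The names find_adjacent_repeats, copies, and width are made up for this illustration, the loop bounds are inclusive so that a repeat ending exactly at the end of the string is also checked, and the row/column conversion assumes every sub-list has the same length.

```python
def find_adjacent_repeats(text, minrun=3):
    """Return a list of (start, runlen, copies) tuples.

    Each tuple describes `copies` back-to-back copies of the substring
    text[start:start + runlen], with copies >= 2.
    """
    results = []
    n = len(text)
    for runlen in range(minrun, n // 2 + 1):
        i = 0
        while i <= n - 2 * runlen:
            unit = text[i:i + runlen]
            copies = 1
            # Count how many times `unit` repeats immediately after itself.
            # Past the end of the text the slice is shorter than `unit`,
            # so the comparison fails and the loop stops.
            while text[i + copies * runlen:i + (copies + 1) * runlen] == unit:
                copies += 1
            if copies > 1:
                results.append((i, runlen, copies))
                # Resume at the last copy, as the pairwise scan would.
                i += (copies - 1) * runlen
            else:
                i += 1
    return results


data = [
    '1100100100000010',
    '1001001000000110',
    '0010010000001100',
    '0100100000011011',
]
width = len(data[0])  # assumption: all sub-lists have the same length
repeats = find_adjacent_repeats(''.join(data), minrun=3)

for start, runlen, copies in repeats:
    # Convert the position in the joined string back to (sub-list, offset).
    row, col = divmod(start, width)
    print(f"start={start} (row {row}, col {col}) length={runlen} copies={copies}")
```

For the sample data, the result includes (15, 3, 3) and (17, 15, 3), which correspond to the two chains of three copies described above; index 17 maps back to row 1, column 1 of the original list. This is still a brute-force scan whose work grows roughly with the square of the string length, so on the full 16000-row data it may take a long time.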
