Is there a fast algorithm to remove repeated substrings in a string?


Question

I have a string like

dxabcabcyyyydxycxcxz

and I want to collapse it to

dxabcydxycxz

Other examples:
ddxddx -> dxdx, abbab -> abab.

The rule is:

if (adjacent and same): merge

# For example, 'abc' and 'abc' are identical and adjacent, so I delete one of them.
# The two 'dx' substrings are identical but not adjacent, so I keep both.
# Once a character has been deleted, I do not delete any substring that includes it.

I did this in Python, but it is slow on long strings.

# original string
mystr = "dxabcabcyyyydxycxcxz"
str_len = len(mystr)
vis = [1] *str_len #Use a list to mark which char is deleted

# enumerate the size of sub-str
for i in range(1,str_len):
    # enumerate the begin of the sub-str
    for j in range(0, str_len):
        offset = 2 #the size of sub-str + 1
        current_sub_str = mystr[j:j+i]
        s_begin = j+i*(offset-1)
        s_end = j+(i*offset)
        # delete all of the same char
        while((j+(i*offset) <= str_len) and current_sub_str == mystr[s_begin:s_end]
              and 0  not in vis[s_begin:s_end] and 0  not in vis[j:j+i]):
            vis[s_begin:s_end] = [0] * (s_end - s_begin) #if I deleted it ,mark it as 0
            offset += 1
            s_begin = j + i * (offset - 1)
            s_end = j + (i * offset)

res = []
for i in range(0,str_len):
    if(vis[i]!=0): res.append(mystr[i])

print("".join(res))

Is there a faster way to solve it?


Update (April 29, 2017)

Sorry, this looks like an XY problem. On the other hand, maybe it is not. Here is the context.

I was writing a web spider and got many 'tag-path's like these:

ul/li/a
ul/li/div/div/div/a/span
ul/li/div/div/div/a/span 
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a

As you can see, some 'tag-path's repeat in the same way, so I wanted to collapse them to find out whether any other 'tag-path's have the same structure. After collapsing, I get 'tag-path's like this:

ul/li/a
ul/li/div/div/div/a/span
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
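
A minimal sketch of this collapse, working directly on a list of 'tag-path's instead of characters. It follows the same rule as the loop earlier in the question: shorter blocks are handled first, and a block that contains an already deleted item is left alone. The function name collapse_blocks is made up for this illustration, and the sketch is not meant to be faster than the original loop.

```python
def collapse_blocks(items):
    """Drop adjacent repeats of a block, handling shorter blocks first."""
    n = len(items)
    keep = [True] * n  # keep[k] is False once items[k] has been deleted

    for size in range(1, n // 2 + 1):
        for start in range(n - 2 * size + 1):
            # Skip blocks that include an already deleted item.
            if not all(keep[start:start + size]):
                continue
            block = items[start:start + size]
            nxt = start + size
            # Delete every identical, untouched block that follows directly.
            while (nxt + size <= n
                   and items[nxt:nxt + size] == block
                   and all(keep[nxt:nxt + size])):
                keep[nxt:nxt + size] = [False] * size
                nxt += size

    return [item for item, kept in zip(items, keep) if kept]
```

Called on the 25 'tag-path's listed above (as a list of strings, with stray whitespace stripped), this should return the 8 collapsed entries shown.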

This was only my idea, and I did not know whether it was a suitable approach. (After trying it, I chose another way to do it.)

However, it is still an interesting problem, like an ACM contest problem.

So I simplified each 'tag-path' to a single character and asked for help, because I could not find a fast way myself. Actually, the problem has many corner cases that I do not mind, and I thank everyone for helping me complete it.
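
The simplification to characters could look roughly like this. The helper name encode_paths and the choice of symbols starting at 'a' are assumptions made for illustration, and the sketch assumes the number of distinct 'tag-path's is small.

```python
def encode_paths(paths):
    """Map each distinct tag-path to one character (hypothetical helper).

    Returns the encoded string and the path -> character mapping.
    """
    symbols = {}
    encoded = []
    for path in paths:
        path = path.strip()  # ignore stray whitespace around a path
        if path not in symbols:
            # Assumption: few distinct paths, so consecutive characters
            # starting at 'a' are enough to keep the symbols unique.
            symbols[path] = chr(ord("a") + len(symbols))
        encoded.append(symbols[path])
    return "".join(encoded), symbols
```

The resulting string can then be handled like the single-character examples at the top of the question.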

Thanks.

Recommended answer

Behold the power of regex:

>>> import re

>>> re.sub(r"(.+?)\1+", r"\1", "dxabcabcyyyydxycxcxz")
'dxabcydxycxz'

>>> re.sub(r"(.+?)\1+", r"\1", "ddxddx")
'dxdx'

>>> re.sub(r"(.+?)\1+", r"\1", "abbab")
'abab'

This looks for a sequence of one or more arbitrary characters, (.+?), matched non-greedily so that shorter sequences are tried first, followed by one or more repetitions of the matched sequence, \1+, and replaces the whole match with just the matched sequence, \1.
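
For repeated use, the same substitution can be wrapped in a small function with a precompiled pattern. This is only a sketch: the names REPEAT and collapse_repeats are made up here, and the pattern is the one from the answer, unchanged.

```python
import re

# Same pattern as above: a lazily matched group followed by one or more
# immediate repetitions of whatever the group captured.
REPEAT = re.compile(r"(.+?)\1+")


def collapse_repeats(text):
    """Replace each run of adjacent repeats with a single copy."""
    return REPEAT.sub(r"\1", text)
```

Two things to keep in mind. By default `.` does not match a newline, so repeats that span lines are not collapsed unless a flag such as re.DOTALL is passed. Also, the regex scans from left to right, while the loop in the question handles all length-1 repeats first, so the two can disagree on some inputs: collapse_repeats("accacc") returns 'acc', whereas the question's loop would produce 'acac'.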
