Python efficient way to check if very large string contains a substring
Question
Python is not my best language, and so I'm not all that good at finding the most efficient solutions to some of my problems. I have a very large string (coming from a 30 MB file) and I need to check if that file contains a smaller substring (this string is only a few dozen characters). The way I am currently doing it is:
if small_string in large_string:
    # logic here
    pass
But this seems to be very inefficient because it will check every possible sequence of characters in the file. I know that there will only be an exact match on a newline, so would it be better to read the file in as a list and iterate through that list to match?
To clear up some confusion on "matching on a newline only", here's an example:
small_string = "This is a line"
big_string = "This is a line\nThis is another line\nThis is yet another"
If I'm not mistaken, the in keyword will check all sequences, not just every line.
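The difference between a substring test and the line-only match described above can be sketched as follows. This is only an illustration: the helper name has_exact_line is hypothetical, and it assumes that "matching on a newline" means the small string must equal one whole line.

```python
def has_exact_line(lines, wanted):
    """Return True if any line equals `wanted` exactly (trailing newline ignored)."""
    return any(line.rstrip("\n") == wanted for line in lines)


small_string = "This is a line"
big_string = "This is a line\nThis is another line\nThis is yet another"

# Whole-line comparison: only a complete line counts as a match.
print(has_exact_line(big_string.splitlines(), small_string))  # True
print(has_exact_line(big_string.splitlines(), "is a line"))   # False

# Substring test: matches anywhere, including in the middle of a line.
print("is a line" in big_string)                              # True
```

Because the helper accepts any iterable of lines, an open file object could be passed in directly (for example `with open(path) as handle: has_exact_line(handle, small_string)`, where path is a placeholder), so the 30 MB file would not need to be loaded as one string.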
Recommended answer
You can use one of these algorithms:
Knuth–Morris–Pratt algorithm (aka KMP); see an implementation here
Although I believe KMP is more efficient, it is more complicated to implement. The first link includes some pseudo-code that should make it very easy to implement in Python.
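For reference, a minimal sketch of KMP in Python might look like the code below. It is not taken from the linked implementation. The function names build_failure_table and kmp_find are made up for this example.

```python
def build_failure_table(pattern):
    """For each i, the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the prefix matched so far
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]  # fall back to the next shorter candidate prefix
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    failure = build_failure_table(pattern)
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # reuse the partial match instead of rescanning text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

With the strings from the question, `kmp_find(big_string, small_string) != -1` would play the role of `small_string in big_string`. A pure-Python loop like this is likely to be much slower than the built-in in operator, which runs in C, so it is best treated as an illustration of the algorithm and measured before use.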
You can look for alternatives here: http://en.wikipedia.org/wiki/String_searching_algorithm