Fastest text search method in a large text file
Question
I am doing a text search in a rather big text file (100k lines, 7 MB). The text is not that big, but I need to run a lot of searches. I want to look for a target string and return the line where it appears. My text file is formatted so that the target can only appear on one line.
What is the most efficient way? I do a lot of searches, so I want to improve speed. Here is my code right now:
import os

def lookup_line(target):
    # Returns the line containing the target, or None if it doesn't exist
    line = None
    dir = os.path.dirname(__file__)
    path = dir + '/file.txt'
    file = open(path, 'r')
    while line == None:
        l = file.readline()
        l = unicode(l, 'utf-8')
        if target in l:
            break
        if l == '': break  # happens at end of file, then stop loop
    line = l
    if line == '': line = None  # end of file, nothing has been found
    file.close()
    return line
I use this Python code in a Google App Engine app.
Thanks!
Recommended answer
- Load the whole text into RAM at once. Don't read line by line.
- Search for the pattern in the blob. If you find it, use text.count('\n',0,pos) to get the (zero-based) line number.
- If you don't need the line number, look for the previous and next EOL to cut the line out of the text.
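The steps above could be sketched roughly as follows. This is a minimal illustration, not the answerer's own code: it is written for Python 3 (the question's code is Python 2), and the helper names `load_text` and `find_line` are made up for this example. The idea is to read the file once, keep the string around, and reuse it for every search.

```python
def load_text(path):
    # Hypothetical helper: read the whole file into one string, once.
    with open(path, encoding="utf-8") as f:
        return f.read()


def find_line(text, target):
    """Return (zero_based_line_number, line) for the first match, or None."""
    pos = text.find(target)
    if pos == -1:
        return None
    # Number of newlines before the match = zero-based line number.
    line_number = text.count("\n", 0, pos)
    # Previous EOL (rfind gives -1 when the match is on the first line,
    # so adding 1 yields 0, the start of the text).
    start = text.rfind("\n", 0, pos) + 1
    # Next EOL, or the end of the text when the match is on the last line.
    end = text.find("\n", pos)
    if end == -1:
        end = len(text)
    return line_number, text[start:end]


# Small in-memory demo; a real caller would use load_text(path) instead.
sample = "alpha\nbeta gamma\ndelta"
print(find_line(sample, "gamma"))
```

If only the line is needed, the `text.count` call can be dropped, which avoids scanning the text before the match a second time.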
The loop in Python is slow. String searching is very fast. If you need to look for several strings, use regular expressions.
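For the several-strings case, one possible approach is to combine all targets into a single alternation pattern so the text is scanned once. The sketch below assumes Python 3 and literal (non-regex) targets; `find_lines` is a hypothetical name for this example. Because `finditer` only returns non-overlapping matches, targets that overlap each other in the text may be missed.

```python
import re


def find_lines(text, targets):
    """Map each target that occurs in text to the line of its first match."""
    if not targets:
        return {}
    # Escape targets so they match literally; try longer ones first so a
    # short target does not shadow a longer one starting at the same spot.
    ordered = sorted(targets, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    found = {}
    for match in pattern.finditer(text):
        target = match.group(0)
        if target in found:
            continue  # keep only the first occurrence
        # Cut the surrounding line out using the previous and next EOL.
        start = text.rfind("\n", 0, match.start()) + 1
        end = text.find("\n", match.end())
        if end == -1:
            end = len(text)
        found[target] = text[start:end]
    return found
```

Targets that never occur are simply absent from the returned dictionary.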
If that's not fast enough, use an external program like grep.
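Calling grep from Python might look like the sketch below. It assumes Python 3.7 or later and a `grep` binary on the PATH; `grep_line` is a hypothetical name for this example. Sandboxed hosting environments may not allow spawning subprocesses at all, so this is only an option where that is permitted.

```python
import shutil
import subprocess


def grep_line(path, target):
    """Return the first line of the file containing target, or None."""
    if shutil.which("grep") is None:
        raise RuntimeError("grep not found on PATH")
    # -F: treat the target as a fixed string, not a regex
    # -m 1: stop after the first matching line
    # -e: pass the pattern explicitly, so a leading '-' is not read as an option
    result = subprocess.run(
        ["grep", "-F", "-m", "1", "-e", target, "--", path],
        capture_output=True,
        encoding="utf-8",
    )
    if result.returncode == 0:
        return result.stdout.rstrip("\n")
    if result.returncode == 1:
        return None  # grep exits with 1 when nothing matched
    raise RuntimeError(result.stderr.strip())
```

Each call starts a new process, so this tends to pay off only for large files or infrequent searches.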