Python string processing optimization
Problem description

So lately I've been making a Python script for extracting data from large text files (> 1 GB). The problem basically sums up to selecting lines of text from the file and searching them for strings from some array (this array can have as many as 1000 strings in it). The problem here is that I have to find a specific occurrence of the string, and the string may appear an unlimited number of times in that file. Also, some decoding and encoding is required, which additionally slows the script down.
My question here is: is there a way to optimize this? I tried multiprocessing, but it helps little (or at least the way I implemented it). The problem here is that these chunk operations aren't independent, and the strings list may be altered during one of them. Any optimization would help (string search algorithms, file reading, etc.). I did as much as I could regarding loop breaking, but it still runs pretty slow.

Code looks something like this:
strings = [a for a in open('file.txt')]

with open("er.txt", "r") as f:
    for chunk in f:
        for s in strings:
            # do search, trimming, stripping ...
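A side note on the inner loop above (a sketch of one common optimization, not part of the original question or answer): when the same chunk must be checked against many fixed strings, the literals can be escaped and joined into a single compiled alternation, so each chunk is scanned once instead of once per string. The names `strings` and `chunk` mirror the snippet above; the sample values are made up:

```python
import re

# Hypothetical stand-in for the ~1000 search strings.
strings = ["needle", "foo", "bar"]

# Escape each literal and join into one alternation pattern,
# so a single pass over the text finds occurrences of any of them.
pattern = re.compile("|".join(map(re.escape, strings)))

chunk = "a foo here, a needle there"
found = [m.group(0) for m in pattern.finditer(chunk)]
print(found)  # → ['foo', 'needle']
```

Whether this helps depends on the patterns and the regex engine; it is only a starting point, not a substitute for the mmap approach below.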
Recommended answer

If you can know exactly how the string is encoded in binary (ASCII, UTF-8), you can mmap the entire file into memory at once; it behaves exactly like a large bytearray/bytes (or str in Python 2) obtained by file.read(), and such an mmap object can then be searched with a str regular expression (Python 2) or a bytes regular expression (Python 3).

mmap is the fastest solution on many operating systems, because the read-only mapping means the OS can freely map pages in as they are needed; no swap space is required, because the data is backed by the file. The OS can also map the data directly from the buffer cache with zero copying - thus a win-win-win over bare reading.

Example:

import mmap
import re

pattern = re.compile(b'the ultimate answer is ([0-9]+)')

with open("datafile.txt", "rb") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    # PROT_READ only on *nix as the file is not writable
    for match in pattern.finditer(mm):
        # process match
        print("The answer is {}".format(match.group(1).decode('ascii')))
    mm.close()
Now, if datafile.txt contained the text "the ultimate answer is 42" somewhere along the 1 gigabyte of data, this program would be among the fastest Python solutions to spit out:

The answer is 42
Notice that pattern.finditer also accepts start and end parameters that can be used to limit the range where the match is attempted.

As noted by ivan_pozdeev, this requires 1 gigabyte of free virtual address space for mapping a gigabyte file (but not necessarily 1 gigabyte of RAM), which might be difficult in a 32-bit process but can almost certainly be assumed a "no-problem" on 64-bit operating systems and CPUs. On 32-bit processes the approach still works, but you need to map big files in smaller chunks - so the bitness of the operating system and processor now truly matters.
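The start/end parameters can be sketched in a self-contained way. The example below writes a small temporary file standing in for datafile.txt (the path and contents are illustrative); it also uses access=mmap.ACCESS_READ, the cross-platform spelling of a read-only mapping, instead of the *nix-only prot=mmap.PROT_READ:

```python
import mmap
import os
import re
import tempfile

pattern = re.compile(b"the ultimate answer is ([0-9]+)")

# Write a small illustrative data file.
path = os.path.join(tempfile.mkdtemp(), "datafile.txt")
with open(path, "wb") as f:
    f.write(b"noise... the ultimate answer is 42 ...more noise")

with open(path, "rb") as f:
    # ACCESS_READ is the portable way to request a read-only mapping.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Restrict the search to the first 10 bytes: the pattern is not there.
    early = pattern.search(mm, 0, 10)
    print(early)  # → None
    # Search the whole mapping: the match is found.
    answer = pattern.search(mm).group(1).decode("ascii")
    print(answer)  # → 42
    mm.close()
```

Limiting the range this way is how the answer's chunked approach for 32-bit processes would be driven: map a window of the file, search it, then move on to the next window.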