Python string processing optimization


Problem Description

So lately I've been making a Python script for extracting data from large text files (> 1 GB). The problem basically boils down to selecting lines of text from the file and searching them for strings from some array (this array can have as many as 1000 strings in it). The problem here is that I have to find a specific occurrence of the string, and the string may appear an unlimited number of times in that file. Also, some decoding and encoding is required, which additionally slows the script down. The code looks something like this:

strings = [a for a in open('file.txt')]

with open("er.txt", "r") as f:
    for chunk in f:
        for s in strings:
            pass  # do search, trimming, stripping ..

My question here is: is there a way to optimize this? I tried multiprocessing, but it helps little (or at least the way I implemented it). The problem here is that these chunk operations aren't independent, and the strings list may be altered during one of them. Any optimization would help (string-search algorithms, file reading, etc.). I did as much as I could regarding loop breaking, but it still runs pretty slow.

Solution

If you know exactly how the string is encoded in binary (ASCII, UTF-8), you can mmap the entire file into memory in one go; it will behave exactly like a large bytearray/bytes (or str in Python 2) obtained by file.read(), and such a mmap object is then searchable with a str regular expression (Python 2) or a bytes regular expression (Python 3).
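Because the question searches for up to 1000 literal strings rather than a single pattern, one way to apply this idea is to fold the whole list into one alternation pattern. The sketch below is illustrative rather than part of the original answer; it assumes the search strings in file.txt are plain literals, one per line, already in the same byte encoding as the data file.

import re

# Read the search strings as bytes, one per line (hypothetical file layout).
with open('file.txt', 'rb') as f:
    needles = [line.rstrip(b'\r\n') for line in f if line.strip()]

# re.escape() neutralises any regex metacharacters in the literals, and the
# alternation lets a single finditer() pass over the mmap look for all of them.
pattern = re.compile(b'|'.join(re.escape(n) for n in needles))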

The mmap is the fastest solution on many operating systems, because the read-only mapping means that the OS can freely map in the pages as they're ready; no swap space is required, because the data is backed by a file. The OS can also directly map the data from the buffer cache with zero copying - thus a win-win-win over bare reading.

Example:

import mmap
import re

pattern = re.compile(b'the ultimate answer is ([0-9]+)')
with open("datafile.txt", "rb") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

    # PROT_READ only on *nix as the file is not writable
    for match in pattern.finditer(mm):
        # process match
        print("The answer is {}".format(match.group(1).decode('ascii')))

    mm.close()

Now, if the datafile.txt contained the text:

the ultimate answer is 42

somewhere along the 1 gigabyte of data, this program would be among the fastest Python solutions to spit out:

The answer is 42

Notice that pattern.finditer also accepts pos and endpos parameters that can be used to limit the range where the match is attempted.
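As a small illustration, reusing mm and pattern from the example above (before mm.close() is called), the search can be restricted to a prefix of the mapping; the 100 MiB bound here is purely arbitrary:

# Hypothetical example: only scan the first 100 MiB of the mapping.
limit = 100 * 1024 * 1024
for match in pattern.finditer(mm, 0, limit):
    print("The answer is {}".format(match.group(1).decode('ascii')))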


As noted by ivan_pozdeev, this requires 1 gigabyte of free virtual address space for mapping a gigabyte file (but not necessarily 1 gigabyte of RAM), which might be difficult in a 32-bit process but can almost certainly be assumed a non-issue on 64-bit operating systems and CPUs. In a 32-bit process the approach still works, but you need to map big files in smaller chunks - thus now the bits of the operating system and processor truly matter.
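A minimal sketch of that chunked variant might look like the following; it is not from the original answer. The window and overlap sizes are arbitrary illustrative choices, the window is assumed to be a multiple of mmap.ALLOCATIONGRANULARITY (as mmap offsets must be), and the overlap must be at least as long as the longest possible match so that nothing straddling a window boundary is missed. Matches that fall entirely inside an overlap region may be reported twice.

import mmap
import os
import re

pattern = re.compile(b'the ultimate answer is ([0-9]+)')

WINDOW = 64 * 1024 * 1024   # illustrative window size, assumed to be a
                            # multiple of mmap.ALLOCATIONGRANULARITY
OVERLAP = 4096              # must cover the longest possible match

with open("datafile.txt", "rb") as f:
    size = os.fstat(f.fileno()).st_size
    offset = 0
    while offset < size:
        # never map past the end of the file
        length = min(WINDOW + OVERLAP, size - offset)
        # PROT_READ only on *nix, as in the example above
        mm = mmap.mmap(f.fileno(), length, prot=mmap.PROT_READ, offset=offset)
        for match in pattern.finditer(mm):
            print("The answer is {}".format(match.group(1).decode('ascii')))
        mm.close()
        offset += WINDOW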
