How to efficiently search for a pattern in large binary files


Question

I have several binary files, most of them larger than 10 GB. In these files I want to find patterns with Python, i.e. the data between the byte sequences 0x01 0x02 0x03 and 0xF1 0xF2 0xF3.

My problem: I know how to handle binary data and how to use search algorithms, but because of the size of the files, reading a file completely before searching is very inefficient. That is why I thought it would be smarter to read the file block by block and search for the pattern inside each block.

My goal: I would like Python to determine the positions (start and stop) of each pattern it finds. Is there a special algorithm, or maybe even a Python library, that I could use to solve this?

Recommended answer

The common way to search for a pattern in a large file is to read the file in chunks into a buffer whose size is the read (chunk) size plus the size of the pattern minus 1.

On the first read, you search for the pattern in the read buffer only. After that, you repeatedly copy the last size_of_pattern-1 bytes from the end of the buffer to its beginning, read a new chunk in after them, and search the whole buffer. That way you are sure to find every occurrence of the pattern, even one that starts in one chunk and ends in the next.
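The approach described above can be sketched in Python roughly as follows. This is a minimal illustration, not a tuned implementation: the function name `find_all`, the `chunk_size` default of 1 MiB, and the use of `bytes.find` for the in-buffer search are all assumptions made for this example. Instead of copying bytes inside a fixed buffer, it keeps the last `len(pattern) - 1` bytes as a `tail` and prepends them to the next chunk, which has the same effect.

```python
def find_all(stream, pattern, chunk_size=1 << 20):
    """Yield the absolute offset of every occurrence of `pattern`
    in a binary stream, reading at most `chunk_size` bytes at a time.

    `stream` is assumed to be a file object opened in binary mode.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")

    overlap = len(pattern) - 1  # bytes carried over between chunks
    tail = b""                  # last `overlap` bytes of the previous buffer
    base = 0                    # absolute file offset of buffer[0]

    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break

        # Search the carried-over tail plus the new chunk, so a match
        # that straddles the chunk boundary is still found.
        buffer = tail + chunk
        pos = buffer.find(pattern)
        while pos != -1:
            yield base + pos
            pos = buffer.find(pattern, pos + 1)

        # The tail is shorter than the pattern, so it can never contain
        # a complete match on its own and nothing is reported twice.
        tail = buffer[-overlap:] if overlap else b""
        base += len(buffer) - len(tail)
```

To use it on a file, open the file with `open(path, "rb")` and pass the file object in. One possible way to get the start and stop positions asked for in the question is to run the search once for 0x01 0x02 0x03 and once for 0xF1 0xF2 0xF3 and then pair up the resulting offsets; how to pair them depends on whether markers can nest or go missing, which the question does not specify.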
