Opening memory-mapped file with encoding


Question

A memory-mapped file is an efficient way to run regexes over, or otherwise manipulate, large binary files.

If I have a large text file (~1GB), is it possible to work with an encoding-aware mapped file?
A regex like [\u1234-\u5678] won't work on bytes objects, and encoding the pattern does not help either ("[\u1234-\u5678]".encode("utf-32"), for example, will not express the range correctly).
Searching might work if I convert the search pattern from str to bytes using .encode(), but that is still somewhat limited, and there should be a simpler way than decoding and encoding all day.

I have tried wrapping it with io.TextIOWrapper inside an io.BufferedRandom, but to no avail:

AttributeError: 'mmap.mmap' object has no attribute 'seekable'

Creating a wrapper (using inheritance) whose seekable, readable and writable methods return True did not work either.

Regarding encoding, a fixed-length encoding may be assumed: utf-32 (one unit per code point), or utf-16 restricted to the BMP (if it is even possible to refer to just that part).

A solution for any Python version is welcome.

Recommended answer

You can't do this without either essentially reinventing the wheel from scratch (writing all-new versions of the re module, the mmap module, etc.) or writing extraordinarily complex regexes that can't use niceties like true Unicode character ranges. To express [\u1234-\u5678] at the byte level you'd need an alternation between three different patterns, something like (?:\x12[\x34-\xff]|[\x13-\x55].|\x56[\x00-\x78]).
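To make the cost of that byte-level approach concrete, here is a minimal sketch of it. It assumes the data is UTF-16 big-endian restricted to the BMP, since the alternation above puts the high byte first. The names RANGE_UTF16BE and iter_aligned_matches are hypothetical helpers, not part of re or mmap. Two details go beyond the pattern as quoted: re.DOTALL is added so that "." also matches a low byte of 0x0A, and matches that start on an odd byte offset are rejected because they straddle two characters.

```python
import re

# Byte-level stand-in for [\u1234-\u5678], assuming UTF-16 big-endian, BMP only.
# re.DOTALL lets "." match the byte 0x0A, which it would otherwise skip.
RANGE_UTF16BE = re.compile(
    rb"(?:\x12[\x34-\xff]|[\x13-\x55].|\x56[\x00-\x78])", re.DOTALL
)


def iter_aligned_matches(pattern, buf, width=2):
    """Yield matches of a bytes pattern that start on a code-unit boundary.

    buf may be any bytes-like object, including an mmap.mmap instance.
    """
    pos = 0
    while True:
        m = pattern.search(buf, pos)
        if m is None:
            return
        if m.start() % width == 0:
            yield m
            pos = m.end()
        else:
            # The hit straddles two characters; retry one byte further on.
            pos = m.start() + 1


text = "a\u1234b\u5678\u5679\u0012\u3456"
data = text.encode("utf-16-be")
# Convert byte offsets back to character indexes (2 bytes per BMP character).
char_indexes = [m.start() // 2 for m in iter_aligned_matches(RANGE_UTF16BE, data)]
```

Even for this single range the pattern is hard to read and easy to get wrong, which is the point the answer is making.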

Basically, re patterns work either with str or with bytes-like objects, never a mix of the two (and you can't work around that with memoryviews and casts, because re still treats the buffer as bytes, not as larger units).
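The str/bytes split is easy to observe directly. The sketch below uses an anonymous mapping (mmap.mmap(-1, n)) as a stand-in for a real file-backed mmap, and UTF-32 little-endian as an assumed encoding; the variable names are illustrative only.

```python
import mmap
import re

text = "abc\u4e2d\u6587def"
data = text.encode("utf-32-le")

# An anonymous mapping stands in for a real file-backed mmap here.
mm = mmap.mmap(-1, len(data))
mm.write(data)

# A str pattern is rejected outright on a bytes-like object.
try:
    re.search("[\u1234-\u5678]", mm)
    str_pattern_error = None
except TypeError as exc:
    str_pattern_error = exc

# A bytes pattern is accepted, but it matches raw bytes, not characters.
bytes_match = re.search(re.escape("d".encode("utf-32-le")), mm)
byte_offset = bytes_match.start()
char_index = byte_offset // 4  # valid only because UTF-32 is fixed-width
```

The bytes pattern finds the right place here, but only because the needle was encoded by hand and the offset converted back by hand, which is the decode/encode churn the question wants to avoid.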

For simple searches, you could try using mmap.find after encoding the string you are searching for, but that's still prone to subtle bugs. For UCS-2 or UTF-32, you'd need to check that the return value from find is aligned on a two- or four-byte boundary respectively, to ensure you haven't mistaken the end of one character and the beginning of the next for a completely different character. If the alignment test fails, you have to repeat the search with a start offset of the last return value + 1 until you either get a hit or find returns -1. It's just not a reasonable thing to do in the general case.
