Python: ctypes hashable c_char array replacement without tripping over '\0' bytes


Question

For illustration purposes, this script creates a file mapfile containing the content of the files given as arguments, each prepended by a binary header with a sha1 checksum that allows for duplicate detection on subsequent runs.

What's needed here is a hashable ctypes.c_char replacement that can hold sha1 checksums with minimal fuss, but without choking on '\0' bytes.

# -*- coding: utf-8 -*-

import io
import mmap
import ctypes
import hashlib
import logging

from collections import OrderedDict

log = logging.getLogger(__file__)

def align(size, alignment):
    """return size aligned to alignment"""
    excess = size % alignment
    if excess:
        size = size - excess + alignment
    return size


class Header(ctypes.Structure):
    Identifier = b'HEAD'
    _fields_ = [
        ('id', ctypes.c_char * 4),
        ('hlen', ctypes.c_uint16),
        ('plen', ctypes.c_uint32),
        ('name', ctypes.c_char * 128),
        ('sha1', ctypes.c_char * 20),
    ]
HeaderSize = ctypes.sizeof(Header)

class CtsMap:
    def __init__(self, ctcls, mm, offset = 0):
        self.ctcls = ctcls
        self.mm = mm
        self.offset = offset

    def __enter__(self):
        mm = self.mm
        offset = self.offset
        ctsize = ctypes.sizeof(self.ctcls)
        if offset + ctsize > mm.size():
            newsize = align(offset + ctsize, mmap.PAGESIZE)
            mm.resize(newsize)
        self.ctinst = self.ctcls.from_buffer(mm, offset)
        return self.ctinst

    def __exit__(self, exc_type, exc_value, exc_traceback):
        del self.ctinst
        self.ctinst = None

class MapFile:
    def __init__(self, filename):
        try:
            # try to create initial file
            mapsize = mmap.PAGESIZE
            self._fd = open(filename, 'x+b')
            self._fd.write(b'\0' * mapsize)
        except FileExistsError:
            # file exists and is writable
            self._fd = open(filename, 'r+b')
            self._fd.seek(0, io.SEEK_END)
            mapsize = self._fd.tell()
        # mmap this file completely
        self._fd.seek(0)
        self._mm = mmap.mmap(self._fd.fileno(), mapsize)
        self._offset = 0
        self._toc = OrderedDict()
        self.gen_toc()

    def gen_toc(self):
        while self._offset < self._mm.size():
            with CtsMap(Header, self._mm, self._offset) as hd:
                if hd.id == Header.Identifier and hd.hlen == HeaderSize:
                    self._toc[hd.sha1] = self._offset
                    log.debug('toc: [%s]%s: %s', len(hd.sha1), hd.sha1, self._offset)
                    self._offset += HeaderSize + hd.plen
                else:
                    break
            del hd

    def add_data(self, datafile, data):
        datasize = len(data)
        sha1 = hashlib.sha1()
        sha1.update(data)
        digest = sha1.digest()

        if digest in self._toc:
            log.debug('add_data: %s added already', digest)
            return None

        log.debug('add_data: %s, %s bytes, %s', datafile, datasize, digest)
        with CtsMap(Header, self._mm, self._offset) as hd:
            hd.id = Header.Identifier
            hd.hlen = HeaderSize
            hd.plen = datasize
            hd.name = datafile
            hd.sha1 = digest
        del hd
        self._offset += HeaderSize

        log.debug('add_data: %s', datasize)
        blktype = ctypes.c_char * datasize
        with CtsMap(blktype, self._mm, self._offset) as blk:
            blk.raw = data
        del blk
        self._offset += datasize
        return HeaderSize + datasize

    def close(self):
        self._mm.close()
        self._fd.close()


if __name__ == '__main__':
    import os
    import sys

    logconfig = dict(
        level = logging.DEBUG,
        format = '%(levelname)5s: %(message)s',
    )
    logging.basicConfig(**logconfig)

    mf = MapFile('mapfile')
    for datafile in sys.argv[1:]:
        if os.path.isfile(datafile):
            try:
                data = open(datafile, 'rb').read()
            except OSError:
                continue
            else:
                mf.add_data(datafile.encode('utf-8'), data)
    mf.close()

Run: python3 hashable_ctypes_bytes.py somefiles*

Invoking it a second time, the script reads through the file, collecting all items in an ordered dict using the sha1 digest as key. Unfortunately, the c_char array semantics are a little weird: such an array also behaves like a '\0'-terminated C string, resulting in truncated checksums here.

See lines 3 and 4 of the following output:

DEBUG: toc: [20]b'\xcd0\xd7\xd3\xbf\x9f\xe1\xfe\xffr\xa6g#\xee\xf8\x84\xb5S,u': 0
DEBUG: toc: [20]b'\xe9\xfe\x1a;i\xcdG0\x84\x1b\r\x7f\xf9\x14\x868\xbdVl\x8d': 1273
DEBUG: toc: [19]b'\xa2\xdb\xff$&\xfe\x0f\xb4\xcaB<F\x92\xc0\xf1`(\x96N': 3642
DEBUG: toc: [15]b'O\x1b~c\x82\xeb)\x8f\xb5\x9c\x15\xd5e:\xa9': 4650
DEBUG: toc: [20]b'\x80\xe9\xbcF\x97\xdc\x93DG\x90\x19\x8c\xca\xfep\x05\xbdM\xfby': 13841
DEBUG: add_data: b'\xcd0\xd7\xd3\xbf\x9f\xe1\xfe\xffr\xa6g#\xee\xf8\x84\xb5S,u' added already
DEBUG: add_data: b'\xe9\xfe\x1a;i\xcdG0\x84\x1b\r\x7f\xf9\x14\x868\xbdVl\x8d' added already
DEBUG: add_data: b'../python/tmp/colorselect.py', 848 bytes, b'\xa2\xdb\xff$&\xfe\x0f\xb4\xcaB<F\x92\xc0\xf1`(\x96N\x00'
DEBUG: add_data: 848
DEBUG: add_data: b'../python/tmp/DemoCode.py', 9031 bytes, b'O\x1b~c\x82\xeb)\x8f\xb5\x9c\x15\xd5e:\xa9\x00p\x0f\xc04'
DEBUG: add_data: 9031
DEBUG: add_data: b'\x80\xe9\xbcF\x97\xdc\x93DG\x90\x19\x8c\xca\xfep\x05\xbdM\xfby' added already
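The truncation can be reproduced in isolation: reading a c_char array field of a Structure behaves like the array's .value attribute and stops at the first '\0' byte. A minimal standalone sketch (the structure name S is only for illustration):

```python
import ctypes

class S(ctypes.Structure):
    _fields_ = [('sha1', ctypes.c_char * 20)]

s = S()
# a fake digest containing a NUL byte at position 9
s.sha1 = b'\xaa' * 9 + b'\x00' + b'\xbb' * 10
# field access behaves like .value on a c_char array: it stops at the NUL
print(len(s.sha1))  # 9, not 20
```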

The usual suggestion is to replace the c_char * 20 with c_byte * 20, losing the transparent bytes handling along the way. Apart from the data conversion hassles, c_byte arrays aren't hashable, because they behave like bytearrays. I haven't found a practical solution that avoids heavy conversions back and forth, short of resorting to hexdigests, which doubles the sha1 digest's size requirement.

I think the design decision to mix c_char with C zero-termination semantics was a mistake in the first place. To cope with this, I could imagine adding a c_char_nz type to ctypes that resolves this issue.

For those of you who read the code carefully, you might wonder about the del statements for the ctypes structures. A discussion of them can be found here:.

Accepted answer

While the code below does the back-and-forth conversion you mentioned, it hides the issue nicely. I tested it with a hash containing a null byte, and the field can now be used as a dictionary key. Hope it helps.

from ctypes import *
import hashlib

class Test(Structure):
    # store the digest as raw unsigned bytes, avoiding c_char's
    # '\0'-termination semantics
    _fields_ = [('_sha1', c_ubyte * 20)]

    @property
    def sha1(self):
        # bytes() copies all 20 bytes and is immutable, hence hashable
        return bytes(self._sha1)

    @sha1.setter
    def sha1(self, value):
        self._sha1 = (c_ubyte * 20)(*value)

test = Test()
test.sha1 = hashlib.sha1(b'aaaaaaaaaaa').digest()
D = {test.sha1:0}
print(D)

Output:

{b'u\\\x00\x1fJ\xe3\xc8\x84>ZP\xddj\xa2\xfa#\x89=\xd3\xad': 0}
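Applied back to the question's Header structure, the same property pattern might look like the sketch below. This assumes the rest of the script accesses the checksum only through the sha1 name; the memory layout is unchanged, since c_ubyte * 20 occupies the same 20 bytes as c_char * 20:

```python
import ctypes
import hashlib

class Header(ctypes.Structure):
    Identifier = b'HEAD'
    _fields_ = [
        ('id', ctypes.c_char * 4),
        ('hlen', ctypes.c_uint16),
        ('plen', ctypes.c_uint32),
        ('name', ctypes.c_char * 128),
        ('_sha1', ctypes.c_ubyte * 20),  # raw bytes, no NUL-termination semantics
    ]

    @property
    def sha1(self):
        return bytes(self._sha1)         # immutable, hashable, always 20 bytes

    @sha1.setter
    def sha1(self, value):
        self._sha1 = (ctypes.c_ubyte * 20)(*value)

hd = Header()
hd.sha1 = hashlib.sha1(b'some payload').digest()
assert len(hd.sha1) == 20               # no truncation, even with embedded NULs
toc = {hd.sha1: 0}                      # usable as a dict key
```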

