Is it possible to restore corrupted “interned” bytes-objects

Published: 2020/10/11 1:04:22 · Tags: python, python-3.x, cpython, python-internals


Question

It is well known that small bytes objects are automatically "interned" by CPython (similar to the intern function for strings). Correction: as explained by @abarnert, it is more like the integer pool than the interned strings.

Is it possible to restore the interned bytes objects after they have been corrupted by, let's say, an "experimental" third-party library, or is restarting the kernel the only way?

The proof of concept can be done with Cython-functionality (Cython>=0.28):

%%cython
def do_bad_things():
    cdef bytes b = b'a'
    cdef const unsigned char[:] safe = b
    cdef char *unsafe = <char *> &safe[0]   # who needs const and type-safety anyway?
    unsafe[0] = 98                          # overwrite with ord('b')

or as suggested by @jfs through ctypes:

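import ctypes
import sys

def do_bad_things():
    b = b'a'
    # View the whole object as raw bytes; the payload byte sits just before the trailing NUL.
    (ctypes.c_ubyte * sys.getsizeof(b)).from_address(id(b))[-2] = 98  # ord('b')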
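
The [-2] index works because, on CPython, sys.getsizeof(b) covers the object header, the payload and one trailing NUL byte, so the single payload byte is second from the end. A minimal read-only sketch of that layout follows. It assumes CPython (where id() is the object's address) and that bytes.__basicsize__ is the header size plus the trailing NUL. The names PAYLOAD_OFFSET, payload_address and peek are made up for illustration.

```python
import ctypes

# Assumption: CPython, where id(obj) is the object's memory address and
# bytes.__basicsize__ is the header size plus one byte for the trailing NUL.
PAYLOAD_OFFSET = bytes.__basicsize__ - 1


def payload_address(b: bytes) -> int:
    """Address of the first payload byte of a bytes object."""
    return id(b) + PAYLOAD_OFFSET


def peek(b: bytes, index: int = 0) -> int:
    """Read one payload byte straight from memory, without writing anything."""
    if not 0 <= index < len(b):
        raise IndexError("index out of range")
    return ctypes.c_ubyte.from_address(payload_address(b) + index).value


sample = b'hello'
# Compare the raw memory read with normal indexing.
print(peek(sample, 1) == sample[1])
```

For a one-byte object, PAYLOAD_OFFSET equals sys.getsizeof(b) - 2, which is the slot the [-2] index writes to.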

Obviously, by misusing C functionality, do_bad_things changes the immutable (or so CPython thinks) object b'a' into b'b', and because this bytes object is interned, bad things happen afterwards:

>>> do_bad_things()  # b'a' now means b'b'
>>> b'a' == b'b'     # wait for a surprise
True
>>> print(b'a') #another one
b'b'

Is it possible to restore/clear the bytes object pool, so that b'a' means b'a' once again?

A little side note: it seems that not every way of creating bytes uses this pool. For example:

>>> do_bad_things()
>>> print(b'a')
b'b'
>>> print((97).to_bytes(1, byteorder='little')) #ord('a')=97
b'a'
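
To see which ways of building a one-byte bytes hand back the pooled object, an identity check is enough and does not corrupt anything. This is only a sketch: the results are CPython implementation details and may differ between versions, so no expected output is given. The names cached, candidates and uses_cache are made up for illustration.

```python
# The literal is expected to come from the single-byte pool, per the question.
cached = b'a'

# Different ways of producing a one-byte bytes object with value 97.
candidates = {
    "literal": lambda: b'a',
    "bytes([97])": lambda: bytes([97]),
    "int.to_bytes": lambda: (97).to_bytes(1, byteorder='little'),
    "slice": lambda: b'abc'[:1],
}


def uses_cache(make) -> bool:
    """True if make() returns the very same object as the cached literal."""
    return make() is cached


for name, make in candidates.items():
    print(name, uses_cache(make))
```

Any entry that prints False is a construction path that would keep yielding a correct b'a' even while the pooled object is corrupted.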

Solution

Python 3 doesn't intern bytes objects the way it does str. Instead, it keeps a static array of them the way it does with int.

This is very different under the covers. On the down side, it means there's no table (with an API) to be manipulated. On the up side, it means that if you can find the static array, you can fix it, the same way you would for ints, because the array index and the character value of the string are supposed to be identical.

If you look in bytesobject.c, the array is declared at the top:

static PyBytesObject *characters[UCHAR_MAX + 1];

… and then, for example, within PyBytes_FromStringAndSize:

if (size == 1 && str != NULL &&
    (op = characters[*str & UCHAR_MAX]) != NULL)
{
#ifdef COUNT_ALLOCS
    one_strings++;
#endif
    Py_INCREF(op);
    return (PyObject *)op;
}

Notice that the array is static, so it's not accessible from outside this file, and that it's still refcounting the objects, so callers (even internal stuff in the interpreter, much less your C API extension) can't tell that there's anything special going on.

So, there's no "correct" way to clean this up.

But if you want to get hacky…

If you can get a reference to each of the single-char bytes, and you know which character each one was supposed to be, you can clean up the whole thing.

Unless you've screwed up even more than you think, you can just construct a one-char bytes. PyBytes_FromStringAndSize("a", 1) is going to return the object that's supposed to be 'a', even if it happens to actually hold 'b'. How do we know that? Because that's exactly the problem that you're trying to fix.

The same goes for any other value, for example \x80:

PyBytesObject *byte80 = (PyBytesObject *)PyBytes_FromStringAndSize("\x80", 1);

One thing to watch: what comes back is the pointer stored in characters[0x80], not the address of that slot, so subtracting 0x80 from it does not get you to the start of the array. That's fine, because you don't need the array itself; you can ask for each of the 256 objects the same way.

The only other caveat is that if you try to do this from Python with ctypes instead of from C code, it would require some extra care,¹ but since you're not using ctypes, let's not worry about that.

So, now we can get at every entry in the table. We can't just delete the objects to "unintern" them, because that will hose anyone who has a reference to any of them, and probably lead to a segfault. But we don't have to. Any object that's in the table, we know what it's supposed to be: the entry for i is supposed to be a one-char bytes whose one character is i. So just set it back to that, with a loop something like this:

for (int i = 0; i <= UCHAR_MAX; i++) {
    char c = (char)i;
    /* the cached object for byte i, or a fresh (already correct) one */
    PyObject *op = PyBytes_FromStringAndSize(&c, 1);
    if (op == NULL)
        break;  /* allocation failed; an exception is set */
    /* the same hacky stuff you did to break the string in the first place */
    PyBytes_AS_STRING(op)[0] = c;
    Py_DECREF(op);
}
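
For readers who would rather attempt the repair from Python, here is a hedged ctypes sketch of the same idea. It does not try to locate the static array; it asks PyBytes_FromStringAndSize for each one-byte object and rewrites the payload only when it is wrong. It passes a POINTER(c_ubyte) buffer rather than a bytes literal, as footnote 1 suggests. Assumptions: CPython, id() being the object's address, and the payload starting at bytes.__basicsize__ - 1. The name repair_single_byte_cache is made up for illustration.

```python
import ctypes

# Call the C API directly, with a raw byte buffer instead of a bytes literal,
# so a corrupted literal cannot sneak into the lookup.
_from_string_and_size = ctypes.pythonapi.PyBytes_FromStringAndSize
_from_string_and_size.argtypes = [ctypes.POINTER(ctypes.c_ubyte), ctypes.c_ssize_t]
_from_string_and_size.restype = ctypes.py_object

# Assumption: the payload starts right after the header (CPython layout).
PAYLOAD_OFFSET = bytes.__basicsize__ - 1


def repair_single_byte_cache():
    """Restore every one-byte bytes object to its own value.

    Returns the list of byte values that actually had to be rewritten.
    """
    repaired = []
    for value in range(256):
        buf = (ctypes.c_ubyte * 1)(value)
        obj = _from_string_and_size(buf, 1)  # cached object, if there is one
        cell = ctypes.c_ubyte.from_address(id(obj) + PAYLOAD_OFFSET)
        if cell.value != value:  # only write when the byte is wrong
            cell.value = value
            repaired.append(value)
    return repaired


print(repair_single_byte_cache())
```

On a healthy interpreter nothing is written and the returned list is empty. The sketch does not reset a hash that was cached while an object was corrupted, so treat it as a starting point rather than a complete fix.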

That's all there is to it.

Well, except for compilation.²

Fortunately, at the interactive interpreter, each complete top-level statement is its own compilation unit, so… you should be OK with any new line you type after running the fix.

But a module you've imported, that had to be compiled, while you had the broken strings? You've probably screwed up its constants. And I can't think of a good way to clean this up except to forcibly recompile and reimport every module.
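
One way to force a single module to be recompiled is to fetch its source and re-execute it in the existing module namespace, much as importlib.reload does, but without going through a cached .pyc file. This is a sketch under the assumption that bytecode cached while the objects were corrupted may itself be suspect. The helper name recompile_module is made up, and it only works for modules whose loader can provide source.

```python
def recompile_module(module):
    """Re-execute a module from freshly compiled source, bypassing cached bytecode."""
    spec = getattr(module, "__spec__", None)
    loader = getattr(spec, "loader", None)
    if spec is None or not hasattr(loader, "get_source"):
        raise ValueError(f"cannot locate source for {module!r}")
    source = loader.get_source(spec.name)
    if source is None:
        # Built-in, frozen and extension modules have no Python source.
        raise ValueError(f"no Python source available for {spec.name}")
    code = compile(source, spec.origin or "<unknown>", "exec")
    exec(code, module.__dict__)  # same namespace, like importlib.reload
    return module
```

It would have to be applied to every affected module in sys.modules after the pool has been repaired, and objects created before the recompile still hold references to the old constants.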

¹ The compiler might turn your b'\x80' argument into the wrong thing before it even gets to the C call. And you'd be surprised at all the places you think you're passing around a c_char_p and it's actually getting magically converted to and from bytes. Probably better to use a POINTER(c_uint8).

² If you compiled some code with b'a' in it, the consts array should have a reference to b'a', which will get fixed. But, since bytes are known immutable to the compiler, if it knows that b'a' == b'b', it may actually store the pointer to the b'b' singleton instead, for the same reason that 123456 is 123456 is true, in which case fixing b'a' may not actually solve the problem.
