Is it possible to "hack" Python's print function?

Problem description

Note: This question is for informational purposes only. I am interested to see how deep into Python's internals it is possible to go with this.

Not very long ago, a discussion began under a certain question about whether the strings passed to print could be modified during or after the call to print. For example, consider the function:

def print_something():
    print('This cat was scared.')

Now, when print runs, the output on the terminal should be:

This dog was scared.

Notice the word "cat" has been replaced by the word "dog". Something somewhere somehow was able to modify those internal buffers to change what was printed. Assume this is done without the original code author's explicit permission (hence, hacking/hijacking).

This comment from the wise @abarnert, in particular, got me thinking:

There are a couple of ways to do that, but they're all very ugly, and should never be done. The least ugly way is to probably replace the code object inside the function with one with a different co_consts list. Next is probably reaching into the C API to access the str's internal buffer. [...]

So, it looks like this is actually possible.

Here's my naive way of approaching this problem:

>>> import inspect
>>> exec(inspect.getsource(print_something).replace('cat', 'dog'))
>>> print_something()
This dog was scared.

Of course, exec is bad, but that doesn't really answer the question, because it does not actually modify anything during or after the call to print.

How would it be done as @abarnert has explained it?

Solution

First, there's actually a much less hacky way. All we want to do is change what print prints, right?

_print = print
def print(*args, **kw):
    args = (arg.replace('cat', 'dog') if isinstance(arg, str) else arg
            for arg in args)
    _print(*args, **kw)

Or, similarly, you can monkeypatch sys.stdout instead of print.
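
The sys.stdout variant could be sketched roughly as follows. This is only an illustration: ReplacingStream and demo are hypothetical names that do not appear in the answer, and the wrapper only catches a match that arrives within a single write call.

```python
import io
import sys


class ReplacingStream:
    """Hypothetical wrapper that rewrites text on its way to a real stream."""

    def __init__(self, wrapped, old, new):
        self._wrapped = wrapped
        self._old = old
        self._new = new

    def write(self, text):
        # print() ends up calling write(); substitute before forwarding.
        return self._wrapped.write(text.replace(self._old, self._new))

    def __getattr__(self, name):
        # Delegate flush(), encoding, and so on to the wrapped stream.
        return getattr(self._wrapped, name)


def demo():
    # Write into a StringIO so the effect can be inspected afterwards.
    buffer = io.StringIO()
    original = sys.stdout
    sys.stdout = ReplacingStream(buffer, "cat", "dog")
    try:
        print("This cat was scared.")
    finally:
        sys.stdout = original
    return buffer.getvalue()
```

To hijack the real terminal output instead, the same wrapper would be installed with sys.stdout = ReplacingStream(sys.stdout, "cat", "dog").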


Also, nothing wrong with the exec … getsource … idea. Well, of course there's plenty wrong with it, but less than what follows here…


But if you do want to modify the function object's code constants, we can do that.

If you really want to play around with code objects for real, you should use a library like bytecode (when it's finished) or byteplay (until then, or for older Python versions) instead of doing it manually. Even for something this trivial, the CodeType initializer is a pain; if you actually need to do stuff like fixing up lnotab, only a lunatic would do that manually.

Also, it goes without saying that not all Python implementations use CPython-style code objects. This code will work in CPython 3.7, and probably all versions back to at least 2.2 with a few minor changes (and not the code-hacking stuff, but things like generator expressions), but it won't work with any version of IronPython.
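
Before rebuilding anything, it can help to see where the literal actually lives. The following is a small read-only sketch using the standard dis module; the exact index of the string in co_consts and the opcode names shown vary between CPython versions, so nothing here should be read as fixed.

```python
import dis


def print_function():
    print("This cat was scared.")


co = print_function.__code__

# The literal is stored in the code object's constants tuple.
literal_indexes = [i for i, c in enumerate(co.co_consts)
                   if isinstance(c, str)]

# Names of the instructions whose resolved argument is that literal.
loads = [ins.opname for ins in dis.get_instructions(co)
         if ins.argval == "This cat was scared."]

print(co.co_consts)
print(literal_indexes, loads)
```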

import types

def print_function():
    print("This cat was scared.")

def main():
    # A function object is a wrapper around a code object, with
    # a bit of extra stuff like default values and closure cells.
    # See inspect module docs for more details.
    co = print_function.__code__
    # A code object is a wrapper around a string of bytecode, with a
    # whole bunch of extra stuff, including a list of constants used
    # by that bytecode. Again see inspect module docs. Anyway, inside
    # the bytecode for the function (which you can read by typing
    # dis.dis(print_function) in your REPL), there's going to be an
    # instruction like LOAD_CONST 1 to load the string literal onto
    # the stack to pass to the print function, and that works by just
    # reading co.co_consts[1]. So, that's what we want to change.
    consts = tuple(c.replace("cat", "dog") if isinstance(c, str) else c
                   for c in co.co_consts)
    # Unfortunately, code objects are immutable, so we have to create
    # a new one, copying over everything except for co_consts, which
    # we'll replace. And the initializer has a zillion parameters.
    # Try help(types.CodeType) at the REPL to see the whole list.
    co = types.CodeType(
        co.co_argcount, co.co_kwonlyargcount, co.co_nlocals,
        co.co_stacksize, co.co_flags, co.co_code,
        consts, co.co_names, co.co_varnames, co.co_filename,
        co.co_name, co.co_firstlineno, co.co_lnotab,
        co.co_freevars, co.co_cellvars)
    print_function.__code__ = co
    print_function()

main()
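
On CPython 3.8 and later, the positional constructor call above no longer matches the CodeType signature (a co_posonlyargcount parameter was added), but code objects gained a replace() method that copies everything except the fields you name. A minimal sketch of the same trick on those versions, where swap_constants is a hypothetical helper name:

```python
def print_function():
    print("This cat was scared.")


def swap_constants(func, old, new):
    """Rebuild func's code object with `old` replaced in its string constants."""
    co = func.__code__
    consts = tuple(c.replace(old, new) if isinstance(c, str) else c
                   for c in co.co_consts)
    # CodeType.replace() (CPython 3.8+) returns a copy with the given
    # fields overridden, so the long constructor call is not needed.
    func.__code__ = co.replace(co_consts=consts)


swap_constants(print_function, "cat", "dog")
print_function()
```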

What could go wrong with hacking up code objects? Mostly just segfaults, RuntimeErrors that eat up the whole stack, more normal RuntimeErrors that can be handled, or garbage values that will probably just raise a TypeError or AttributeError when you try to use them. For example, try creating a code object with just a RETURN_VALUE with nothing on the stack (bytecode b'S\0' for 3.6+, b'S' before), or with an empty tuple for co_consts when there's a LOAD_CONST 0 in the bytecode, or with varnames decremented by 1 so the highest LOAD_FAST actually loads a freevar/cellvar cell. For some real fun, if you get the lnotab wrong enough, your code will only segfault when run in the debugger.

Using bytecode or byteplay won't protect you from all of those problems, but they do have some basic sanity checks, and nice helpers that let you do things like insert a chunk of code and let it worry about updating all offsets and labels so you can't get it wrong, and so on. (Plus, they keep you from having to type in that ridiculous 6-line constructor, and having to debug the silly typos that come from doing so.)


Now on to #2.

I mentioned that code objects are immutable. And of course the consts are a tuple, so we can't change that directly. And the thing in the const tuple is a string, which we also can't change directly. That's why I had to build a new string to build a new tuple to build a new code object.

But what if you could change a string directly?

Well, deep enough under the covers, everything is just a pointer to some C data, right? If you're using CPython, there's a C API to access the objects, and you can use ctypes to access that API from within Python itself, which is such a terrible idea that they put a pythonapi right there in the stdlib's ctypes module. :) The most important trick you need to know is that id(x) is the actual pointer to x in memory (as an int).
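
As a harmless illustration of both points, here is a sketch that turns an id back into the object and calls one C API function through ctypes.pythonapi. It assumes CPython; other implementations make no promise that id() is an address.

```python
import ctypes

message = "This cat was scared."

# In CPython, id() is the address of the object, so casting that integer
# to a py_object pointer hands back the very same object.
address = id(message)
recovered = ctypes.cast(address, ctypes.py_object).value

# ctypes.pythonapi exposes the C API of the running interpreter.
# PyUnicode_GetLength takes a PyObject* and returns a Py_ssize_t.
get_length = ctypes.pythonapi.PyUnicode_GetLength
get_length.argtypes = [ctypes.py_object]
get_length.restype = ctypes.c_ssize_t
length = get_length(message)

print(recovered is message, length)
```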

Unfortunately, the C API for strings won't let us safely get at the internal storage of an already-frozen string. So screw safely, let's just read the header files and find that storage ourselves.

If you're using CPython 3.4 - 3.7 (it's different for older versions, and who knows for the future), a string literal from a module that's made of pure ASCII is going to be stored using the compact ASCII format, which means the struct ends early and the buffer of ASCII bytes follows immediately in memory. This will break (as in probably segfault) if you put a non-ASCII character in the string, or certain kinds of non-literal strings, but you can read up on the other 4 ways to access the buffer for different kinds of strings.
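
To see that layout without writing to anything, the buffer can be read back from memory. This sketch does not use the struct definitions from the answer; it relies on the assumption that, for a compact ASCII str, sys.getsizeof reports the header size plus the character bytes plus one terminating NUL. peek_ascii_buffer is a hypothetical helper name.

```python
import ctypes
import sys


def peek_ascii_buffer(s):
    """Read the character bytes stored right after a compact ASCII str header."""
    if type(s) is not str or not s.isascii():
        raise ValueError("only plain ASCII str objects are handled here")
    # Assumption: total size = header + len(s) bytes + 1 NUL terminator,
    # so the header size falls out by subtraction.
    header_size = sys.getsizeof(s) - len(s) - 1
    # id(s) is the object's address in CPython; the buffer follows the header.
    return ctypes.string_at(id(s) + header_size, len(s))


print(peek_ascii_buffer("This cat was scared."))
```

Writing through the same address is what the mutation trick amounts to, which is why it breaks as soon as the string is stored in a different format.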

To make things slightly easier, I'm using the superhackyinternals project off my GitHub. (It's intentionally not pip-installable because you really shouldn't be using this except to experiment with your local build of the interpreter and the like.)

import ctypes
import internals # https://github.com/abarnert/superhackyinternals/blob/master/internals.py

def print_function():
    print("This cat was scared.")

def main():
    for c in print_function.__code__.co_consts:
        if isinstance(c, str):
            idx = c.find('cat')
            if idx != -1:
                # Too much to explain here; just guess and learn to
                # love the segfaults...
                p = internals.PyUnicodeObject.from_address(id(c))
                assert p.compact and p.ascii
                addr = id(c) + internals.PyUnicodeObject.utf8_length.offset
                buf = (ctypes.c_int8 * 3).from_address(addr + idx)
                buf[:3] = b'dog'

    print_function()

main()

If you want to play with this stuff, int is a whole lot simpler under the covers than str. And it's a lot easier to guess what you can break by changing the value of 2 to 1, right? Actually, forget imagining, let's just do it (using the types from superhackyinternals again):

>>> n = 2
>>> pn = PyLongObject.from_address(id(n))
>>> pn.ob_digit[0]
2
>>> pn.ob_digit[0] = 1
>>> 2
1
>>> n * 3
3
>>> i = 10
>>> while i < 40:
...     i *= 2
...     print(i)
10
10
10

… pretend that code box has an infinite-length scrollbar.

I tried the same thing in IPython, and the first time I tried to evaluate 2 at the prompt, it went into some kind of uninterruptible infinite loop. Presumably it's using the number 2 for something in its REPL loop, while the stock interpreter isn't?
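
The reason a single poke has such a wide blast radius is that CPython keeps one shared object for each small integer (currently -5 through 256), so every 2 in the process is the same object. A safe way to observe that sharing, assuming CPython:

```python
# Build the values at run time so the compiler cannot fold them together.
a = int("2")
b = int("1") + int("1")
small_shared = a is b        # both names refer to the cached 2 object

big_a = int("1000")
big_b = int("999") + int("1")
big_shared = big_a is big_b  # larger ints are typically separate objects

print(small_shared, big_shared)
```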
