Subprocess, repeatedly write to STDIN while reading from STDOUT (Windows)


Question

I want to call an external process from Python. The process reads an input string, prints the tokenized result, and then waits for the next input (the binary is the MeCab tokenizer, if that helps).

I need to tokenize thousands of lines of text by calling this process.

The problem is that Popen.communicate() works, but it waits for the process to exit before returning the STDOUT result. I don't want to close and reopen a subprocess thousands of times. (And I don't want to send the whole text at once; it may easily grow to tens of thousands of long lines in the future.)

from subprocess import PIPE, Popen

with Popen("mecab -O wakati".split(), stdin=PIPE,
           stdout=PIPE, stderr=PIPE, close_fds=False,
           universal_newlines=True, bufsize=1) as proc:
    output, errors = proc.communicate("foobarbaz")

print(output)

I've tried calling proc.stdout.read() instead of communicate(), but it blocks while stdin is open and doesn't return any results until proc.stdin.close() is called. That again means I would need to create a new process every time.

I've also tried to implement queues and threads, following a similar question, as shown below. Either nothing comes back and the code is stuck in the while True loop, or, when I force the stdin buffer to fill by repeatedly sending strings, all the results are output at once.

from subprocess import PIPE, Popen
from threading import Thread
from queue import Queue, Empty

def enqueue_output(out, queue):
    # universal_newlines=True yields str, so the EOF sentinel is '' rather than b''
    for line in iter(out.readline, ''):
        queue.put(line)
    out.close()

p = Popen('mecab -O wakati'.split(), stdout=PIPE, stdin=PIPE,
          universal_newlines=True, bufsize=1, close_fds=False)
q = Queue()
t = Thread(target=enqueue_output, args=(p.stdout, q))
t.daemon = True
t.start()

p.stdin.write("foobarbaz\n")  # the tokenizer reads line by line, so end the input with a newline
p.stdin.flush()
while True:
    try:
        line = q.get_nowait()
    except Empty:
        pass
    else:
        print(line)
        break

I also looked at the Pexpect route, but its Windows port doesn't support some important modules (the pty-based ones), so I couldn't use that either.

I know there are a lot of similar answers, and I've tried most of them. But nothing I've tried seems to work on Windows.

Some info on the binary I'm using: when I run it from the command line, it tokenizes each sentence I give it until I'm done and forcibly close the program.

(...waits_for_input -> input_received -> output -> waits_for_input...)

Thanks.

Recommended answer

If mecab uses C FILE streams with default buffering, then piped stdout has a 4 KiB buffer. The idea is that a program can efficiently make small, arbitrary-sized reads and writes against the buffers, while the underlying standard I/O implementation automatically fills and flushes the much larger buffers. This minimizes the number of system calls and maximizes throughput. Obviously you don't want this behavior for interactive console or terminal I/O, or for writing to stderr. In those cases the C runtime uses line buffering or no buffering. That is why the results only appear once the buffer fills or the process exits.
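The switch between terminal and pipe behavior can be observed from Python itself. The sketch below is only an illustration and does not involve mecab: it assumes that Python's own sys.stdout follows the same convention as the C runtime (line-buffered on a terminal, block-buffered otherwise). The names CHILD_CODE and child_stdout_mode are made up for this example.

```python
import subprocess
import sys

# Hypothetical child program: reports how it sees its own stdout.
CHILD_CODE = "import sys; print(sys.stdout.isatty(), sys.stdout.line_buffering)"


def child_stdout_mode():
    """Run a child with stdout redirected to a pipe and return what it reports."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return completed.stdout.strip()


if __name__ == "__main__":
    # With a pipe as stdout the child sees no terminal and does not line-buffer.
    print(child_stdout_mode())
```

Run from a script, the child reports that its stdout is neither a TTY nor line-buffered, which is the situation mecab is in when its stdout is a pipe.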

A program can override this behavior, and some do have command-line options to set the buffer size. For example, Python has the "-u" (unbuffered) option and the PYTHONUNBUFFERED environment variable. If mecab doesn't have a similar option, then there isn't a generic workaround on Windows. The C runtime situation is too complicated: a Windows process can link statically or dynamically to one or several CRTs. The situation on Linux is different, since a Linux process generally loads a single system CRT (e.g. GNU libc.so.6) into the global symbol table, which allows an LD_PRELOAD library to configure the C FILE streams. Linux stdbuf uses this trick, e.g. stdbuf -o0 mecab -O wakati.
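When the child does offer an unbuffered mode, the write-one-line, read-one-line loop from the question works over plain pipes. The following is a minimal sketch under that assumption. It uses a Python child started with "-u" as a stand-in for the tokenizer, because whether mecab has an equivalent option is not established here. CHILD_CODE and tokenize_lines are hypothetical names, and the "tokenizer" simply splits each line into characters.

```python
import subprocess
import sys

# Hypothetical stand-in for the tokenizer: echoes each input line
# with its characters separated by spaces.
CHILD_CODE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(' '.join(line.strip()))\n"
)


def tokenize_lines(lines):
    """Send lines one at a time to a single child and read one reply per line."""
    # -u makes the child's stdout unbuffered, so each reply reaches the pipe
    # immediately instead of waiting in the child's output buffer.
    proc = subprocess.Popen(
        [sys.executable, "-u", "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        bufsize=1,
    )
    results = []
    try:
        for line in lines:
            proc.stdin.write(line + "\n")  # the child reads whole lines
            proc.stdin.flush()
            results.append(proc.stdout.readline().rstrip("\n"))
    finally:
        proc.stdin.close()  # EOF lets the child exit
        proc.wait()
        proc.stdout.close()
    return results


if __name__ == "__main__":
    print(tokenize_lines(["foo", "bar"]))
```

The same process stays alive for every line, which is the behavior the question asks for. If "-u" is removed from the command, the first readline() would be expected to block, because the child's replies stay in its buffer until it fills or the child exits.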

One option to experiment with is to call CreateConsoleScreenBuffer and get a file descriptor for the handle from msvcrt.open_osfhandle, then pass this as stdout instead of using a pipe. The child process will see it as a TTY and use line buffering instead of full buffering. However, managing this is non-trivial. It would involve reading (i.e. ReadConsoleOutputCharacter) a sliding buffer (call GetConsoleScreenBufferInfo to track the cursor position) that's actively being written to by another process. This kind of interaction isn't something that I've ever needed or even experimented with. But I have used a console screen buffer non-interactively, i.e. reading the buffer after the child has exited. This allows reading up to 9,999 lines of output from programs that write directly to the console instead of stdout, e.g. programs that call WriteConsole or open "CON" or "CONOUT$".
