Python Stream Extraction

Question

The standard library of many programming languages includes a "scanner API" to extract strings, numbers, or other objects from text input streams. (For example, Java includes the Scanner class, C++ includes istream, and C includes scanf).

What is the equivalent of this in Python?

Python has a stream interface, i.e. classes that inherit from io.IOBase. However, the Python TextIOBase stream interface only provides facilities for line-oriented input. After reading the documentation and searching on Google, I can't find anything in the standard Python modules that would let me, for example, extract an integer from a text stream, or extract the next space-delimited word as a string. Are there any standard facilities to do this?

Solution

There is no equivalent of fscanf or Java's Scanner in the standard library. The simplest solution is to require newline-separated input instead of space-separated input; you can then read line by line and convert each line to the correct type.
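A minimal sketch of that line-oriented approach, assuming each line holds exactly one value and the caller knows the expected types in order. The helper name `read_typed_lines` is hypothetical and not part of the standard library.

```python
from io import StringIO


def read_typed_lines(stream, converters):
    """Read one line per converter and return the converted values.

    `read_typed_lines` is an illustrative helper, not a standard-library API.
    """
    values = []
    for convert in converters:
        line = stream.readline()
        if not line:  # readline() returns '' at end of stream
            raise ValueError('Missing fields.')
        values.append(convert(line.strip()))
    return values


stream = StringIO('1234\nHello\n-13.48\n')
result = read_typed_lines(stream, [int, str, float])
print(result)
```

The same function should work with any text stream that offers readline(), such as sys.stdin or a file opened in text mode.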

If you want the user to provide more structured input, you should probably write a parser for it. There are some nice parsing libraries for Python, for example pyparsing. There is also a scanf module, although it was last updated in 2008.

If you don't want external dependencies, you can use regexes to match the input sequences. Regexes only operate on strings, not on streams, but you can easily work around this limitation by reading in chunks. For example, something like this should work well most of the time:

import re


# Converter applied to the text captured for each format specifier.
FORMATS_TYPES = {
    'd': int,
    'f': float,
    's': str,
}


# Each pattern skips leading whitespace and captures a single token.
FORMATS_REGEXES = {
    'd': re.compile(r'\s*([+-]?\d+)'),
    'f': re.compile(r'\s*([+-]?\d+\.?\d*)'),
    's': re.compile(r'\s*(\w+)'),
}


FORMAT_FIELD_REGEX = re.compile(r'%(s|d|f)')


def scan_input(format_string, stream, chunk_size=1024):
    """Scan a text stream and yield one converted value per format field."""

    chunk = ''
    for field in FORMAT_FIELD_REGEX.findall(format_string):
        field_regex = FORMATS_REGEXES[field]
        match = field_regex.search(chunk)
        # A match that is missing, or that touches the end of the buffer,
        # may be incomplete: read more data and search again.
        while match is None or match.end() >= len(chunk):
            data = stream.read(chunk_size)
            if not data:  # read() returns '' at end of stream
                break
            chunk += data
            match = field_regex.search(chunk)
        if match is None:
            raise ValueError('Missing fields.')
        yield FORMATS_TYPES[field](match.group(1))
        # Drop the consumed text so the next field starts after it.
        chunk = chunk[match.end():]

Example usage:

>>> from io import StringIO
>>> s = StringIO('1234 Hello World -13.48 -678 12.45')
>>> for data in scan_input('%d %s %s %f %d %f', s):
...     print(repr(data))
...
1234
'Hello'
'World'
-13.48
-678
12.45

You'll probably have to extend this and test it properly, but it should give you some ideas.
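If all you need is the "next space-delimited word" from the question, a simpler variant of the same chunked-reading idea may be enough. The following is a rough sketch, not a standard facility: `iter_tokens` is a hypothetical name, and it assumes tokens are separated by any whitespace and that the stream is a text stream whose read() returns an empty string at end of input.

```python
from io import StringIO


def iter_tokens(stream, chunk_size=1024):
    """Yield whitespace-delimited tokens from a text stream, read in chunks."""
    tail = ''  # possibly incomplete token left over from the previous chunk
    while True:
        data = stream.read(chunk_size)
        if not data:  # read() returns '' at end of stream
            break
        buffer = tail + data
        tokens = buffer.split()
        # If the buffer does not end in whitespace, its last token may
        # continue in the next chunk, so hold it back.
        if tokens and not buffer[-1].isspace():
            tail = tokens.pop()
        else:
            tail = ''
        yield from tokens
    if tail:
        yield tail


# A tiny chunk size forces tokens to straddle chunk boundaries.
tokens = iter_tokens(StringIO('1234 Hello  World\n-13.48'), chunk_size=4)
number = int(next(tokens))   # caller converts each token as needed
word = next(tokens)
rest = list(tokens)
print(number, word, rest)
```

Because this is a generator, the caller pulls one token at a time with next() and applies int, float, or str itself, which is roughly how Java's Scanner is used.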
