How to read records terminated by a custom separator from a file in Python?

Problem description

I would like a way to do for line in file in Python, where the end of line is redefined to be any string I want. In other words, I want to read records from a file rather than lines, and I want that to be as fast and convenient as reading lines.

This is the Python equivalent of setting Perl's $/ input record separator, or of using Scanner in Java. It doesn't necessarily have to use for line in file (in particular, the iterator need not be a file object); I just want something equivalent that avoids reading too much data into memory.

See also: Add support for reading records with arbitrary separators to the standard IO stack

Solution

There is nothing in the Python 2.x file object, or the Python 3.3 io classes, that lets you specify a custom delimiter for readline. (The for line in file is ultimately using the same code as readline.)

But it's pretty easy to build it yourself. For example:

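def delimited(file, delimiter='\n', bufsize=4096):
    buf = ''
    while True:
        newbuf = file.read(bufsize)
        if not newbuf:
            # End of file: whatever is left is the last record.
            yield buf
            return
        buf += newbuf
        lines = buf.split(delimiter)
        # Every piece except the last is a complete record.
        for line in lines[:-1]:
            yield line
        # The last piece may be incomplete; keep it for the next read.
        buf = lines[-1]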


Here's a stupid example of it in action:

>>> import io
>>> s = io.StringIO('abcZZZdefZZZghiZZZjklZZZmnoZZZpqr')
>>> d = delimited(s, 'ZZZ', bufsize=2)
>>> list(d)
['abc', 'def', 'ghi', 'jkl', 'mno', 'pqr']


If you want to get it right for both binary and text files, especially in 3.x, it's a bit trickier. But if it only has to work for one or the other (and for one Python version or the other), you can ignore that.
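
One possible way to cover both cases in Python 3 is to build the empty buffer from the delimiter's own type, so the same generator works on str and bytes. This is only a sketch: the name delimited_any is made up for illustration, and it assumes the caller passes a delimiter of the same type that file.read() returns (str for text mode, bytes for binary mode).

```python
import io


def delimited_any(file, delimiter, bufsize=4096):
    """Yield records from a text or binary file-like object.

    Hypothetical helper: the delimiter must match the type returned by
    file.read(), i.e. str for text mode and bytes for binary mode.
    """
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    # An empty value of the delimiter's own type: '' for str, b'' for bytes.
    buf = delimiter[:0]
    while True:
        chunk = file.read(bufsize)
        if not chunk:
            # End of input: whatever is left is the final record.
            yield buf
            return
        # Mixing str and bytes raises TypeError here, which flags a mismatch.
        buf += chunk
        # All pieces but the last are complete; the last may be partial.
        *records, buf = buf.split(delimiter)
        yield from records
```

For example, list(delimited_any(io.BytesIO(b'abc\x00def'), b'\x00')) and list(delimited_any(io.StringIO('abcZZZdef'), 'ZZZ')) both go through the same code path, one with bytes and one with str.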

Likewise, if you're using Python 3.x (or using io objects in Python 2.x), and want to make use of the buffers that are already being maintained in a BufferedIOBase instead of just putting a buffer on top of the buffer, that's trickier. The io docs do explain how to do everything… but I don't know of any simple examples, so you're really going to have to read at least half of that page and skim the rest. (Of course, you could just use the raw files directly… but not if you want to find Unicode delimiters…)
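
As a rough illustration of reusing the existing buffer, the sketch below leans on io.BufferedReader.peek(), which returns bytes already held in the reader's buffer without consuming them. It is deliberately limited: the name delimited_peek is hypothetical, it assumes a blocking binary stream, and it only handles a single-byte delimiter, because a longer delimiter could straddle two buffer refills and would need extra bookkeeping.

```python
import io


def delimited_peek(buffered, delimiter=b'\n'):
    """Yield records from an io.BufferedReader, split on a one-byte delimiter.

    Hypothetical helper; assumes a blocking stream, so an empty peek()
    result is treated as end of input.
    """
    if len(delimiter) != 1:
        raise ValueError("this sketch only handles a single-byte delimiter")
    parts = []  # pieces of the record currently being assembled
    while True:
        # Look at what the reader has buffered; at most one raw read is done.
        chunk = buffered.peek(1)
        if not chunk:
            # End of input: emit the final (possibly empty) record.
            yield b''.join(parts)
            return
        idx = chunk.find(delimiter)
        if idx < 0:
            # No delimiter in view: consume everything we peeked at.
            parts.append(buffered.read(len(chunk)))
        else:
            # Consume up to and including the delimiter, then drop it.
            parts.append(buffered.read(idx + 1)[:-1])
            yield b''.join(parts)
            parts = []
```

A file opened with open(path, 'rb') is normally an io.BufferedReader, so it can be passed in directly; an in-memory stream can be wrapped as io.BufferedReader(io.BytesIO(data)).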
