How to read records terminated by a custom separator from a file in Python?

Question

I would like a way to do for line in file in Python, where the end of line is redefined to be any string that I want. In other words, I want to read records from a file rather than lines, and I want it to be as fast and convenient as reading lines.

This is the Python equivalent of setting Perl's $/ input record separator, or of using Scanner in Java. It doesn't necessarily have to use for line in file (in particular, the iterator may not be a file object). I just need something equivalent that avoids reading too much data into memory.

See also:
Add support for reading records with arbitrary separators to the standard IO stack

Recommended answer

There is nothing in the Python 2.x file object, or the Python 3.3 io classes, that lets you specify a custom delimiter for readline. (The for line in file is ultimately using the same code as readline.)

But it's pretty easy to build it yourself. For example:

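def delimited(file, delimiter='\n', bufsize=4096):
    """Yield records from file, split on delimiter, reading bufsize characters at a time."""
    buf = ''
    while True:
        newbuf = file.read(bufsize)
        if not newbuf:
            # End of file: whatever is left is the final record.
            yield buf
            return
        buf += newbuf
        lines = buf.split(delimiter)
        # Every piece except the last is a complete record.
        for line in lines[:-1]:
            yield line
        # The last piece may be incomplete (or end in part of a delimiter); keep it.
        buf = lines[-1]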

Here's a stupid example of it in action:

>>> import io
>>> s = io.StringIO('abcZZZdefZZZghiZZZjklZZZmnoZZZpqr')
>>> d = delimited(s, 'ZZZ', bufsize=2)
>>> list(d)
['abc', 'def', 'ghi', 'jkl', 'mno', 'pqr']

If you want to get it right for both binary and text files, especially in 3.x, it's a bit trickier. But if it only has to work for one or the other (and one language or the other), you can ignore that.
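
One possible way to cover both cases in Python 3 is to derive the empty starting buffer from the delimiter itself, so a str delimiter gives a str buffer and a bytes delimiter gives a bytes buffer. The sketch below is only an illustration of that idea. The name delimited_any is hypothetical, and it assumes the file's read() returns the same type as the delimiter.

```python
import io


def delimited_any(file, delimiter, bufsize=4096):
    """Yield records from a text or binary file (hypothetical helper).

    Assumes file.read() returns the same type (str or bytes) as delimiter.
    """
    buf = delimiter[:0]  # '' for a str delimiter, b'' for a bytes delimiter
    while True:
        chunk = file.read(bufsize)
        if not chunk:
            # End of file: the remainder is the final record.
            yield buf
            return
        buf += chunk
        # All pieces but the last are complete; keep the last one buffered.
        *records, buf = buf.split(delimiter)
        yield from records


text_records = list(delimited_any(io.StringIO('a,b,c'), ',', bufsize=2))
byte_records = list(delimited_any(io.BytesIO(b'a\r\nb'), b'\r\n', bufsize=1))
```

Mixing types (for example a str delimiter with a file opened in binary mode) raises a TypeError at the first concatenation, which is arguably the right failure for that mistake.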

Likewise, if you're using Python 3.x (or using io objects in Python 2.x), and want to make use of the buffers that are already being maintained in a BufferedIOBase instead of just putting a buffer on top of the buffer, that's trickier. The io docs do explain how to do everything… but I don't know of any simple examples, so you're really going to have to read at least half of that page and skim the rest. (Of course, you could just use the raw files directly… but not if you want to find unicode delimiters…)
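
As a rough illustration of reusing the existing buffer, the sketch below looks at io.BufferedReader's buffered bytes with peek() and consumes exactly as many as it needs with read(), so nothing past a delimiter is pulled out of the reader. This is an assumption-laden sketch rather than a recommended recipe. The name delimited_peek is hypothetical, it handles binary data and a non-empty bytes delimiter only, and unlike delimited above it does not yield an empty final record when the data ends with a delimiter.

```python
import io


def delimited_peek(reader, delimiter):
    """Yield bytes records from an io.BufferedReader (hypothetical helper).

    Assumes delimiter is a non-empty bytes object.
    """
    record = bytearray()
    while True:
        window = reader.peek(1)  # buffered bytes, not consumed yet
        if not window:
            # End of file: emit any unterminated final record.
            if record:
                yield bytes(record)
            return
        # Re-scan the tail of the record too, in case the delimiter
        # straddles the boundary between two buffer fills.
        start = max(0, len(record) - len(delimiter) + 1)
        hit = (record[start:] + window).find(delimiter)
        if hit < 0:
            record += reader.read(len(window))
            continue
        # Consume only up to the end of the delimiter.
        end = start + hit + len(delimiter)
        record += reader.read(end - len(record))
        yield bytes(record[:-len(delimiter)])
        record = bytearray()


# A tiny buffer_size forces the delimiter to straddle buffer fills.
raw = io.BytesIO(b'abcZZZdefZZZghi')
peeked = list(delimited_peek(io.BufferedReader(raw, buffer_size=4), b'ZZZ'))
```

Because the reader's position always sits just after the last delimiter consumed, other code can keep reading from the same reader afterwards.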
