Iterate over the lines of a string

Question

I have a multi-line string defined like this:

foo = """
this is 
a multi-line string.
"""

This string is used as test input for a parser I am writing. The parser function receives a file object as input and iterates over it. It also calls the next() method directly to skip lines, so I really need an iterator as input, not an iterable. I need an iterator that iterates over the individual lines of that string like a file object would over the lines of a text file. I could of course do it like this:

lineiterator = iter(foo.splitlines())

Is there a more direct way of doing this? In this scenario the string has to be traversed once for the splitting, and then again by the parser. It doesn't matter in my test case, since the string is very short there; I am just asking out of curiosity. Python has so many useful and efficient built-ins for such stuff, but I could find nothing that suits this need.
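
To make the iterator-versus-iterable requirement concrete, here is a minimal sketch in Python 3 syntax. The `parse` function is a hypothetical stand-in for the parser described above (it is not the asker's code); it only shows why a plain list from `splitlines()` is not enough when `next()` is called directly.

```python
foo = "\nthis is \na multi-line string.\n"

def parse(lines):
    """Hypothetical parser: skip the first line, then collect the rest."""
    next(lines)  # requires an iterator; a plain list raises TypeError here
    return list(lines)

# A list is iterable, but it is not an iterator, so next() rejects it.
try:
    next(foo.splitlines())
except TypeError:
    list_is_iterator = False
else:
    list_is_iterator = True

# Wrapping the list in iter() gives the parser what it needs.
parsed = parse(iter(foo.splitlines()))
print(list_is_iterator)
print(parsed)
```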

Recommended answer

Here are three possibilities:

foo = """
this is 
a multi-line string.
"""

def f1(foo=foo): return iter(foo.splitlines())

def f2(foo=foo):
    retval = ''
    for char in foo:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

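def f3(foo=foo):
    # Slice between successive '\n' positions located with str.find.
    prevnl = -1
    while True:
        nextnl = foo.find('\n', prevnl + 1)
        if nextnl < 0: break
        yield foo[prevnl + 1:nextnl]
        prevnl = nextnl

if __name__ == '__main__':
    for f in f1, f2, f3:
        print(list(f()))  # parenthesized so it runs on Python 2 and 3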

Running this as the main script confirms the three functions are equivalent. With timeit (and a * 100 for foo to get substantial strings for more precise measurement):

$ python -mtimeit -s'import asp' 'list(asp.f3())'
1000 loops, best of 3: 370 usec per loop
$ python -mtimeit -s'import asp' 'list(asp.f2())'
1000 loops, best of 3: 1.36 msec per loop
$ python -mtimeit -s'import asp' 'list(asp.f1())'
10000 loops, best of 3: 61.5 usec per loop

Note we need the list() call to ensure the iterators are traversed, not just built.
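
The same measurement can be sketched with the `timeit` module from inside Python 3, instead of the command line. This is an illustration under stated assumptions, not the setup used for the numbers above: the loop count of 200 is arbitrary, only `f1` and `f3` are repeated here, and the timings it prints will differ by machine and Python version. It also shows why `list()` matters: calling a generator function only builds the generator.

```python
import timeit

# Same test string as above, repeated 100 times for a more substantial input.
foo = "\nthis is \na multi-line string.\n" * 100

def f1(foo=foo):
    return iter(foo.splitlines())

def f3(foo=foo):
    prevnl = -1
    while True:
        nextnl = foo.find("\n", prevnl + 1)
        if nextnl < 0:
            break
        yield foo[prevnl + 1:nextnl]
        prevnl = nextnl

# Calling f3() runs none of its body yet; list() forces the full traversal.
built_only = f3()
traversed = list(f3())

loops = 200  # arbitrary choice for this sketch
for func in (f1, f3):
    seconds = timeit.timeit(lambda: list(func()), number=loops)
    print(f"{func.__name__}: {seconds / loops * 1e6:.1f} usec per loop")
```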

IOW, the naive implementation is so much faster it isn't even funny: 6 times faster than my attempt with find calls, which in turn is 4 times faster than a lower-level approach.

Lessons to retain: measurement is always a good thing (but must be accurate); string methods like splitlines are implemented in very fast ways; putting strings together by programming at a very low level (esp. by loops of += of very small pieces) can be quite slow.

Edit: added @Jacob's proposal, slightly modified to give the same results as the others (trailing blanks on a line are kept), i.e.:

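from cStringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip('\n')
        else:
            raise StopIteration  # Python 2 idiom; Python 3.7+ (PEP 479) needs `return` here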

Measuring gives:

$ python -mtimeit -s'import asp' 'list(asp.f4())'
1000 loops, best of 3: 406 usec per loop

not quite as good as the .find based approach -- still, worth keeping in mind because it might be less prone to small off-by-one bugs (any loop where you see occurrences of +1 and -1, like my f3 above, should automatically trigger off-by-one suspicions -- and so should many loops which lack such tweaks and should have them -- though I believe my code is also right since I was able to check its output with other functions').
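
Checking one implementation against the others, as described above, can be sketched as a small comparison over several inputs. This is a Python 3 adaptation (so `io.StringIO` replaces `cStringIO`, and the functions take the string as a required argument); the extra test strings are my own additions, not part of the original answer. All three agree on inputs that end with a newline, as the test string does. The last two lines show the kind of edge case such checks can surface: `f3` as written does not yield a final line that lacks a trailing newline.

```python
import io

def f1(foo):
    return iter(foo.splitlines())

def f3(foo):
    prevnl = -1
    while True:
        nextnl = foo.find("\n", prevnl + 1)
        if nextnl < 0:
            break
        yield foo[prevnl + 1:nextnl]
        prevnl = nextnl

def f4(foo):
    stri = io.StringIO(foo)
    while True:
        nl = stri.readline()
        if nl == "":
            break
        yield nl.strip("\n")

# Inputs that end with a newline (or are empty): all three agree.
terminated = ["", "\n", "a\n", "a\n\nb\n", "\nthis is \na multi-line string.\n"]
for text in terminated:
    expected = list(f1(text))
    assert list(f3(text)) == expected, repr(text)
    assert list(f4(text)) == expected, repr(text)

# An input without a trailing newline: f3 stops before the last line.
unterminated = "a\nb"
print(list(f1(unterminated)))  # ['a', 'b']
print(list(f3(unterminated)))  # ['a']
```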

But the split-based approach still rules.

An aside: possibly better style for f4 would be:

from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl == '': break
        yield nl.strip('\n')

at least, it's a bit less verbose. The need to strip trailing \ns unfortunately prohibits the clearer and faster replacement of the while loop with return iter(stri) (the iter part whereof is redundant in modern versions of Python, I believe since 2.3 or 2.4, but it's also innocuous). Maybe worth trying, also:

    return itertools.imap(lambda s: s.strip('\n'), stri)

or variations thereof -- but I'm stopping here since it's pretty much a theoretical exercise wrt the split-based, simplest and fastest, one.
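
For readers on Python 3, where `itertools.imap` and `cStringIO` no longer exist, the variation above might look like the following sketch. It assumes `io.StringIO` as the replacement and relies on the built-in `map` being lazy in Python 3. The names `f5` and `f6` are hypothetical, chosen only to continue the numbering, and no timing claim is made for either.

```python
import io

foo = "\nthis is \na multi-line string.\n"

def f5(foo=foo):
    # Built-in map is lazy in Python 3, playing the role of itertools.imap.
    return map(lambda s: s.strip("\n"), io.StringIO(foo))

def f6(foo=foo):
    # Generator-expression variant; rstrip removes only the trailing newline.
    return (line.rstrip("\n") for line in io.StringIO(foo))

# Both results are true iterators, so next() can be used to skip a line.
lines = f5()
first = next(lines)
rest = list(lines)
print(repr(first))
print(rest)
print(list(f6()) == foo.splitlines())
```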
