Iterate over the lines of a string
Question
I have a multi-line string defined like this:
foo = """
this is
a multi-line string.
"""
We use this string as test input for a parser I am writing. The parser function receives a file object as input and iterates over it. It also calls the next() method directly to skip lines, so I really need an iterator as input, not an iterable.
I need an iterator that iterates over the individual lines of that string, the way a file object would over the lines of a text file. I could of course do it like this:
lineiterator = iter(foo.splitlines())
Is there a more direct way of doing this? In this scenario the string has to be traversed once for the splitting, and then again by the parser. It doesn't matter in my test case, since the string is very short there; I am just asking out of curiosity. Python has so many useful and efficient built-ins for such stuff, but I could find nothing that suits this need.
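To make the iterator-versus-iterable requirement concrete, here is a minimal sketch (written for Python 3, which is an assumption; the question itself predates it). It shows why the list returned by splitlines() is not enough for a parser that calls next(), while iter() over that list is:

```python
foo = """
this is
a multi-line string.
"""

lines = foo.splitlines()      # a list: iterable, but not an iterator
try:
    next(lines)
except TypeError as exc:
    print("a list does not support next():", exc)

lineiterator = iter(lines)    # an iterator: supports next() and for-loops
first = next(lineiterator)    # the leading empty line (foo starts with '\n')
rest = list(lineiterator)     # iteration resumes where next() left off
print(repr(first), rest)
```

Skipping a line with next() and then continuing with a for-loop only works because both operate on the same iterator state.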
Recommended answer
Here are three possibilities:
foo = """
this is
a multi-line string.
"""

def f1(foo=foo): return iter(foo.splitlines())

def f2(foo=foo):
    retval = ''
    for char in foo:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

def f3(foo=foo):
    prevnl = -1
    while True:
        nextnl = foo.find('\n', prevnl + 1)
        if nextnl < 0: break
        yield foo[prevnl + 1:nextnl]
        prevnl = nextnl

if __name__ == '__main__':
    for f in f1, f2, f3:
        print(list(f()))
Running this as the main script confirms the three functions are equivalent. With timeit (and a * 100 for foo to get substantial strings for more precise measurement):
$ python -mtimeit -s'import asp' 'list(asp.f3())'
1000 loops, best of 3: 370 usec per loop
$ python -mtimeit -s'import asp' 'list(asp.f2())'
1000 loops, best of 3: 1.36 msec per loop
$ python -mtimeit -s'import asp' 'list(asp.f1())'
10000 loops, best of 3: 61.5 usec per loop
Note we need the list() call to ensure the iterators are traversed, not just built.
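The same kind of measurement can also be run from inside Python with the timeit module. The following is only a sketch under assumptions: Python 3, f1 and f3 redefined inline rather than imported from the asp module, and a loop count of 200 chosen arbitrarily. The numbers it prints will differ from those above.

```python
import timeit

foo = """
this is
a multi-line string.
""" * 100

def f1(foo=foo):
    return iter(foo.splitlines())

def f3(foo=foo):
    prevnl = -1
    while True:
        nextnl = foo.find('\n', prevnl + 1)
        if nextnl < 0:
            break
        yield foo[prevnl + 1:nextnl]
        prevnl = nextnl

loops = 200  # arbitrary loop count for this sketch
for func in (f1, f3):
    # list() forces a full traversal; without it only iterator creation is timed
    best = min(timeit.repeat(lambda: list(func()), number=loops, repeat=3))
    print("%s: %.1f usec per loop" % (func.__name__, best / loops * 1e6))
```

Taking the minimum of several repeats mirrors the "best of 3" figure that the command-line tool reports.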
In other words, the naive implementation is so much faster it isn't even funny: 6 times faster than my attempt with find calls, which in turn is roughly 4 times faster than a lower-level approach.
Lessons to retain: measurement is always a good thing (but must be accurate); string methods like splitlines are implemented in very fast ways; putting strings together by programming at a very low level (especially by loops of += of very small pieces) can be quite slow.
Edit: added @Jacob's proposal, slightly modified to give the same results as the others (trailing blanks on a line are kept), i.e.:
from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip('\n')
        else:
            # Python 2 idiom; under PEP 479 (Python 3.7+) use a plain return here
            raise StopIteration
Measuring gives:
$ python -mtimeit -s'import asp' 'list(asp.f4())'
1000 loops, best of 3: 406 usec per loop
Not quite as good as the .find-based approach. Still, it is worth keeping in mind because it might be less prone to small off-by-one bugs (any loop where you see occurrences of +1 and -1, like my f3 above, should automatically trigger off-by-one suspicions, and so should many loops which lack such tweaks and should have them; though I believe my code is also right, since I was able to check its output against the other functions').
But the split-based approach still rules.
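One boundary case the test string does not exercise is input without a trailing newline. As a sketch (Python 3 assumed; f3_with_tail is a hypothetical variant, not part of the answer), the find-based loop as written stops at the last '\n' and so drops an unterminated final line, whereas splitlines() keeps it:

```python
def f1(foo):
    return iter(foo.splitlines())

def f3(foo):
    prevnl = -1
    while True:
        nextnl = foo.find('\n', prevnl + 1)
        if nextnl < 0:
            break
        yield foo[prevnl + 1:nextnl]
        prevnl = nextnl

def f3_with_tail(foo):
    # Hypothetical variant: same loop as f3, plus a final yield for
    # any text left after the last '\n'.
    prevnl = -1
    while True:
        nextnl = foo.find('\n', prevnl + 1)
        if nextnl < 0:
            break
        yield foo[prevnl + 1:nextnl]
        prevnl = nextnl
    if prevnl + 1 < len(foo):
        yield foo[prevnl + 1:]

text = "this is\na multi-line string."   # note: no trailing newline
print(list(f1(text)))
print(list(f3(text)))
print(list(f3_with_tail(text)))
```

For the test string in the question, which ends in '\n', f3 and f3_with_tail produce the same lines, so the measurements above are unaffected.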
An aside: possibly better style for f4 would be:
from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl == '': break
        yield nl.strip('\n')
At least, it's a bit less verbose. The need to strip trailing \n characters unfortunately prohibits the clearer and faster replacement of the while loop with return iter(stri) (the iter part whereof is redundant in modern versions of Python, I believe since 2.3 or 2.4, but it's also innocuous). Maybe worth trying, also:
return itertools.imap(lambda s: s.strip('\n'), stri)
or variations thereof, but I'm stopping here since it's pretty much a theoretical exercise with respect to the split-based approach, which is the simplest and fastest one.
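For readers on Python 3, where cStringIO and itertools.imap no longer exist, a rough equivalent might look like the sketch below. It assumes io.StringIO as the replacement and relies on the built-in map being lazy; the names f4_py3 and f5_py3 are hypothetical, and no timings are claimed for them.

```python
import io

foo = """
this is
a multi-line string.
"""

def f4_py3(foo=foo):
    # An io.StringIO object iterates over its lines, each still ending in '\n',
    # so a generator expression can strip them lazily.
    stri = io.StringIO(foo)
    return (line.strip('\n') for line in stri)

def f5_py3(foo=foo):
    # Python 3's built-in map is lazy, as itertools.imap was in Python 2.
    return map(lambda s: s.strip('\n'), io.StringIO(foo))

print(list(f4_py3()))
print(list(f5_py3()))
```

Both return true iterators, so a parser can call next() on them directly as well as loop over them.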