itertools.izip brokenness


Problem description


The code below should be pretty self-explanatory.
I want to read two files in parallel, so that I
can print corresponding lines from each, side by
side. itertools.izip() seems the obvious way
to do this.

izip() will stop iterating when it reaches the
end of the shortest file. I don't know how to
tell which file was exhausted so I just try printing
them both. The exhausted one will generate a
StopIteration, the other will continue to be
iterable.

The problem is that sometimes, depending on which
file is the shorter, a line ends up missing,
appearing neither in the izip() output nor in
the subsequent direct file iteration. I would
guess that it was in izip's buffer when izip
terminated due to the exception on the other file.

This behavior seems just plain broken, especially
because it depends on the order of izip's
arguments, and is not documented anywhere I saw.
It makes using izip() for iterating files in
parallel essentially useless (unless you are
lucky enough to have files of the same length).

Also, it seems to me that this is likely a problem
with any iterables of different lengths.
I am hoping I am missing something...

#---------------------------------------------------------
# Task: print contents of file1 in column 1, and
# contents of file2 in column two. iterators and
# izip() are the "obvious" way to do it.

from itertools import izip
import cStringIO, pdb

def prt_files (file1, file2):

    for line1, line2 in izip (file1, file2):
        print line1.rstrip(), "\t", line2.rstrip()

    try:
        for line1 in file1:
            print line1,
    except StopIteration: pass

    try:
        for line2 in file2:
            print "\t",line2,
    except StopIteration: pass

if __name__ == "__main__":
    # Use StringIO to simulate files. Real files
    # show the same behavior.
    f = cStringIO.StringIO

    print "Two files with same number of lines work ok."
    prt_files (f("abc\nde\nfgh\n"), f("xyz\nwv\nstu\n"))

    print "\nFirst file shorter is also ok."
    prt_files (f("abc\nde\n"), f("xyz\nwv\nstu\n"))

    print "\nSecond file shorter is a problem."
    prt_files (f("abc\nde\nfgh\n"), f("xyz\nwv\n"))
    print "What happened to \"fgh\" line that should be in column 1?"

    print "\nBut only a problem for one line."
    prt_files (f("abc\nde\nfgh\nijk\nlm\n"), f("xyz\nwv\n"))
    print "The line \"fgh\" is still missing, but following\n" \
          "line(s) are ok! Looks like izip() ate a line."


Recommended answer

ru***@yahoo.com wrote:
The problem is that sometimes, depending on which file is the
shorter, a line ends up missing, appearing neither in the izip()
output nor in the subsequent direct file iteration. I would guess
that it was in izip's buffer when izip terminated due to the
exception on the other file.


Oh man, this is ugly. The problem is there's no way to tell whether
an iterator is empty, other than by reading from it.

http://aspn.activestate.com/ASPN/Coo.../Recipe/413614

has a kludge that you can use inside a function, but that's no good
for something like izip.

For a temporary hack you could make a wrapped iterator that allows
pushing items back onto the iterator (sort of like ungetc) and a
version of izip that uses it, or a version of izip that tests the
iterators you pass it using the above recipe.
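
The push-back idea might look roughly like the sketch below. It is written for Python 3 (where izip is the built-in zip), and the names PushbackIterator and careful_zip are made up for illustration; neither exists in the standard library. When one input runs out, the items already read in that round are handed back to the iterators they came from, so nothing is lost.

```python
class PushbackIterator:
    """Hypothetical wrapper: an iterator that accepts items back (like ungetc)."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._pushed = []  # items handed back; served before any new item

    def __iter__(self):
        return self

    def __next__(self):
        if self._pushed:
            return self._pushed.pop()
        return next(self._it)

    def push(self, item):
        self._pushed.append(item)


def careful_zip(*iterators):
    """Like zip, but un-reads the partial round when an input is exhausted.

    Every argument is assumed to be a PushbackIterator.
    """
    if not iterators:
        return
    while True:
        row = []
        for it in iterators:
            try:
                row.append(next(it))
            except StopIteration:
                # Give each item read in this round back to its source.
                for source, item in zip(iterators, row):
                    source.push(item)
                return
        yield tuple(row)


first = PushbackIterator(["abc", "de", "fgh", "ijk"])
second = PushbackIterator(["xyz", "wv"])
print(list(careful_zip(first, second)))  # [('abc', 'xyz'), ('de', 'wv')]
print(list(first))                       # ['fgh', 'ijk']
```

The cost is that callers have to wrap their inputs first, which is why this is only a workaround and not a fix for izip itself.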

It's probably not reasonable to ask that an emptiness test be added to
the iterator interface, since the zillion iterator implementations now
existing won't support it.

A different possible long term fix: change StopIteration so that it
takes an optional arg that the program can use to figure out what
happened. Then change izip so that when one of its iterator args runs
out, it wraps up the remaining ones in a new tuple and passes that
to the StopIteration it raises. Untested:

def izip(*iterlist):
    while True:
        z = []
        finished = []      # iterators that have run out
        still_alive = []   # iterators that are still alive
        for i in iterlist:
            try:
                z.append(i.next())
                still_alive.append(i)
            except StopIteration:
                finished.append(i)
        if not finished:
            yield tuple(z)
        else:
            raise StopIteration, (still_alive, finished)

You would want some kind of extended for-loop syntax (maybe involving
the new "with" statement) with a clean way to capture the exception info.
You'd then use it to continue the izip where it left off, with the
new (smaller) list of iterators.
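
For what it's worth, later Python versions (3.3 and up) let a generator return a value, which arrives as the value attribute of the StopIteration it raises, so something close to this proposal can be sketched without new syntax. The names zip_reporting and collect below are hypothetical helpers, not library functions. Unlike the untested code above, this sketch also hands back the items already read from the live iterators in the final round, on the assumption that the caller wants them.

```python
def zip_reporting(*iterables):
    """Yield tuples like zip; when an input ends, return a final report."""
    iters = [iter(it) for it in iterables]
    while iters:
        row = []
        pending = []   # (iterator, item already read from it) for live inputs
        finished = []  # iterators that ran out in this round
        for it in iters:
            try:
                item = next(it)
            except StopIteration:
                finished.append(it)
            else:
                row.append(item)
                pending.append((it, item))
        if finished:
            # A generator's return value travels on StopIteration.value.
            return pending, finished
        yield tuple(row)
    return [], []


def collect(gen):
    """Run a generator to the end; return its items and its return value."""
    rows = []
    while True:
        try:
            rows.append(next(gen))
        except StopIteration as stop:
            return rows, stop.value


rows, (pending, finished) = collect(
    zip_reporting(["abc", "de", "fgh", "ijk"], ["xyz", "wv"]))
print(rows)                # [('abc', 'xyz'), ('de', 'wv')]
live, unread = pending[0]
print(unread, list(live))  # fgh ['ijk']
```

A plain for loop still discards the return value, so the caller has to drive the generator by hand (or use yield from), which is the "clean way to capture the exception info" that was missing at the time.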


But that is exactly the behaviour of Python iterators; I don't see what
is broken.

izip/zip just read from the respective streams and give back a tuple
if they can get one item from each, otherwise they stop. And because a
Python iterator can only go in one direction, the items already consumed
are lost in the zip/izip call.
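
A small experiment makes the order dependence visible. This is a sketch for Python 3, where izip is the built-in zip and io.StringIO stands in for cStringIO; drain_after_zip is a made-up helper name.

```python
import io


def drain_after_zip(first, second):
    """Zip two iterators, then collect whatever each one still yields."""
    pairs = list(zip(first, second))
    return pairs, list(first), list(second)


# Longer input first: zip reads "fgh" from it, then finds the second
# input exhausted and stops, so "fgh" is dropped on the floor.
pairs, rest1, rest2 = drain_after_zip(
    io.StringIO("abc\nde\nfgh\nijk\n"), io.StringIO("xyz\nwv\n"))
print(rest1)  # ['ijk\n']

# Shorter input first: zip hits the exhausted input before it touches
# the longer one in that round, so nothing is lost.
pairs, rest1, rest2 = drain_after_zip(
    io.StringIO("xyz\nwv\n"), io.StringIO("abc\nde\nfgh\nijk\n"))
print(rest2)  # ['fgh\n', 'ijk\n']
```

The arguments are read left to right in each round, which is why only inputs to the left of the first exhausted one lose an item.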

I think you need to use map(None, ...), which would not drop anything,
just fill the gaps with None. Though you don't have a relatively lazy
version, as imap(None, ...) doesn't behave like map but a bit like zip.
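
A lazy equivalent did arrive later: itertools.izip_longest in Python 2.6, renamed zip_longest in Python 3. The sketch below assumes Python 3; side_by_side is a hypothetical helper name, and using an empty string as the fill value is a choice made here for printing, not something from the thread.

```python
import io
from itertools import zip_longest  # izip_longest in Python 2.6 and 2.7


def side_by_side(lines1, lines2):
    """Pair up lines lazily, padding the shorter input with empty strings."""
    rows = []
    for left, right in zip_longest(lines1, lines2, fillvalue=""):
        rows.append(left.rstrip() + "\t" + right.rstrip())
    return rows


for row in side_by_side(io.StringIO("abc\nde\nfgh\n"), io.StringIO("xyz\nwv\n")):
    print(row)
```

Because every input is read to the end, no separate clean-up loop over the leftover lines is needed.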

ru***@yahoo.com wrote:
[izip() eats one line]

as far as i can see the current implementation cannot be changed
to do the Right Thing in your case. python's iterators don't allow
you to "look ahead", so izip can only get the next element. if this
fails for one iterator, the elements already fetched from the other
iterators in that round are lost.

maybe the documentation for izip should note that the given
iterators are not necessarily in a sane state afterwards.

for your problem you can do something like:

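def izipall(*args):
    iters = [iter(it) for it in args]
    while iters:
        result = []
        # loop over a copy, because exhausted iterators are removed from iters
        for it in iters[:]:
            try:
                x = it.next()
            except StopIteration:
                iters.remove(it)
            else:
                result.append(x)
        if result:  # skip the empty tuple left once every iterator has run out
            yield tuple(result)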

note that this does not yield tuples that are always the same
length, so "for x, y in izipall()" won't work. instead, do something
like "for seq in izipall(): print '\t'.join(seq)".
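
to show what the variable-length tuples look like, here is a rough Python 3 port (next(it) instead of it.next()), renamed zip_all to make clear it is a sketch and not the code above. it iterates over a copy of the list while removing exhausted iterators, and skips the empty final round.

```python
def zip_all(*iterables):
    """Yield tuples of the items still available; tuples shrink as inputs end."""
    iters = [iter(it) for it in iterables]
    while iters:
        row = []
        for it in list(iters):  # copy: iters is modified inside the loop
            try:
                row.append(next(it))
            except StopIteration:
                iters.remove(it)
        if row:
            yield tuple(row)


for seq in zip_all(["abc", "de", "fgh"], ["xyz", "wv"]):
    print("\t".join(seq))
```

one caveat: once an input has ended, a short tuple no longer says which column its items belong to, so a line from file2 would print in column 1 if file1 were the shorter one.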

hope i was clear enough, David.

