How to slice a generator object or iterator in Python


Question


I would like to loop over a "slice" of an iterator. I'm not sure if this is possible as I understand that it is not possible to slice an iterator. What I would like to do is this:

def f():
    for i in range(100):
        yield(i)
x = f()

for i in x[95:]:
    print(i)

This of course fails with:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-15f166d16ed2> in <module>()
  4 x = f()
  5 
----> 6 for i in x[95:]:
  7     print(i)

TypeError: 'generator' object is not subscriptable

Is there a pythonic way to loop through a "slice" of a generator?

Basically the generator I'm actually concerned with reads a very large file and performs some operations on it line by line. I would like to test slices of the file to make sure that things are performing as expected, but it is very time consuming to let it run over the entire file.

Edit:
As mentioned, I need to do this on a file. I was hoping that there was a way of specifying this explicitly with the generator, for instance:

import skbio

f = 'seqs.fna'
seqs = skbio.io.read(f, format='fasta')

seqs is a generator object

for seq in itertools.islice(seqs, 30516420, 30516432):
    #do a bunch of stuff here
    pass

The above code does what I need; however, it is still very slow, as the generator still loops through all of the lines. I was hoping to loop over only the specified slice.

Solution

In general, the answer is itertools.islice, but you should note that islice doesn't, and can't, actually skip values. It just grabs and throws away start values before it starts yield-ing values. So it's usually best to avoid islice if possible when you need to skip a lot of values and/or the values being skipped are expensive to acquire/compute. If you can find a way to not generate the values in the first place, do so. In your (obviously contrived) example, you'd just adjust the start index for the range object.

In the specific case of trying to run on a file object, pulling a huge number of lines (particularly when reading from a slow medium) may not be ideal. Assuming you don't need specific lines, one trick you can use to avoid actually reading huge blocks of the file, while still testing some distance into the file, is to seek to a guessed offset, read out to the end of the line (to discard the partial line you probably seeked into the middle of), then islice off however many lines you want from that point. For example:

import itertools

with open('myhugefile') as f:
    # Assuming roughly 80 characters per line, this seeks to somewhere roughly
    # around the 100,000th line without reading in the data preceding it
    f.seek(80 * 100000)
    next(f)  # Throw away the partial line you probably landed in the middle of
    for line in itertools.islice(f, 100):  # Process 100 lines
        # Do stuff with each line
        pass

For the specific case of files, you might also want to look at mmap which can be used in similar ways (and is unusually useful if you're processing blocks of data rather than lines of text, possibly randomly jumping around as you go).
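For illustration, here is a sketch of the same seek-and-discard trick using mmap. The file contents and the rough bytes-per-line guess are assumptions; a throwaway stand-in file is generated so the snippet runs on its own:

```python
import mmap
import tempfile

# Build a throwaway file standing in for the huge file (an assumption;
# substitute your real path)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.writelines(f"line {i}\n" for i in range(200_000))
    path = tmp.name

with open(path, 'rb') as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    mm.seek(8 * 100_000)   # guess ~8 bytes/line to land near line 100,000
    mm.readline()          # discard the partial line we probably landed in
    line = mm.readline()   # first complete line after the seek point
    print(line)            # a bytes object, e.g. b'line ...\n'
```

Because the mapping is lazy, the bytes before the seek point are never copied into Python; for binary (non-line-oriented) data you can also slice the map directly, e.g. `mm[start:end]`.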

Update: From your updated question, you'll need to look at your API docs and/or data format to figure out exactly how to skip around properly. It looks like skbio offers some features for skipping using seq_num, but that will still read, if not process, most of the file. If the data was written out with equal sequence lengths, I'd look at the docs on Alignment; aligned data may be loadable without processing the preceding data at all, e.g. by using Alignment.subalignment to create new Alignment objects that skip the rest of the data for you.
