Access list of items with list of indices


Problem description

Consider a large list of named items (the first line) read from a large CSV file (80 MB), possibly with blank gaps between the names:

name_line = ['a', '', 'b', '', 'c', ..., '', 'cb', 'cc']

I read the rest of the data line by line, and I only need to process the values that have a corresponding name. A data line might look like:

data_line = ['10', '', '.5', '', '10289', ..., '', '16.7', '0']

I tried it two ways. One is popping the empty columns from each line as it is read:

blnk_cols = [1,3, ... ,97]
while data:
    ...
    # pop from the highest index down so earlier pops do not shift later positions
    for index in reversed(blnk_cols):
        data_line.pop(index)
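A small sketch of why the order of the pops matters: every pop moves all later elements one place to the left, so popping in ascending index order removes the wrong items, while popping from the highest index down does not. The function names and the five-column row are made up for illustration.

```python
def pop_ascending(row, blank_cols):
    """Pop blank columns in ascending order (indices drift after each pop)."""
    row = list(row)  # work on a copy
    for index in blank_cols:
        row.pop(index)
    return row


def pop_descending(row, blank_cols):
    """Pop blank columns from the highest index down (no drift)."""
    row = list(row)  # work on a copy
    for index in sorted(blank_cols, reverse=True):
        row.pop(index)
    return row


row = ['a', '', 'b', '', 'c']
blank_cols = [1, 3]

print(pop_ascending(row, blank_cols))   # a blank survives and 'c' is lost
print(pop_descending(row, blank_cols))  # only the named values remain
```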

The other is building a new list from the items associated with a name from line 1 (the name line):

good_cols = [0,2,4, ... ,98,99]   
while data:
    ...
    data_line = [data_line[index] for index in good_cols]

In the data I am using there will definitely be more good columns than bad ones, although the split might be as close as half and half.

I used the cProfile and pstats packages to find the slowest parts of my program, which suggested that the pop was currently the slowest item. I switched to the list comprehension and the time almost doubled.

I imagine one fast way would be to slice the array retrieving only good data, but this would be complicated for files with alternating blank and good data.
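The slicing idea could be sketched roughly as follows, assuming good_cols is sorted: group the indices into consecutive runs once, then copy each run with a plain slice for every line. The helper names contiguous_slices and take_runs are hypothetical, and whether this beats a list comprehension depends on how long the runs are, so it would need to be profiled on the real file.

```python
def contiguous_slices(good_cols):
    """Group sorted column indices into slice objects, one per consecutive run."""
    slices = []
    start = prev = None
    for col in good_cols:
        if start is None:
            start = prev = col
        elif col == prev + 1:
            prev = col
        else:
            slices.append(slice(start, prev + 1))
            start = prev = col
    if start is not None:
        slices.append(slice(start, prev + 1))
    return slices


def take_runs(row, slices):
    """Collect the items of row covered by the given slices."""
    out = []
    for s in slices:
        out.extend(row[s])
    return out


good_cols = [0, 1, 2, 5, 6, 9]
slices = contiguous_slices(good_cols)  # computed once, before the read loop

row = list('abcdefghij')
print(take_runs(row, slices))
```

For a file where good and blank columns strictly alternate, a single stride slice such as row[::2] would do the same job without any helper.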

What I really need is to be able to do:

data_line = data_line[good_cols]

effectively passing a list of indices into a list to get back those items. Right now my program runs in about 2.3 seconds for a 10 MB file, and the pop accounts for about 0.3 seconds.

Is there a faster way to access certain locations in a list? In C it would just be dereferencing an array of pointers to the correct indices in the array.

Addition: name_line as it appears in the file, before it is read:

a,b,c,d,e,f,g,,,,,h,i,j,k,,,,l,m,n,

name_line after reading and split(","):

['a','b','c','d','e','f','g','','','','','h','i','j','k','','','','l','m','n','\n']
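Given a name line like this, the list of good column indices can be derived once from the header instead of being written out by hand. This is only a sketch: the variable header is an assumed name for the raw first line, and treating whitespace-only entries (such as the trailing '\n') as blank is an assumption about what counts as a missing name.

```python
header = 'a,b,c,d,e,f,g,,,,,h,i,j,k,,,,l,m,n,\n'

name_line = header.split(',')

# Keep the positions whose name is non-empty after stripping whitespace,
# which also drops the trailing '\n' entry.
good_cols = [i for i, name in enumerate(name_line) if name.strip()]

names = [name_line[i] for i in good_cols]
print(good_cols)
print(names)
```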

Solution

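Try a generator expression. Bind it to a new name: the body of a generator expression is evaluated lazily, so reusing the name data_line would make it try to index the generator itself.

selected = (data_line[i] for i in good_cols)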

Also read about generator expressions vs. list comprehensions.

As the top answer on that question says: 'Basically, use a generator expression if all you're doing is iterating once'.

So you should benefit from this.
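If a concrete sequence is needed for every line rather than a single-pass iterator, the standard library's operator.itemgetter is another option worth trying. It is not part of the original answer; the sketch below uses made-up rows and assumes good_cols is already known. Whether it is actually faster for a given file would have to be measured with the profiler.

```python
from operator import itemgetter

good_cols = [0, 2, 4]

# Build the getter once, outside the read loop.
pick = itemgetter(*good_cols)

rows = [
    ['10', '', '.5', '', '10289'],
    ['7', '', '1.5', '', '3'],
]

# Each call returns a tuple holding the items at good_cols.
picked = [pick(row) for row in rows]
print(picked)
```

One caveat: when itemgetter is given a single index it returns the bare item rather than a one-element tuple, so a file with only one good column would need special handling.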
