Numpy: loadtxt() with variable number of columns

Problem description

I have a file of tab-separated values where the first half of the file has 3 columns and N rows and the second half has 2 columns and M rows. I need to convert such a file into two separate arrays: a 3xN and a 2xM.

Example:

   6.7900209022264466       -3.8259897286289504        13.563976248832137     
   1.5334543760683907        12.723711617874176        1.5148291755004299     
   2.4282763900233522        9.1305022788201136       -3.1003673775485394     
  -6.5344717544805586E-002  -12.487743380186622        2.6928902187606480     
   8.9067951331740804        13.403331728374390      -0.58045132774289632     
  -11.842481592786449       -5.7083783211328551        1.9526760053685255     
  -10.240286781275808        13.204312088815593        4.4856524683466175     
  -4.6690658488407504       -6.2809313597959449        7.4378900284937082     
  -9.5874077836478282       -8.6799071183782903       -1.8203838010218165     
  0.62588896716878051       -5.4614995295716540        11.166650096421838     
           0        4173
           0        1998
           0         611
           0        8606
           1        6912
           1        9671
           1        7993
           1        8513
           2        5556
           2        4422
           2        3047

I cannot simply use loadtxt() to read such a file because this would result in the error ValueError: Wrong number of columns at line ...

Is there a way to use loadtxt() or some similar function to read such a file?

I would like to avoid using readlines() and split() and then converting to float, because this would make the code slower (I think...) and longer. I have also tried pandas.read_csv(), but I need an array as output.
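For what it's worth, a pandas DataFrame can be turned into a plain numpy array with .to_numpy(). A minimal sketch, assuming the split row N is known and using filename as a placeholder (not part of the original post):

    import pandas as pd

    # Illustrative only: read each half of the file separately, then convert
    # each DataFrame to a plain numpy array.
    a = pd.read_csv(filename, sep=r'\s+', header=None, nrows=N).to_numpy()
    b = pd.read_csv(filename, sep=r'\s+', header=None, skiprows=N).to_numpy()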


Update:

For now, following hpaulj's suggestion, I'm doing it with readlines() and split(), like this:

    with open(filename,"r") as f:
        all_data=[x.split() for x in f.readlines()]
        a=array([map(float,x) for x in all_data[:N]])
        b=array([map(int,x) for x in all_data[N+1:]])

It is actually pretty fast, but I would still like to know if someone knows a faster (and maybe simpler) method.

Solution

Using [x.split() for x in f.readlines()] will unfortunately load all lines as string objects in a Python list, which will be slow and require a lot more memory than a numpy array.

Assuming you know the split point in advance (you already use N in your update), you can do the following:

import numpy
from itertools import islice

with open(filename, 'r') as f:
    # islice yields only the first N lines, so loadtxt stops there
    first_part = numpy.loadtxt(islice(f, N))
    # the file handle now points just past line N, so this reads the rest
    second_part = numpy.loadtxt(f)

islice stops yielding lines once numpy has read N of them. When the second loadtxt is called on the same file object, numpy starts where it previously stopped, so you don't have to do anything more.

Because this uses only iterators, it doesn't require storing all the intermediate lines as strings.
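For reference, here is a self-contained sketch of the same islice approach; the file name data.txt and the loop that counts N are illustrative assumptions, not part of the original answer:

import numpy
from itertools import islice

filename = "data.txt"  # hypothetical path to a file like the example above

# One possible way to find N: count the leading lines that have three fields.
with open(filename) as f:
    N = 0
    for line in f:
        if len(line.split()) != 3:
            break
        N += 1

with open(filename) as f:
    first_part = numpy.loadtxt(islice(f, N))    # floats, shape (N, 3)
    second_part = numpy.loadtxt(f, dtype=int)   # ints, shape (M, 2)

print(first_part.shape, second_part.shape)      # e.g. (10, 3) and (11, 2) for the sample data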
