Exclude columns from genfromtxt with numpy


Question

Is it possible to exclude all string columns when using genfromtxt from the numpy library?

I have a CSV file with this type of data from the machine learning website:

antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1

With my current settings, np.genfromtxt(path, dtype=float, names=None, delimiter=','), the string field is read as nan. That makes sense, but I would like to exclude all string columns.

I know there is the usecols=(1,2) parameter, but that would require me to specify the columns for each data set I need to use. I would prefer an "exclusion" method over an "inclusion" method.

Should I use a different method, or process each line myself?

Recommended answer

You could filter out the columns that contain nan after reading.

In [52]: txt=b'antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1'
In [53]: txt=[txt,txt]
In [54]: A=np.genfromtxt(txt, dtype=float, names=None,delimiter=',')
In [55]: A
Out[55]: 
array([[ nan,   1.,   0.,   0.,   1.,   0.,   0.,   0.,   1.,   1.,   1.,
          0.,   0.,   4.,   1.,   0.,   1.,   1.],
       [ nan,   1.,   0.,   0.,   1.,   0.,   0.,   0.,   1.,   1.,   1.,
          0.,   0.,   4.,   1.,   0.,   1.,   1.]])

Find the columns that are nan in all rows with np.isnan(A).all(axis=0); or use .any instead to find the columns that contain any nan. Other tests are possible.

In [56]: ind=np.isnan(A).all(axis=0)
In [57]: ind
Out[57]: 
array([ True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False], dtype=bool)
In [58]: A[:,~ind]
Out[58]: 
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  4.,
         1.,  0.,  1.,  1.],
       [ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  4.,
         1.,  0.,  1.,  1.]])
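The nan-filtering steps above can be sketched as a standalone script. This is only an illustration: it reads from an in-memory io.StringIO instead of a file path, reuses the sample row twice as the answer does, and the variable names (text, all_nan, numeric) are chosen here for the example.

```python
import io

import numpy as np

# Two copies of the sample row, standing in for the CSV file.
row = "antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1"
text = row + "\n" + row + "\n"

# With dtype=float, fields that cannot be parsed as numbers become nan.
A = np.genfromtxt(io.StringIO(text), dtype=float, delimiter=",")

# Boolean mask of columns that are nan in every row (the string columns).
all_nan = np.isnan(A).all(axis=0)

# Keep only the columns that are not entirely nan.
numeric = A[:, ~all_nan]
print(numeric.shape)
```

Using .all rather than .any keeps a numeric column that merely has a few missing values; which test is appropriate depends on the data.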

Another idea is to read the file once with dtype=None, letting genfromtxt choose the dtype for each column. The resulting compound dtype can then be filtered to find the columns of the desired type. Note that the '<i4' test used here matches 32-bit integers; on platforms where the default integer is 64-bit, the string to match would be '<i8'.

In [118]: A=np.genfromtxt(txt, dtype=None, names=None,delimiter=',')
In [119]: ind=[i for i, d in enumerate(A.dtype.descr) if d[1]=='<i4']
In [120]: A=np.genfromtxt(txt, dtype=None, names=None,delimiter=',',usecols=ind) 
In [121]: A
Out[121]: 
array([[1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1],
       [1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1]])
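Comparing against a literal such as '<i4' ties the filter to one platform's default integer size. A possible variant, sketched here under the assumption that "numeric" is the desired test, uses np.issubdtype on each field's dtype instead. The io.StringIO input, the encoding="utf-8" argument, and the variable names are choices made for this example, not part of the original answer.

```python
import io

import numpy as np

row = "antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1"
text = row + "\n" + row + "\n"

# First pass: let genfromtxt infer a dtype per column (structured array).
first = np.genfromtxt(io.StringIO(text), dtype=None, delimiter=",",
                      encoding="utf-8")

# Indices of the columns whose inferred dtype is numeric, whatever its size.
keep = [i for i, name in enumerate(first.dtype.names)
        if np.issubdtype(first.dtype[name], np.number)]

# Second pass: read only those columns.
numeric = np.genfromtxt(io.StringIO(text), dtype=None, delimiter=",",
                        encoding="utf-8", usecols=keep)
print(keep)
print(numeric.shape)
```

With a real file, the same path would be passed to both calls in place of the io.StringIO objects.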

The dtype could also be filtered to collect the names of the columns that have the correct type:

In [128]: A=np.genfromtxt(txt, dtype=None, names=None,delimiter=',')
In [129]: ind=[d[0] for d in A.dtype.descr if d[1]=='<i4']
In [130]: A[ind]
Out[130]: 
array([(1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1),
       (1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1)], 
      dtype=[('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<i4'), ('f6', '<i4'), ('f7', '<i4'), ('f8', '<i4'), ('f9', '<i4'), ('f10', '<i4'), ('f11', '<i4'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4'), ('f15', '<i4'), ('f16', '<i4'), ('f17', '<i4')])

Consolidating this structured array into a 2d array with a single dtype (int) is a bit of a pain, though (I could go into the details if needed).
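One way to do that consolidation, not shown in the original answer, is the helper numpy.lib.recfunctions.structured_to_unstructured, available in NumPy 1.16 and later. The sketch below assumes that helper is acceptable and again uses an in-memory copy of the sample row; the variable names are illustrative.

```python
import io

import numpy as np
from numpy.lib import recfunctions as rfn

row = "antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1"
text = row + "\n" + row + "\n"

# Structured array with one field per column (f0 is the string column).
A = np.genfromtxt(io.StringIO(text), dtype=None, delimiter=",",
                  encoding="utf-8")

# Names of the integer-typed fields.
int_names = [name for name in A.dtype.names
             if np.issubdtype(A.dtype[name], np.integer)]

# Select those fields and convert them to a plain 2d array.
flat = rfn.structured_to_unstructured(A[int_names])
print(flat.shape, flat.dtype)
```

This avoids reading the file a second time, at the cost of holding the structured array in memory while the 2d array is built.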
