Exclude columns from genfromtxt with numpy
Question
Is it possible to exclude all string columns using genfromtxt from the numpy library?
I have a CSV file containing this type of data, taken from the machine learning website:
antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
With my current settings, np.genfromtxt(path, dtype=float, names=None, delimiter=','), the string is read as nan, which makes sense, but I would like to exclude all string columns.
I know there is the usecols=(1,2) parameter, but that would require me to specify the columns for this and every other data set I need to use. I would prefer an "exclusion" approach over an "inclusion" approach.
Should I use a different method, or process each line myself?
Recommended answer
You could filter out the columns with nan after reading.
In [52]: txt=b'antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1'
In [53]: txt=[txt,txt]
In [54]: A=np.genfromtxt(txt, dtype=float, names=None,delimiter=',')
In [55]: A
Out[55]:
array([[ nan, 1., 0., 0., 1., 0., 0., 0., 1., 1., 1.,
0., 0., 4., 1., 0., 1., 1.],
[ nan, 1., 0., 0., 1., 0., 0., 0., 1., 1., 1.,
0., 0., 4., 1., 0., 1., 1.]])
The mask below flags the columns that are nan in all rows; I could use .any instead for columns with any nan. Other tests are possible.
In [56]: ind=np.isnan(A).all(axis=0)
In [57]: ind
Out[57]:
array([ True, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False], dtype=bool)
In [58]: A[:,~ind]
Out[58]:
array([[ 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 4.,
1., 0., 1., 1.],
[ 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 4.,
1., 0., 1., 1.]])
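The nan-filtering steps above can be wrapped into a small helper so that no column indices have to be listed per data set. This is only a sketch: the name `load_numeric_columns` is made up for illustration, the sample rows simply repeat the row from the question, and it assumes the input has more than one column.

```python
import numpy as np

# Sample input: the row from the question, repeated twice.
lines = ["antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1"] * 2

def load_numeric_columns(source, delimiter=","):
    """Hypothetical helper: read everything as float, then drop
    the columns that are nan in every row (the string columns)."""
    data = np.genfromtxt(source, dtype=float, delimiter=delimiter)
    # A single-row input comes back 1-D; make it 2-D so the
    # column mask below applies (assumes more than one column).
    data = np.atleast_2d(data)
    all_nan = np.isnan(data).all(axis=0)  # True for all-nan columns
    return data[:, ~all_nan]

numeric = load_numeric_columns(lines)
print(numeric.shape)
```

Using `.all(axis=0)` keeps a numeric column that merely has a few missing values; switching to `.any(axis=0)` would drop such a column as well.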
Another idea is to read the file once with dtype=None, letting genfromtxt choose the dtype for each column. The resulting compound dtype can be filtered to find the columns of the desired type.
In [118]: A=np.genfromtxt(txt, dtype=None, names=None,delimiter=',')
In [119]: ind=[i for i, d in enumerate(A.dtype.descr) if d[1]=='<i4']
In [120]: A=np.genfromtxt(txt, dtype=None, names=None,delimiter=',',usecols=ind)
In [121]: A
Out[121]:
array([[1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1],
[1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1]])
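The comparison with the literal '<i4' depends on the default integer size of the platform, so on many systems the inferred type would be '<i8' instead. A possibly more portable variant of the same two-pass idea checks the dtype kind. This is a sketch, not the answer's exact code: the name `numeric_column_indices` is hypothetical, and `encoding="utf-8"` is an assumption about the input.

```python
import numpy as np

# Sample input: the row from the question, repeated twice.
lines = ["antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1"] * 2

def numeric_column_indices(source, delimiter=","):
    """Hypothetical helper: positions of the columns that
    genfromtxt infers as integer or float."""
    probe = np.genfromtxt(source, dtype=None, delimiter=delimiter,
                          encoding="utf-8")
    if probe.dtype.names is None:
        # Every column shares one type, so the result is a plain
        # array; keep all columns only if that type is numeric.
        ncols = np.atleast_2d(probe).shape[1]
        return list(range(ncols)) if probe.dtype.kind in "iuf" else []
    # kind is 'i'/'u' for integers and 'f' for floats.
    return [i for i, name in enumerate(probe.dtype.names)
            if probe.dtype[name].kind in "iuf"]

cols = numeric_column_indices(lines)
# Second pass: read only the numeric columns. Here they are all
# integers, so the result is an ordinary 2-D array.
data = np.genfromtxt(lines, dtype=None, delimiter=",",
                     usecols=cols, encoding="utf-8")
print(cols[0], data.shape)
```

If the selected columns mix integers and floats, the second pass with dtype=None returns a structured array again; passing dtype=float there instead should give a single 2-D float array.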
The dtype could also be filtered to collect the names of the columns that have the correct type.
In [128]: A=np.genfromtxt(txt, dtype=None, names=None,delimiter=',')
In [129]: ind=[d[0] for d in A.dtype.descr if d[1]=='<i4']
In [130]: A[ind]
Out[130]:
array([(1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1),
(1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1, 1)],
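dtype=[('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<i4'), ('f6', '<i4'), ('f7', '<i4'), ('f8', '<i4'), ('f9', '<i4'), ('f10', '<i4'), ('f11', '<i4'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4'), ('f15', '<i4'), ('f16', '<i4'), ('f17', '<i4')])
Consolidating this structured array into a 2d array with a single dtype (int) is a bit of a pain, though (I could go into the details if needed).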
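For the step of turning the selected fields into a plain 2d array, two approaches that may work are sketched below. Neither is taken from the answer itself: the first copies field by field with np.column_stack, the second uses structured_to_unstructured from numpy.lib.recfunctions, which assumes NumPy 1.16 or later. The variable names are illustrative.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Sample input: the row from the question, repeated twice.
lines = ["antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1"] * 2
A = np.genfromtxt(lines, dtype=None, delimiter=",", encoding="utf-8")

# Names of the fields whose inferred dtype is an integer type.
int_names = [name for name in A.dtype.names
             if A.dtype[name].kind in "iu"]

# Option 1: stack each selected field as one column of a new array.
stacked = np.column_stack([A[name] for name in int_names])

# Option 2: convert the multi-field selection in one call
# (assumes NumPy 1.16+).
converted = rfn.structured_to_unstructured(A[int_names])

print(stacked.shape, converted.shape)
```

Both results have one row per input line and one column per integer field; the column_stack version avoids the extra import at the cost of an explicit loop over field names.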