Numpy's genfromtxt returns different structured data depending on dtype parameters

Question

I have the following:

from numpy import genfromtxt    
seg_data1 = genfromtxt('./datasets/segmentation.all', delimiter=',', dtype="|S5")
seg_data2 = genfromtxt('./datasets/segmentation.all', delimiter=',', dtype=["|S5"] + ["float" for n in range(19)])

print seg_data1
print seg_data2

print seg_data1[:,0:1]
print seg_data2[:,0:1]

It turns out that seg_data1 and seg_data2 are not the same kind of structure. Here's what was printed:

[['BRICK' '140.0' '125.0' ..., '7.777' '0.545' '-1.12']
 ['BRICK' '188.0' '133.0' ..., '8.444' '0.538' '-0.92']
 ['BRICK' '105.0' '139.0' ..., '7.555' '0.532' '-0.96']
 ..., 
 ['CEMEN' '128.0' '161.0' ..., '10.88' '0.540' '-1.99']
 ['CEMEN' '150.0' '158.0' ..., '12.22' '0.503' '-1.94']
 ['CEMEN' '124.0' '162.0' ..., '14.55' '0.479' '-2.02']]
[ ('BRICK', 140.0, 125.0, 9.0, 0.0, 0.0, 0.2777779, 0.06296301, 0.66666675, 0.31111118, 6.185185, 7.3333335, 7.6666665, 3.5555556, 3.4444444, 4.4444447, -7.888889, 7.7777777, 0.5456349, -1.1218182)
 ('BRICK', 188.0, 133.0, 9.0, 0.0, 0.0, 0.33333334, 0.26666674, 0.5, 0.077777736, 6.6666665, 8.333334, 7.7777777, 3.8888888, 5.0, 3.3333333, -8.333333, 8.444445, 0.53858024, -0.92481726)
 ('BRICK', 105.0, 139.0, 9.0, 0.0, 0.0, 0.27777782, 0.107407436, 0.83333325, 0.52222216, 6.111111, 7.5555553, 7.2222223, 3.5555556, 4.3333335, 3.3333333, -7.6666665, 7.5555553, 0.5326279, -0.96594584)
 ...,
 ('CEMEN', 128.0, 161.0, 9.0, 0.0, 0.0, 0.55555534, 0.25185192, 0.77777785, 0.16296278, 7.148148, 5.5555553, 10.888889, 5.0, -4.7777777, 11.222222, -6.4444447, 10.888889, 0.5409177, -1.9963073)
 ('CEMEN', 150.0, 158.0, 9.0, 0.0, 0.0, 2.166667, 1.6333338, 1.388889, 0.41851807, 8.444445, 7.0, 12.222222, 6.111111, -4.3333335, 11.333333, -7.0, 12.222222, 0.50308645, -1.9434487)
 ('CEMEN', 124.0, 162.0, 9.0, 0.11111111, 0.0, 1.3888888, 1.1296295, 2.0, 0.8888891, 10.037037, 8.0, 14.555555, 7.5555553, -6.111111, 13.555555, -7.4444447, 14.555555, 0.4799313, -2.0293121)]
[['BRICK']
 ['BRICK']
 ['BRICK']
 ..., 
 ['CEMEN']
 ['CEMEN']
 ['CEMEN']]
Traceback (most recent call last):
  File "segmentationdata.py", line 14, in <module>
    print seg_data2[:,0:1]
IndexError: too many indices for array

I'd rather have genfromtxt return data in the form of seg_data1, though I don't know of any built-in way to force seg_data2 to conform to that type. As far as I know there's no easy way to do:

seg_target1 = seg_data1[:,0:1]
seg_data1 = seg_data1[:,1:]

for seg_data2. Now I could do data.astype(float), but the point is, isn't that what genfromtxt should have done to begin with when I gave it that dtype array?

Recommended answer

With dtype="|S5" you import all columns as 5-character strings (longer values are truncated, which is why 7.7777777 shows up as '7.777'). The result is a 2-D array with rows like

['BRICK' '140.0' '125.0' ..., '7.777' '0.545' '-1.12']

With dtype=["|S5"] + ["float" for n in range(19)] you specify a dtype for each column, so the result is a structured array. It is 1-D with 20 fields. You access the fields by name (look at seg_data2.dtype), not by column number.

An element, or record, of this array is displayed as a tuple, and includes a string and 19 floats:

('BRICK', 140.0, 125.0, 9.0, 0.0, 0.0, 0.2777779, 0.06296301, 0.66666675, 0.31111118, 6.185185, 7.3333335, 7.6666665, 3.5555556, 3.4444444, 4.4444447, -7.888889, 7.7777777, 0.5456349, -1.1218182)

print(seg_data2['f0'])  # the initial character column
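The difference between indexing by field name and indexing by column can be seen on a small scale. The sketch below is only an illustration: the two three-column lines are a made-up stand-in for the 20-column segmentation file, and the per-column dtype is written as a comma-separated string rather than a list.

```python
import numpy as np

# Hypothetical stand-in for the real file: one label column, two float columns.
lines = [b"BRICK,140.0,125.0", b"CEMEN,128.0,161.0"]

# One dtype per column gives a structured array with auto-named fields.
arr = np.genfromtxt(lines, delimiter=",", dtype="S5,f8,f8")

print(arr.shape)        # a single axis: one record per input line
print(arr.dtype.names)  # the field names generated by genfromtxt
print(arr["f0"])        # the label column, selected by name
print(arr[0])           # one record

# Column-style slicing needs a second axis, which this array does not have.
try:
    arr[:, 0:1]
except IndexError as exc:
    print("IndexError:", exc)
```

The same pattern applies to seg_data2: the array has one axis of records, and each of the 20 columns is reached through its field name.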

Specifying dtype=None should produce the same kind of structured array, possibly with some integer columns instead of all floats.

It is also possible to specify a dtype with 2 fields, one for the string column and the other for the 19 floats. I'd have to check the docs and run a few test cases to be sure of the format.

I think you read enough of the genfromtxt docs to see that you could specify a compound dtype, but not enough to understand the results.

=================


Example of importing csv with text and numbers:

In [139]: txt=b"""one 1 2 3
     ...: two 4 5 6
     ...: """

Default: all floats (the text column cannot be parsed, so it becomes nan)

In [140]: np.genfromtxt(txt.splitlines())
Out[140]: 
array([[ nan,   1.,   2.,   3.],
       [ nan,   4.,   5.,   6.]])

Automatic dtype selection: 4 fields

In [141]: np.genfromtxt(txt.splitlines(),dtype=None)
Out[141]: 
array([(b'one', 1, 2, 3), (b'two', 4, 5, 6)], 
      dtype=[('f0', 'S3'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])

User-specified field dtypes (note that 'str' carries no length, so the text column comes back empty)

In [142]: np.genfromtxt(txt.splitlines(),dtype='str,int,float,int')
Out[142]: 
array([('', 1, 2.0, 3), ('', 4, 5.0, 6)], 
      dtype=[('f0', '<U'), ('f1', '<i4'), ('f2', '<f8'), ('f3', '<i4')])

Compound dtype, with a column count for the numeric field (and a correction to the string column)

In [145]: np.genfromtxt(txt.splitlines(),dtype='S5,(3)int')
Out[145]: 
array([(b'one', [1, 2, 3]), (b'two', [4, 5, 6])], 
      dtype=[('f0', 'S5'), ('f1', '<i4', (3,))])

In [146]: _['f0']
Out[146]: 
array([b'one', b'two'], 
      dtype='|S5')

In [149]: _['f1']
Out[149]: 
array([[1, 2, 3],
       [4, 5, 6]])

If you need to do math across the numeric fields, this last case (or something more elaborate) might be most convenient.
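As a rough sketch of why the compound form is convenient: the numeric field comes back as an ordinary 2-D array, so the label/data split the question asks for is just two field lookups. The names target and data are illustrative, and the subarray shape is spelled (3,) here, which is an equivalent way of writing the count.

```python
import numpy as np

txt = b"""one 1 2 3
two 4 5 6
"""

# One string field plus one 3-element integer field.
arr = np.genfromtxt(txt.splitlines(), dtype="S5,(3,)int")

target = arr["f0"]  # 1-D array of labels
data = arr["f1"]    # plain 2-D integer array, one row per record

print(data.shape)
print(data.sum(axis=1))    # total of each row
print(data.mean(axis=0))   # mean of each column
print(data.astype(float))  # ordinary dtype conversion works on the 2-D block
```

Because data is a regular ndarray, slicing such as data[:, 1:] behaves the way it does for seg_data1.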

To generate something more complicated it may be best to develop the dtype in a separate expression (dtype syntax can be tricky):

In [172]: dt=np.dtype([('f0','|S5'),('f1',[('f10',int),('f11',float,(2))])])

In [173]: np.genfromtxt(txt.splitlines(),dtype=dt)
Out[173]: 
array([(b'one', (1, [2.0, 3.0])), (b'two', (4, [5.0, 6.0]))], 
      dtype=[('f0', 'S5'), ('f1', [('f10', '<i4'), ('f11', '<f8', (2,))])])
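Applied to the question's layout (one label column followed by 19 floats), a separately built dtype might look like the sketch below. The field names label and features are hypothetical, and the two rows are synthetic values standing in for lines of the real file, which is not reproduced here.

```python
import numpy as np

# Hypothetical field names; the width of 19 floats comes from the question.
dt = np.dtype([("label", "S5"), ("features", float, (19,))])

# Synthetic rows standing in for lines of the real dataset.
numbers = ",".join(str(float(i)) for i in range(19))
rows = [("BRICK," + numbers).encode(), ("CEMEN," + numbers).encode()]

arr = np.genfromtxt(rows, delimiter=",", dtype=dt)

seg_target = arr["label"]   # 1-D array of class labels
seg_data = arr["features"]  # 2-D float array, one row per record

print(seg_target)
print(seg_data.shape)
```

This gives the label column and a float block in one read, without a later astype(float) call.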
