Create a numpy structured array from a list of strings

Question

I am working on a python utility to get data from the Tycho 2 star catalogue. One of the functions I am working on queries the catalogue and returns all the information for a given star id (or set of star ids).

I'm currently doing this by looping through the lines of the catalogue file and then attempting to parse the line into a numpy structured array if it was queried. (note if there is a better way to do this you can let me know even though this is not what this question is about -- I'm doing it this way because the catalogue is too big to load all of it into memory at one time)

Anyway, once I have identified a record that I want to keep I've run into a problem... I can't figure out how to parse it into a structured array.

For instance, say the record I want to keep is:

record = '0002 00038 1| |  3.64121230|  1.08701186|   14.1|  -23.0| 69| 82| 1.8| 1.9|1968.56|1957.30| 3|1.0|3.0|0.9|3.0|12.444|0.213|11.907|0.189|999| |         |  3.64117944|  1.08706861|1.83|1.73| 81.0|104.7| | 0.0'

Now, I am trying to parse this into a numpy structured array with dtype:

    dform = [('starid', [('TYC1', int), ('TYC2', int), ('TYC3', int)]),
             ('pflag', str),
             ('starBearing', [('rightAscension', float), ('declination', float)]),
             ('properMotion', [('rightAscension', float), ('declination', float)]),
             ('uncertainty', [('rightAscension', int), ('declination', int), ('pmRA', float), ('pmDc', float)]),
             ('meanEpoch', [('rightAscension', float), ('declination', float)]),
             ('numPos', int),
             ('fitGoodness', [('rightAscension', float), ('declination', float), ('pmRA', float), ('pmDc', float)]),
             ('magnitude', [('BT', [('mag', float), ('err', float)]), ('VT', [('mag', float), ('err', float)])]),
             ('starProximity', int),
             ('tycho1flag', str),
             ('hipparcosNumber', str),
             ('observedPos', [('rightAscension', float), ('declination', float)]),
             ('observedEpoch', [('rightAscension', float), ('declination', float)]),
             ('observedError', [('rightAscension', float), ('declination', float)]),
             ('solutionType', str),
             ('correlation', float)]

This seems like it should be a fairly simple thing to do but everything I try breaks...

I tried:

np.genfromtxt(BytesIO(record.encode()),dtype=dform,delimiter=(' ','|'))
np.genfromtxt(BytesIO(record.encode()),dtype=dform,delimiter=(' ','|'),missing_values=' ',filling_values=None)

Both of these give me

{TypeError}cannot perform accumulate with flexible type

which makes no sense since it shouldn't be doing any accumulation.

I also tried

np.array(re.split('\|| ',record),dtype=dform)

which complains

{TypeError}a bytes-like object is required, not 'str'

and another variant

np.array([x.encode() for x in re.split('\|| ',record)],dtype=dform)

which doesn't throw an error but also certainly doesn't return the correct results:

[ ((842018864, 0, 0), '', (0.0, 0.0), (0.0, 0.0), (0, 0, 0.0, 0.0), (0.0, 0.0), 0, (0.0, 0.0, 0.0, 0.0), ((0.0, 0.0), (0.0, 0.0)), 0, '', '', (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), '', 0.0)...

So how can I do this? I think the genfromtxt option is the way to go (especially since there may be missing data occasionally) but I don't understand why it isn't working. Is this something that I'm just going to have to write a parser for on my own?

Recommended answer

Sorry, this answer is long and rambling, but that's what it took to figure out what is going on. The complexity of the dtype in particular was hidden by its length.

I get the TypeError: cannot perform accumulate with flexible type error when I try your list for delimiter. The details show the error occurs in LineSplitter. Without getting into details, the delimiter should be one character (or the default 'whitespace').

From the genfromtxt docs:

delimiter : str, int, or sequence, optional
    The string used to separate values. By default, any consecutive whitespaces act as delimiter. An integer or sequence of integers can also be provided as width(s) of each field.

The genfromtxt splitter is a little more powerful than the string .split that loadtxt uses, but not as general as the re splitter.
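
A small sketch of the two delimiter forms the docs describe, using made-up sample lines rather than Tycho data. It may also explain the accumulate error: a sequence passed as delimiter seems to be read as field widths, which are then cumulatively summed, and that cannot work for ' ' and '|'. That reading of the internals is an assumption based on the error message and the LineSplitter traceback, not something the docs state.

```python
import numpy as np

# Made-up sample rows split on one single-character delimiter.
delimited = ["1|2.5|7", "4|5.5|9"]
by_char = np.genfromtxt(delimited, delimiter="|")

# The same values laid out in fixed-width columns of 3, 5 and 3 characters.
fixed = ["  1  2.5  7", "  4  5.5  9"]
by_width = np.genfromtxt(fixed, delimiter=(3, 5, 3))

print(by_char.shape, by_width.shape)
```

Passing a list of str assumes a reasonably recent numpy; older versions wanted bytes, as in the rest of this answer.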

As for the {TypeError}a bytes-like object is required, not 'str': you specify dtype str for a couple of the fields. That's a byte string, whereas your record is a unicode string (in Py3). But you've already realized that with BytesIO(record.encode()).

I like to test genfromtxt cases with:

record = b'....'
np.genfromtxt([record], ....)

or even better

records = b"""one line
two line
three line
"""
np.genfromtxt(records.splitlines(), ....)

If I let genfromtxt deduce field types, and just use the one delimiter, I get 32 fields:

In [19]: A=np.genfromtxt([record],dtype=None,delimiter='|')
In [20]: len(A.dtype)
Out[20]: 32
In [21]: A
Out[21]: 
array((b'0002 00038 1', False, 3.6412123, 1.08701186, 14.1, -23.0, 69, 82, 1.8, 1.9, 1968.56, 1957.3, 3, 1.0, 3.0, 0.9, 3.0, 12.444, 0.213, 11.907, 0.189, 999, False, False, 3.64117944, 1.08706861, 1.83, 1.73, 81.0, 104.7, False, 0.0), 
      dtype=[('f0', 'S12'), ('f1', '?'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8'), ... ('f26', '<f8'), ('f27', '<f8'), ('f28', '<f8'), ('f29', '<f8'), ('f30', '?'), ('f31', '<f8')])

When we get the whole byte and delimiter issues worked out

np.array([x for x in re.split(b'\|| ',record)],dtype=dform)

does run. I now see that your dform is complex, with nested compound fields.

But to define a structured array, you have to give it a list of records (tuples), e.g.

np.array([(record1...), (record2...), ....], dtype([(field1),(field2 ),...]))

Here you are trying to create one record. I could wrap your list in a tuple, but then I get a mismatch between that length and dform length, 66 v 17. If you count all the subfields dform might take 66 values, but we can't just do that with one tuple.
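
To make the "list of records" idea concrete, here is a minimal sketch with a cut-down dtype. The two-field dtype and the values are made up for illustration; the point is only that each record is a tuple, and each compound field needs its own nested tuple.

```python
import numpy as np

# Hypothetical reduced dtype: one compound field and one scalar field.
dt = [("starid", [("TYC1", int), ("TYC2", int)]), ("mag", float)]

# One tuple per record; the compound 'starid' field gets a nested tuple.
arr = np.array([((2, 38), 12.444), ((3, 40), 11.907)], dtype=dt)

print(arr.shape)
print(arr["starid"]["TYC2"])
```

A flat tuple such as (2, 38, 12.444) has three items for a two-field dtype, which is the same kind of length mismatch as 66 v 17 above.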

I've never tried to create an array from such a complex dtype, so I'm fishing around for ways to make it work.

In [41]: np.zeros((1,),dform)
Out[41]: 
array([ ((0, 0, 0), '', (0.0, 0.0), (0.0, 0.0), (0, 0, 0.0, 0.0), (0.0, 0.0), 0, (0.0, 0.0, 0.0, 0.0), ((0.0, 0.0), (0.0, 0.0)), 0, '', '', (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), '', 0.0)], 
      dtype=[('starid', [('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]), ('pflag', '<U'), ('starBearing', [('rightAscension', '<f8'), ('declination', '<f8')]), ('properMotion', [('rightAscension', '<f8'), ('declination', '<f8')]), ('uncertainty', [('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]), ('meanEpoch', ....('solutionType', '<U'), ('correlation', '<f8')])

In [64]: for name in A.dtype.names:
    print(A[name].dtype)
   ....:     
[('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]
<U1
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
int32
[('rightAscension', '<f8'), ('declination', '<f8'), ('pmRA', '<f8'), ('pmDc', '<f8')]
[('BT', [('mag', '<f8'), ('err', '<f8')]), ('VT', [('mag', '<f8'), ('err', '<f8')])]
int32
<U1
<U1
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
<U1
float64

I count 34 primitive dtype fields. Most are 'scalar', some 2-4 terms, one has a further level of nesting.

If I replace the first 2 splitting spaces with |, record.split(b'|') gives me 34 strings.
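
One way to do that replacement, as a sketch: bytes.replace with a count of 2 only touches the first two spaces, which in this record are the ones inside the star id. It assumes the id always comes first and always contains exactly two separating spaces.

```python
# The record from the question, as bytes.
record = b'0002 00038 1| |  3.64121230|  1.08701186|   14.1|  -23.0| 69| 82| 1.8| 1.9|1968.56|1957.30| 3|1.0|3.0|0.9|3.0|12.444|0.213|11.907|0.189|999| |         |  3.64117944|  1.08706861|1.83|1.73| 81.0|104.7| | 0.0'

# Turn only the first two spaces into '|' so TYC1, TYC2 and TYC3 become separate columns.
piped = record.replace(b' ', b'|', 2)
fields = piped.split(b'|')

print(len(fields))   # 34
print(fields[:3])    # [b'0002', b'00038', b'1']
```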

Let's try that in genfromtxt:

In [79]: np.genfromtxt([record],delimiter='|',dtype=dform)
Out[79]: 
array(((2, 38, 1), '', (3.6412123, 1.08701186), (14.1, -23.0), 
   (69, 82, 1.8, 1.9), (1968.56, 1957.3), 3, (1.0, 3.0, 0.9, 3.0),
   ((12.444, 0.213), (11.907, 0.189)), 999, '', '', 
   (3.64117944, 1.08706861), (1.83, 1.73), (81.0, 104.7), '', 0.0), 
      dtype=[('starid', [('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]), 
 ('pflag', '<U'), 
 ('starBearing', [('rightAscension', '<f8'), ('declination', '<f8')]),  
 ('properMotion', [('rightAscension', '<f8'), ('declination', '<f8')]),
 ('uncertainty', [('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]), 
 ('meanEpoch', [('rightAscension', '<f8'), ('declination', '<f8')]),   
 ('numPos', '<i4'), 
 ('fitGoodness', [('rightAscension', '<f8'), ('declination', '<f8'), ('pmRA', '<f8'), ('pmDc', '<f8')]), 
 ('magnitude', [('BT', [('mag', '<f8'), ('err', '<f8')]), ('VT', [('mag', '<f8'), ('err', '<f8')])]), 
 ('starProximity', '<i4'), ('tycho1flag', '<U'), ('hipparcosNumber', '<U'), 
 ('observedPos', [('rightAscension', '<f8'), ('declination', '<f8')]),
 ('observedEpoch', [('rightAscension', '<f8'), ('declination', '<f8')]), 
 ('observedError', [('rightAscension', '<f8'), ('declination', '<f8')]), ('solutionType', '<U'), ('correlation', '<f8')])

That almost looks reasonable. genfromtxt can actually split the values up among the compound fields. That's more than what I'd want to try with np.array().

So if you get the delimiters and byte/unicode worked out, genfromtxt can handle this mess.
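
If genfromtxt turns out to be awkward, a hand-written parser is also possible. The sketch below is not what this answer recommends, just an illustration under some assumptions: it handles only the first four columns, the names head_dtype, to_float and parse_head are made up, and the sample line uses 'X' as the pflag so there is something to see. The string field gets an explicit width ('U1'); the empty strings in the output above may come from the unsized str dtype, but that is a guess.

```python
import numpy as np

# Hypothetical reduced dtype: leading columns only, string field with explicit width.
head_dtype = [
    ("starid", [("TYC1", int), ("TYC2", int), ("TYC3", int)]),
    ("pflag", "U1"),
    ("starBearing", [("rightAscension", float), ("declination", float)]),
]

def to_float(text):
    """Convert a column to float, mapping a blank column to NaN."""
    text = text.strip()
    return float(text) if text else float("nan")

def parse_head(record):
    """Parse the leading columns of one '|'-delimited record given as str."""
    cols = record.split("|")
    tyc1, tyc2, tyc3 = (int(part) for part in cols[0].split())
    # Nest the tuple the same way the dtype is nested.
    nested = ((tyc1, tyc2, tyc3), cols[1].strip(),
              (to_float(cols[2]), to_float(cols[3])))
    return np.array(nested, dtype=head_dtype)

row = parse_head("0002 00038 1|X|  3.64121230|  1.08701186|   14.1")
print(row["starid"]["TYC2"], row["pflag"], row["starBearing"]["declination"])
```

Extending this to the full dform means writing one conversion per column, which is exactly the bookkeeping genfromtxt does for you.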
