numpy recarray strings of variable length

Question

Is it possible to initialise a numpy recarray that will hold strings, without knowing the length of the strings beforehand?

As a (contrived) example:

mydf = np.empty( (numrows,), dtype=[ ('file_name','STRING'), ('file_size_MB',float) ] )

The problem is that I'm constructing my recarray in advance of populating it with information, and I don't necessarily know the maximum length of file_name in advance.

All my attempts result in the string field being truncated:

>>> mydf = np.empty( (2,), dtype=[('file_name',str),('file_size_mb',float)] )
>>> mydf['file_name'][0]='foobarasdf.tif'
>>> mydf['file_name'][1]='arghtidlsarbda.jpg'
>>> mydf
array([('', 6.9164002347457e-310), ('', 9.9413127e-317)], 
      dtype=[('file_name', 'S'), ('file_size_mb', '<f8')])
>>> mydf['file_name']
array(['f', 'a'], 
      dtype='|S1')

(As an aside, why does mydf['file_name'] show 'f' and 'a' whilst mydf shows '' and ''?)

Similarly, if I initialise with type (say) |S10 for file_name then things get truncated at length 10.
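The truncation described above can be reproduced with a minimal sketch. It assumes Python 3 and uses the fixed-width unicode type `U10` in place of `|S10`; the array name `files` is only illustrative.

```python
import numpy as np

# A fixed-width string field: every element gets exactly 10 characters of storage.
files = np.empty((2,), dtype=[("file_name", "U10"), ("file_size_mb", float)])
files["file_name"][0] = "foobarasdf.tif"
files["file_name"][1] = "arghtidlsarbda.jpg"

# Anything beyond the declared width is silently dropped on assignment.
print(files["file_name"])

# Storage per element is fixed by the dtype (4 bytes per character for 'U').
print(files.dtype["file_name"].itemsize)
```

No error or warning is raised when the extra characters are dropped, which is why the truncation is easy to miss.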

The only similar question I could find is this one, but this calculates the appropriate string length a priori and hence is not quite the same as mine (as I know nothing in advance).

Is there any alternative other than initialising the file_name with (e.g.) |S9999999999999 (i.e. some ridiculous upper limit)?

Recommended answer

Instead of using the STRING dtype, one can always use object as dtype. That will allow any object to be assigned to an array element, including Python variable length strings. For example:

>>> import numpy as np
>>> mydf = np.empty( (2,), dtype=[('file_name',object),('file_size_mb',float)] )
>>> mydf['file_name'][0]='foobarasdf.tif'
>>> mydf['file_name'][1]='arghtidlsarbda.jpg'
>>> mydf
array([('foobarasdf.tif', 0.0), ('arghtidlsarbda.jpg', 0.0)], 
      dtype=[('file_name', '|O8'), ('file_size_mb', '<f8')])

It is against the spirit of the array concept to have variable-length elements, but this is as close as one can get. The idea of an array is that elements are stored in memory at well-defined and regularly spaced addresses, which prohibits variable-length elements. By storing pointers to the strings in the array, rather than the strings themselves, one can circumvent this limitation. (This is basically what the above example does.)
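The pointer-storage idea can be checked directly. The following is a rough sketch, assuming Python 3 and a standard CPython build of numpy; the names `records` and `rec` and the sample file names are made up for illustration.

```python
import numpy as np

# The object field stores a reference to a Python string, not the characters.
records = np.empty((2,), dtype=[("file_name", object), ("file_size_mb", float)])
records["file_name"][0] = "foobarasdf.tif"
records["file_name"][1] = "a_much_longer_file_name_than_before.jpg"
records["file_size_mb"] = [1.5, 20.0]

# Each element of the object field occupies one pointer, whatever the string length.
name_itemsize = records.dtype["file_name"].itemsize
pointer_size = np.dtype(np.intp).itemsize
print(name_itemsize, pointer_size)

# Viewing the structured array as a recarray gives attribute-style field access.
rec = records.view(np.recarray)
print(rec.file_name[1], rec.file_size_mb[1])
```

The trade-off is that operations on an object field work on Python objects one at a time, so they do not get the speed of numpy's fixed-width string or numeric types.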
