Variable/unknown length string/unicode dtype in numpy


Question

Is it possible to load an array that has a text field of unknown length?

I figured out how to pass a dtype so that strings are loaded. However, without specifying a length I just get U0, a type that seems unable to hold any data. For example:

>>> import io
>>> import numpy
>>> data = io.StringIO("test data lololol\ntest2 d4t4 ololol")
>>> ar = numpy.loadtxt(data, dtype=[("1",str), ("2",'S'), ("3",'S')])
>>> ar
array([('', b'', b''), ('', b'', b'')],
      dtype=[('1', '<U0'), ('2', '|S0'), ('3', '|S0')])

When I switch to a dtype with a specified size, the input does come through:

>>> data.seek(0)
0
>>> numpy.loadtxt(data, dtype=[("1",(str,30)), ("2",(str,30)), ("3",('S',30))])
array([("b'test'", "b'data'", b'lololol'),
       ("b'test2'", "b'd4t4'", b'ololol')], 
      dtype=[('1', '<U30'), ('2', '<U30'), ('3', '|S30')])

I'd probably be fine with either S or U. In my case the field is supposed to hold a set of textual flags, something like Linux environment variables. Preallocating a large space just in case therefore seems like a big waste, especially when the number of rows goes into the millions.

I do understand, or at least can guess, where such a design comes from: constructing a struct-like object that holds the whole row in one contiguous memory block. However, I thought there might be a way to make it keep something like a pointer in the case of strings.

Is that possible?

Recommended answer

The question "getting indices in numpy" uses np.recfromtxt, which can generate the dtype automatically. Effectively, it calls np.genfromtxt with dtype=None.

With data like:

david weight_2005 50
david weight_2012 60
david height_2005 150
david height_2012 160

it produces:

array([('david', 'weight_2005', 50), ('david', 'weight_2012', 60),
       ('david', 'height_2005', 150), ('david', 'height_2012', 160),...], 
      dtype=[('f0', 'S5'), ('f1', 'S11'), ('f2', '<i4')])

The code in genfromtxt that determines the dtype looks complex. My guess is that it adjusts the Snn to accommodate the longest string it encounters in that field.
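The guess above can be checked with a small sketch. It assumes a NumPy version whose genfromtxt accepts the encoding argument, and it reads the sample rows from an in-memory io.StringIO instead of a file. Because encoding="utf-8" is passed, the string fields come back as U (unicode) rather than S (bytes), but the sizing behaviour is the same: each string field is exactly as wide as the longest value seen in that column.

```python
import io
import numpy as np

# Sample rows from the answer, held in memory instead of a file.
text = (
    "david weight_2005 50\n"
    "david weight_2012 60\n"
    "david height_2005 150\n"
    "david height_2012 160\n"
)

# dtype=None asks genfromtxt to infer a type for each column.
# encoding="utf-8" (an assumption of this sketch) yields str fields.
arr = np.genfromtxt(io.StringIO(text), dtype=None, encoding="utf-8")

# Field widths match the longest string in each column:
# "david" has 5 characters, "weight_2005" has 11.
print(arr.dtype)
print(arr["f1"])
```

The inferred widths are still fixed per column, so one unusually long value makes every row pay for it.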

One way to customize the dtype is to assign names in genfromtxt and then recast the values with astype.

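import numpy as np

# Let genfromtxt infer the field types, but supply the field names
x = np.genfromtxt('stack19944408.txt', dtype=None, names=['one','two','thr'])
# Recast to the desired dtype; strings longer than 10 bytes are truncated
x.astype(dtype=[('one','S10'),('two','S10'),('thr','f')])
#array([('david', 'weight_200', 50.0), ('david', 'weight_201', 60.0),
#       ...
#      dtype=[('one', 'S10'), ('two', 'S10'), ('thr', '<f4')])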
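
On the question's idea of keeping "something like a pointer" for strings: this is not part of the answer above, but NumPy's object dtype ('O') behaves roughly that way, since each element stores a reference to an ordinary Python str of any length. The following is only a sketch under that assumption. It parses the lines by hand with str.split rather than through loadtxt or genfromtxt, and it reuses the question's field names "1", "2" and "3".

```python
import io
import numpy as np

data = io.StringIO("test data lololol\ntest2 d4t4 ololol")

# Split each line on whitespace by hand; one tuple per row.
rows = [tuple(line.split()) for line in data]

# Every field is an object field: the array stores one reference per
# field, and the Python str objects themselves live outside the array.
ar = np.array(rows, dtype=[("1", "O"), ("2", "O"), ("3", "O")])

print(ar["3"])            # strings keep their full length
print(ar.dtype.itemsize)  # three references per row, whatever the string lengths
```

The trade-off is that object fields hold generic Python objects, so the fixed-width string operations and the compact contiguous layout of S and U fields are lost.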
