Applying string operations to numpy arrays?

Question

Are there better ways to apply string operations to ndarrays rather than iterating over them? I would like to use a "vectorized" operation, but I can only think of using map (example shown) or list comprehensions.

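import numpy

# zip() returns an iterator in Python 3, so wrap it in list() for fromrecords
Arr = numpy.rec.fromrecords(list(zip(range(5), 'as far as i know'.split())),
                            names='name, strings')

print(''.join(map(lambda x: x[0].upper() + '.', Arr['strings'])))
=> A.F.A.I.K.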

For instance, in the R language, string operations are also vectorized:

> (string <- unlist(strsplit("as far as i know"," ")))
[1] "as"   "far"  "as"   "i"    "know"
> paste(sprintf("%s.",toupper(substr(string,1,1))),collapse="")
[1] "A.F.A.I.K."

Recommended answer

Update: See Larsman's answer to this question: Numpy recently added a numpy.char module for basic string operations.
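As a rough illustration of the numpy.char module mentioned in the update, the acronym from the question can be built without an explicit Python loop. This is only a sketch: the variable names are made up here, and truncating with astype('U1') to grab the first character is an assumed approach, not something taken from the linked answer.

```python
import numpy as np

words = np.array('as far as i know'.split())

# Casting to a 1-character unicode dtype keeps only the first character.
initials = np.char.upper(words.astype('U1'))

# Element-wise concatenation; the scalar '.' is broadcast over the array.
dotted = np.char.add(initials, '.')

result = ''.join(dotted)
print(result)
```

The final join is still ordinary Python, because collapsing an array into a single string is not an element-wise operation.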

Short answer: Numpy doesn't provide vectorized string operations. The idiomatic way is to do something like (where Arr is your numpy array):

print('.'.join(item.upper() for item in Arr['strings']))

Long answer, here's why numpy doesn't provide vectorized string operations: (and a good bit of rambling in between)

One size doesn't fit all when it comes to data structures.

Your question probably seems odd to people coming from a non-domain-specific programming language, but it makes a lot of sense to people coming from a domain-specific language.

Python gives you a wide variety of choices of data structures. Some data structures are better at some tasks than others.

First off, numpy arrays aren't the default "hold-all" container in python. Python's builtin containers are very good at what they're designed for. Often, a list or a dict is what you want.

Numpy's ndarrays are for homogeneous data.

In a nutshell, numpy doesn't have vectorized string operations.

ndarrays are a specialized container focusing on storing N-dimensional homogeneous groups of items in the minimum amount of memory possible. The emphasis is really on minimizing memory usage (I'm biased, because that's mostly what I need them for, but it's a useful way to think of it). Vectorized mathematical operations are just a nice side effect of having things stored in a contiguous block of memory.
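To make the memory argument concrete, here is a small sketch comparing an ndarray with a plain list holding the same integers. The names are illustrative, and the list figure is only a rough estimate (it ignores that CPython shares small integer objects).

```python
import sys
import numpy as np

values = list(range(1000))
arr = np.array(values, dtype=np.int32)

# The array keeps 1000 items of 4 bytes each in one contiguous buffer.
array_bytes = arr.nbytes
is_contiguous = arr.flags['C_CONTIGUOUS']

# The list keeps pointers; every int is a separate Python object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

print(array_bytes, list_bytes, is_contiguous)
```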

Strings usually have different lengths.

E.g. ['Dog', 'Cat', 'Horse']. Numpy takes the database-like approach of requiring you to define a length for your strings, but the simple fact that strings aren't expected to be a fixed length has a lot of implications.
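A short sketch of what the fixed length means in practice, using the same three strings. The array name is illustrative; the width is inferred by numpy from the longest item at creation time.

```python
import numpy as np

animals = np.array(['Dog', 'Cat', 'Horse'])

# The dtype is a fixed-width unicode type sized for the longest item ('Horse').
print(animals.dtype)

# Assigning a longer string does not grow the array; the value is
# silently truncated to the fixed width of 5 characters.
animals[0] = 'Elephant'
print(animals[0])
```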

Most useful string operations return variable length strings. (e.g. '.'.join(...) in your example)

Those that don't (e.g. upper, etc) you can mimic with other operations if you want to. (E.g. upper is roughly (x.view(np.uint8) - 32).view('S1'). I don't recommend that you do that, but you can...)
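For the curious, the view trick above might look like the following. This is a sketch under strong assumptions: every element is a single lowercase ASCII byte ('S1'). Anything else (uppercase input, digits, padding bytes in longer strings) would be corrupted, which is a good reason to prefer np.char.upper.

```python
import numpy as np

letters = np.array([b'a', b's', b'k'], dtype='S1')

# Reinterpret each 1-byte string as an integer, shift lowercase ASCII down
# by 32 to reach uppercase, then reinterpret the bytes as strings again.
# Subtracting np.uint8(32) keeps the dtype at uint8 so the view back works.
shifted = (letters.view(np.uint8) - np.uint8(32)).view('S1')

# The numpy.char equivalent, which does not depend on the input being lowercase.
safe = np.char.upper(letters)

print(shifted.tolist(), safe.tolist())
```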

As a basic example: 'A' + 'B' yields 'AB'. 'AB' is not the same length as 'A' or 'B'. Numpy deals with other things that do this (e.g. np.uint8(4) + np.float64(3.4)), but strings are much more flexible in length than numbers. ("Upcasting" and "downcasting" rules for numbers are pretty simple.)
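The contrast can be sketched as follows. The numeric case always lands on one known type, while the width of a concatenated string array depends on both inputs. The example uses np.char.add from the module mentioned in the update; the array names are illustrative.

```python
import numpy as np

# Numeric upcasting: the result type is determined by the input types alone.
total = np.uint8(4) + np.float64(3.4)

# String concatenation: the result has to be wide enough for the combined
# widths of both inputs, so the output dtype differs from either input.
left = np.array(['A', 'AB'])      # 2 characters wide
right = np.array(['B', 'CDE'])    # 3 characters wide
joined = np.char.add(left, right)

print(type(total).__name__, joined.dtype, joined.tolist())
```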

Another reason numpy doesn't do it is that the focus is on numerical operations. 'A'**2 has no particular definition in python (You can certainly make a string class that does, but what should it be?). String arrays are second class citizens in numpy. They exist, but most operations aren't defined for them.
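A quick way to see this is to try the operation on a Python string, a numpy string array, and a numeric array. The helper try_square below is hypothetical and exists only for this illustration.

```python
import numpy as np

def try_square(value):
    """Return value ** 2, or the name of the TypeError raised instead."""
    try:
        return value ** 2
    except TypeError as exc:
        return type(exc).__name__

python_result = try_square('A')                  # plain Python string
numpy_result = try_square(np.array(['A', 'B']))  # numpy string array
number_result = try_square(np.array([2, 3]))     # numeric array

print(python_result, numpy_result, number_result)
```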

Python is already really good at handling string processing

The other (and really, the main) reason numpy doesn't try to offer string operations is that python is already really good at it.

Lists are fantastic flexible containers. Python has a huge set of very nice, very fast string operations. List comprehensions and generator expressions are fairly fast, and they don't suffer any overhead from trying to guess what the type or size of the returned item should be, as they don't care. (They just store a pointer to it.)

Also, iterating over numpy arrays in python is slower than iterating over a list or tuple in python, but for string operations, you're really best off just using the normal list/generator expressions (e.g. print('.'.join(item.upper() for item in Arr['strings'])) in your example). Better yet, don't use numpy arrays to store strings in the first place. It makes sense if you have a single column of a structured array with strings, but that's about it. Python gives you very rich and flexible data structures. Numpy arrays aren't the be-all and end-all, and they're a specialized case, not a generalized case.
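A minimal sketch of that advice, producing the acronym from the question with a plain list, plus a variant that converts an existing array with tolist() first. The variable names are illustrative.

```python
import numpy as np

words = 'as far as i know'.split()

# Generator expression over a plain list: no numpy involved.
acronym = ''.join(word[0].upper() + '.' for word in words)

# If the strings already live in an array, convert once to a list of
# Python strings and iterate over that instead of over the array itself.
arr = np.array(words)
acronym_from_array = ''.join(w[0].upper() + '.' for w in arr.tolist())

print(acronym, acronym_from_array)
```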

Also, keep in mind that most of what you'd want to do with a numpy array

Learn Python, not just Numpy

I'm not trying to be cheeky here, but working with numpy arrays is very similar to a lot of things in Matlab or R or IDL, etc.

It's a familiar paradigm, and anyone's first instinct is to try to apply that same paradigm to the rest of the language.

Python is a lot more than just numpy. It's a multi-paradigm language, so it's easy to stick to the paradigms that you're already used to. Try to learn to "think in python" as well as just "thinking in numpy". Numpy provides a specific paradigm to python, but there's a lot more there, and some paradigms are a better fit for some tasks than others.

Part of this is becoming familiar with the strengths and weaknesses of different data containers (lists vs dicts vs tuples, etc), as well as different programming paradigms (e.g. object-oriented vs functional vs procedural, etc).

All in all, python has several different types of specialized data structures. This makes it somewhat different from domain-specific languages like R or Matlab, which have a few types of data structures, but focus on doing everything with one specific structure. (My experience with R is limited, so I may be wrong there, but that's my impression of it, anyway. It's certainly true of Matlab, anyway.)

At any rate, I'm not trying to rant here, but it took me quite a while to stop writing Fortran in Matlab, and it took me even longer to stop writing Matlab in python. This rambling answer is very short on concrete examples, but hopefully it makes at least a little bit of sense, and helps somewhat.
