Extract the first letter from each string in a numpy array

Problem description

I have a huge numpy array whose elements are strings. I would like to replace each string with its first letter. For example, if

C[0] = 'A90CD'

I want to replace it with

C[0] = 'A'

In a nutshell, I was thinking of applying a regex in a loop, using a dictionary of regex patterns such as

'^A.+$' => 'A'

'^B.+$' => 'B' etc

How can I apply this regex over the numpy array? Or is there a better method to achieve the same result?

Recommended answer

There's no need for regex here. Just convert your array to a one-character string dtype using astype, which truncates each string to its first character:

>>> import numpy as np
>>> v = np.array(['abc', 'def', 'ghi'])

>>> v.astype('<U1')
array(['a', 'd', 'g'],
      dtype='<U1')

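Alternatively, you can change its view and take a strided slice. Viewing the array as '<U1' exposes every character as a separate element, so stepping by the string length picks the first character of each string. Here's a slightly optimised version for equal-sized strings: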

>>> v.view('<U1')[::len(v[0])]
array(['a', 'd', 'g'],
      dtype='<U1')

And here's a more general version of the .view method, which also works for arrays of strings with differing lengths. Thanks to Paul Panzer for the suggestion:

>>> v.view('<U1').reshape(v.shape + (-1,))[:, 0]
array(['a', 'd', 'g'],
      dtype='<U1')
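As a minimal sketch of why the reshape version copes with differing lengths (the array `words` and its contents are made up for illustration): numpy stores every element at the width of the longest string, padding shorter ones with null characters, so the character view always reshapes into a regular 2-D grid.

```python
import numpy as np

# Hypothetical sample data; numpy stores all elements at the width
# of the longest string, so the dtype here is '<U5'.
words = np.array(['A90CD', 'B7', 'Cxyz'])

# View every character as its own element, then give each string one row.
chars = words.view('<U1').reshape(words.shape + (-1,))

# Column 0 holds the first character of each string.
first = chars[:, 0]

print(chars.shape)  # (3, 5)
print(first)        # ['A' 'B' 'C']
```

This assumes a contiguous 1-D array of fixed-width unicode strings; an empty string would yield an empty first character.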

Performance

y = np.array([x * 20 for x in v]).repeat(100000)

y.shape
(300000,)

len(y[0])   # they're all the same length - `abcabcabc...`
60

Now, the timings:

# `astype` conversion

%timeit y.astype('<U1')
100 loops, best of 3: 5.03 ms per loop

# `view` for equal sized string arrays 

%timeit y.view('<U1')[::len(y[0])]
100000 loops, best of 3: 2.43 µs per loop

# Paul Panzer's version for differing length strings

%timeit y.view('<U1').reshape(y.shape + (-1,))[:, 0]
100000 loops, best of 3: 3.1 µs per loop

The view method is faster by a huge margin, because it only reinterprets the existing buffer instead of copying data.

However, use it with caution, as the result shares memory with the original array.
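To make the shared-memory caveat concrete, here is a small sketch (the variable names `first_view` and `first_copy` are only illustrative) comparing the view-based result with the astype result after the original array is modified:

```python
import numpy as np

v = np.array(['abc', 'def', 'ghi'])

# View-based extraction: no data is copied.
first_view = v.view('<U1')[::len(v[0])]

# astype-based extraction: returns an independent copy by default.
first_copy = v.astype('<U1')

# Changing the original array shows through the view, not the copy.
v[0] = 'xyz'

print(first_view)  # ['x' 'd' 'g']
print(first_copy)  # ['a' 'd' 'g']
print(np.shares_memory(first_view, v))  # True
print(np.shares_memory(first_copy, v))  # False
```

If an independent result is needed from the view approach, calling `.copy()` on it should detach it from the original buffer, at the cost of the copy.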

If you're interested in a more general solution that finds the first letter (regardless of where it is in the string), I'd say the fastest/easiest way would be to use the re module: compile a pattern and search inside a list comprehension.

>>> import re
>>> p = re.compile('[a-zA-Z]')
>>> [p.search(x).group() for x in v]
['a', 'd', 'g']

And its performance on the same setup as above:

%timeit [p.search(x).group() for x in y]
1 loop, best of 3: 320 ms per loop
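One thing to watch with the regex approach: `search` returns `None` when a string contains no letter, so calling `.group()` directly would raise an `AttributeError`. A possible guard is sketched below; the helper `first_letter` and its `default` parameter are hypothetical names, not part of the original answer.

```python
import re
import numpy as np

pattern = re.compile('[a-zA-Z]')

def first_letter(s, default=''):
    """Return the first ASCII letter in s, or default if there is none."""
    match = pattern.search(s)
    return match.group() if match else default

# Hypothetical sample data, including a string with no letters at all.
v = np.array(['90A_CD', '123', 'b7'])

result = np.array([first_letter(x) for x in v])
print(result)  # ['A' '' 'b']
```

Returning an empty string for the no-letter case is just one choice; raising an error or using a sentinel value may suit other data better.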
