Map a NumPy array of strings to integers



Problem:

Given an array of string data

dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')

I would like a function that returns the indexed dataset

indexed_dataSet = np.array([0, 1, 2, 0], dtype='int')

and a lookup table

lookupTable = np.array(['kevin', 'greg', 'george'], dtype='U21')

such that

(lookupTable[indexed_dataSet] == dataSet).all()

is true. Note that indexed_dataSet and lookupTable may both be permuted, as long as the above still holds (i.e. the order of lookupTable does not need to match the order of first appearance in dataSet).

Slow Solution:

I currently have the following slow solution

import numpy as np

def indexDataSet(dataSet):
    """Returns the indexed dataSet and a lookup table
       Input:
           dataSet         : A length n numpy array to be indexed
       Output:
           indexed_dataSet : A length n numpy array containing values in {0, ..., len(set(dataSet))-1}
           lookupTable     : A lookup table such that lookupTable[indexed_dataSet] == dataSet"""
    labels = set(dataSet)
    lookupTable = np.empty(len(labels), dtype='U21')
    indexed_dataSet = np.zeros(dataSet.size, dtype='int')
    # One full comparison pass over dataSet per distinct label, which is why this is slow
    for count, label in enumerate(labels):
        indexed_dataSet[dataSet == label] = count
        lookupTable[count] = label

    return indexed_dataSet, lookupTable

Is there a quicker way to do this? I feel like I am not using numpy to its full potential here.

Solution:

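You can use np.unique with the return_inverse argument. It returns the sorted unique values together with, for each element of the input, the index of its value in that sorted array: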

>>> lookupTable, indexed_dataSet = np.unique(dataSet, return_inverse=True)
>>> lookupTable
array(['george', 'greg', 'kevin'], 
      dtype='<U21')
>>> indexed_dataSet
array([2, 1, 0, 2])

If you like, you can reconstruct your original array from these two arrays:

>>> lookupTable[indexed_dataSet]
array(['kevin', 'greg', 'george', 'kevin'], 
      dtype='<U21')
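
As a rough sketch, the np.unique call can be wrapped in a drop-in replacement for indexDataSet. The function names index_data_set and index_data_set_first_seen are hypothetical and chosen only for illustration. The second variant is optional: it assumes you want lookupTable in order of first appearance, which the question does not require, and it gets there with the return_index argument of np.unique.

```python
import numpy as np


def index_data_set(data_set):
    """Return (indexed_data_set, lookup_table); lookup_table is sorted."""
    lookup_table, indexed = np.unique(data_set, return_inverse=True)
    return indexed, lookup_table


def index_data_set_first_seen(data_set):
    """Same contract, but lookup_table follows order of first appearance."""
    lookup_table, first_idx, inverse = np.unique(
        data_set, return_index=True, return_inverse=True
    )
    # order[k] = position (in the sorted table) of the k-th label to appear
    order = np.argsort(first_idx)
    # rank[j] = first-appearance rank of the label at sorted position j
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)
    return rank[inverse], lookup_table[order]


data_set = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')

for fn in (index_data_set, index_data_set_first_seen):
    indexed, table = fn(data_set)
    # The round-trip property from the question holds for both variants
    assert (table[indexed] == data_set).all()
```

Both variants avoid the Python-level loop over labels. The first-appearance variant only adds an argsort over the unique labels on top of the single np.unique call.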

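If you use pandas, indexed_dataSet, lookupTable = pd.factorize(dataSet) will achieve the same thing (and potentially be more efficient for large arrays). Note that pd.factorize returns the integer codes first and the unique values second, and that the unique values come back in order of first appearance rather than sorted.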
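
A minimal sketch of the pandas route, assuming pandas is installed alongside NumPy. The variable names are illustrative, and no claim is made here about the dtype of the returned label array.

```python
import numpy as np
import pandas as pd

data_set = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')

# pd.factorize returns (codes, uniques): integer codes first, labels second
indexed_data_set, lookup_table = pd.factorize(data_set)

# Labels are listed in order of first appearance, not sorted
print(indexed_data_set)
print(lookup_table)

# Same round-trip property as in the question
assert (lookup_table[indexed_data_set] == data_set).all()
```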
