Map a NumPy array of strings to integers


Problem description

Problem:

Given an array of string data

dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')

I would like a function that returns the indexed dataset

indexed_dataSet = np.array([0, 1, 2, 0], dtype='int')

and a lookup table

lookupTable = np.array(['kevin', 'greg', 'george'], dtype='U21')

such that

(lookupTable[indexed_dataSet] == dataSet).all()

is true. Note that indexed_dataSet and lookupTable can both be permuted as long as the above still holds (i.e. the order of lookupTable does not have to match the order of first appearance in dataSet).

Slow solution:

I currently have the following slow solution

import numpy as np

def indexDataSet(dataSet):
    """Returns the indexed dataSet and a lookup table
       Input:
           dataSet         : A length n numpy array to be indexed
       Output:
           indexed_dataSet : A length n numpy array containing values in {0, ..., len(set(dataSet))-1}
           lookupTable     : A lookup table such that lookupTable[indexed_dataSet] == dataSet"""
    labels = set(dataSet)
    lookupTable = np.empty(len(labels), dtype='U21')
    indexed_dataSet = np.zeros(dataSet.size, dtype='int')
    # One full comparison pass over dataSet per unique label, which is why this is slow
    for count, label in enumerate(labels):
        indexed_dataSet[dataSet == label] = count
        lookupTable[count] = label

    return indexed_dataSet, lookupTable

Is there a quicker way to do this? I feel like I am not using numpy to its full potential here.

Recommended answer

You can use np.unique with the return_inverse argument:

>>> lookupTable, indexed_dataSet = np.unique(dataSet, return_inverse=True)
>>> lookupTable
array(['george', 'greg', 'kevin'], 
      dtype='<U21')
>>> indexed_dataSet
array([2, 1, 0, 2])

If you like, you can reconstruct your original array from these two arrays:

>>> lookupTable[indexed_dataSet]
array(['kevin', 'greg', 'george', 'kevin'], 
      dtype='<U21')
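
As a self-contained sketch, the np.unique call above can be wrapped in a function with the same return order as the asker's indexDataSet. The name index_dataset is only illustrative and does not come from the question or from NumPy:

```python
import numpy as np

def index_dataset(data):
    """Return (indexed, lookup) such that lookup[indexed] reproduces data."""
    # np.unique returns the sorted unique values; with return_inverse=True
    # it also returns, for each element of data, its index in that array.
    lookup, indexed = np.unique(data, return_inverse=True)
    return indexed, lookup

dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')
indexed_dataSet, lookupTable = index_dataset(dataSet)

# Round-trip check taken from the question
assert (lookupTable[indexed_dataSet] == dataSet).all()
```

Because np.unique sorts its result, lookupTable comes back in alphabetical order rather than order of first appearance, which the question explicitly allows.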

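If you use pandas, indexed_dataSet, lookupTable = pd.factorize(dataSet) will achieve the same thing (and potentially be more efficient for large arrays). Note that pd.factorize returns the integer codes first and the unique values second, the reverse of np.unique with return_inverse=True, and by default it lists the unique values in order of first appearance rather than sorted.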
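
A minimal sketch of the pandas variant, assuming pandas is installed and imported as pd. pd.factorize returns a (codes, uniques) pair, so the integer codes are unpacked first:

```python
import numpy as np
import pandas as pd

dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')

# factorize returns (codes, uniques): integer codes first, unique values second
indexed_dataSet, lookupTable = pd.factorize(dataSet)

# Same round-trip check as in the question
assert (lookupTable[indexed_dataSet] == dataSet).all()
```

With the default sort=False, the unique values appear in order of first appearance; passing sort=True should give sorted uniques, matching the ordering np.unique produces.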
