Map a NumPy array of strings to integers
Question
Problem:
Given an array of string data
dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21'),
I would like a function that returns the indexed dataset
indexed_dataSet = np.array([0, 1, 2, 0], dtype='int')
and a lookup table
lookupTable = np.array(['kevin', 'greg', 'george'], dtype='U21')
such that
(lookupTable[indexed_dataSet] == dataSet).all()
is true. Note that indexed_dataSet and lookupTable can both be permuted, as long as the above still holds (i.e. the order of lookupTable does not have to match the order of first appearance in dataSet).
Slow solution:
I currently have the following slow solution
import numpy as np

def indexDataSet(dataSet):
    """Returns the indexed dataSet and a lookup table
    Input:
        dataSet : A length n numpy array to be indexed
    Output:
        indexed_dataSet : A length n numpy array containing values in {0, ..., len(set(dataSet))-1}
        lookupTable : A lookup table such that lookupTable[indexed_dataSet] == dataSet"""
    labels = set(dataSet)
    lookupTable = np.empty(len(labels), dtype='U21')
    indexed_dataSet = np.zeros(dataSet.size, dtype='int')
    # One full pass over dataSet per distinct label, which is why this is slow
    for count, label in enumerate(labels):
        indexed_dataSet[dataSet == label] = count
        lookupTable[count] = label
    return indexed_dataSet, lookupTable
Is there a quicker way to do this? I feel like I am not using numpy to its full potential here.
Recommended answer
You can use np.unique with the return_inverse argument:
>>> lookupTable, indexed_dataSet = np.unique(dataSet, return_inverse=True)
>>> lookupTable
array(['george', 'greg', 'kevin'],
      dtype='<U21')
>>> indexed_dataSet
array([2, 1, 0, 2])
If you like, you can reconstruct your original array from these two arrays:
>>> lookupTable[indexed_dataSet]
array(['kevin', 'greg', 'george', 'kevin'],
      dtype='<U21')
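As a minimal sketch, the np.unique approach can be wrapped in a function with the same return order as the question's indexDataSet. The name index_dataset is hypothetical and chosen only for illustration. np.unique returns the unique values sorted, so the labels differ from the ones shown in the question, which the question explicitly allows.

```python
import numpy as np

def index_dataset(data_set):
    """Return (indexed, lookup) such that lookup[indexed] reproduces data_set.

    index_dataset is a hypothetical helper name, not part of NumPy.
    """
    # np.unique returns the sorted unique values; return_inverse=True also
    # returns, for each element of data_set, its position in that sorted array.
    lookup, indexed = np.unique(data_set, return_inverse=True)
    return indexed, lookup

data_set = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')
indexed, lookup = index_dataset(data_set)

# The round-trip condition from the question
round_trip_ok = bool((lookup[indexed] == data_set).all())
print(indexed, lookup, round_trip_ok)
```

This replaces the Python-level loop over labels with a single call, so the work is done inside NumPy rather than once per distinct label.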
If you use pandas, indexed_dataSet, lookupTable = pd.factorize(dataSet)
will achieve the same thing (and potentially be more efficient for large arrays). Note that pd.factorize returns the integer codes first and the unique values second, and that the unique values are in order of first appearance unless sort=True is passed.
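A short sketch of the pandas variant, assuming pandas is installed. pd.factorize returns the integer codes first and the unique values second, so the unpacking order is the reverse of np.unique with return_inverse=True. The variable names codes and uniques are chosen here for illustration.

```python
import numpy as np
import pandas as pd

data_set = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')

# pd.factorize returns (codes, uniques), in that order.
# By default uniques follows the order of first appearance in data_set.
codes, uniques = pd.factorize(data_set)

# The round-trip condition from the question
round_trip_ok = bool((uniques[codes] == data_set).all())
print(codes, uniques, round_trip_ok)
```

Whether this is actually faster than np.unique depends on the data and library versions, so it is worth timing both on a representative array.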