Unique zero-based id for values in pandas
Question
I have some data in a DataFrame with an identifier column.
import pandas as pd
data = pd.DataFrame({'id': [50, 50, 30, 10, 50, 50, 30]})
For each unique id, I want to come up with a new unique identifier. I'd like the ids to be sequential integers starting at 0. Here's what I have so far:
import numpy as np
unique = data[['id']].drop_duplicates()
unique['group'] = np.arange(len(unique))
data = data.merge(unique, how='inner', on='id')
This works but seems a little dirty. Is there a better way?
Recommended answer
That is what pandas.factorize does:
import pandas as pd
data = pd.DataFrame({'id': [50, 50, 30, 10, 50, 50, 30]})
print(pd.factorize(data.id)[0])
Output:
[0 0 1 2 0 0 1]
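To attach these codes to the original frame as the new identifier, something like the following sketch should work. The column name `group` is borrowed from the question, and `codes` and `uniques` are just illustrative names for the two values that `pd.factorize` returns.

```python
import pandas as pd

data = pd.DataFrame({'id': [50, 50, 30, 10, 50, 50, 30]})

# factorize returns (codes, uniques); codes are numbered in order of first appearance
codes, uniques = pd.factorize(data['id'])
data['group'] = codes

print(data)
# uniques[k] is the original id that was mapped to code k
print(list(uniques))
```

Keeping `uniques` around makes it possible to map a code back to its original id later.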
numpy.unique can also do this:
import numpy as np
print(np.unique([50, 50, 30, 10, 50, 50, 30], return_inverse=True)[1])
Output:
[2 2 1 0 2 2 1]
The index output by numpy.unique is sorted by value, so the smallest value, 10, is assigned index 0. To get the same result with factorize, set the sort argument to True:
pd.factorize(data.id, sort=True)[0]
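As a quick check of the claim above, this sketch compares the two approaches on the same data. The variable names (`ids`, `sorted_codes`, `inverse`) are illustrative only.

```python
import numpy as np
import pandas as pd

ids = pd.Series([50, 50, 30, 10, 50, 50, 30])

# sort=True numbers the codes by sorted value instead of order of first appearance
sorted_codes, sorted_uniques = pd.factorize(ids, sort=True)

# return_inverse=True gives, for each element, its position in the sorted unique values
inverse = np.unique(ids.to_numpy(), return_inverse=True)[1]

print(sorted_codes)
print(inverse)
```

For this input both calls should yield the same codes, with 10 mapped to 0, 30 to 1, and 50 to 2.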