pandas get unique values from column of lists
How do I get the unique values of a column of lists in pandas or numpy, so that the second column (Genre) would yield 'action', 'crime', 'drama'?
The closest (but non-functional) solutions I could come up with were:
genres = data['Genre'].unique()
But this predictably raises a TypeError, because lists aren't hashable:
TypeError: unhashable type: 'list'
Using a set seemed like a good idea, but
genres = data.apply(set(), columns=['Genre'], axis=1)
also fails, this time with
TypeError: set() takes no keyword arguments
If you only want to find the unique values, I'd recommend using itertools.chain.from_iterable to concatenate all those lists:
>>> import itertools
>>> import numpy as np
>>> np.unique([*itertools.chain.from_iterable(df.Genre)])
array(['action', 'crime', 'drama'], dtype='<U6')
Or, even faster, build a set directly:
>>> set(itertools.chain.from_iterable(df.Genre))
{'action', 'crime', 'drama'}
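For anyone who wants to run this outside an interactive session, here is a minimal self-contained sketch. The sample frame is an assumption that mirrors the one used in the timings. The last line shows Series.explode, which is not part of the answer above; it is a pandas-native alternative available from pandas 0.25 onward.

```python
import itertools

import pandas as pd

# Assumed sample data, mirroring the frame used in the timings.
df = pd.DataFrame({"Genre": [["crime", "drama"], ["action", "crime", "drama"]]})

# Flatten the column of lists lazily, then deduplicate with a set.
unique_genres = set(itertools.chain.from_iterable(df["Genre"]))

# Sets are unordered, so sort when a stable order matters.
sorted_genres = sorted(unique_genres)
print(sorted_genres)  # ['action', 'crime', 'drama']

# Alternative (not from the answer above): explode the lists into one row
# per element, then call unique(). Values come back in order of appearance.
exploded_genres = df["Genre"].explode().unique()
```

The set-based version returns a plain Python set, while explode().unique() returns a NumPy object array, so pick whichever type suits the code that consumes the result.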
Timings
df = pd.DataFrame({'Genre':[['crime','drama'],['action','crime','drama']]})
df = pd.concat([df]*10000)
%timeit set(itertools.chain.from_iterable(df.Genre))
100 loops, best of 3: 2.55 ms per loop
%timeit set([x for y in df['Genre'] for x in y])
100 loops, best of 3: 4.09 ms per loop
%timeit np.unique([*itertools.chain.from_iterable(df.Genre)])
100 loops, best of 3: 12.8 ms per loop
%timeit np.unique(df['Genre'].sum())
1 loop, best of 3: 1.65 s per loop
%timeit set(df['Genre'].sum())
1 loop, best of 3: 1.66 s per loop
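%timeit is an IPython magic, so the lines above will not run in a plain Python script. The sketch below reproduces the comparison with the standard timeit module. It uses a smaller frame (1,000 copies rather than 10,000, my own choice) so the sum()-based variant finishes quickly; the dictionary keys are just illustrative labels. Absolute numbers will differ by machine and library version.

```python
import itertools
import timeit

import numpy as np
import pandas as pd

# Same construction as in the timings, but with fewer copies (assumption).
df = pd.DataFrame({"Genre": [["crime", "drama"], ["action", "crime", "drama"]]})
df = pd.concat([df] * 1000)

# Each candidate returns the unique genres in some container type.
candidates = {
    "set + chain": lambda: set(itertools.chain.from_iterable(df["Genre"])),
    "set + comprehension": lambda: set([x for y in df["Genre"] for x in y]),
    "np.unique + chain": lambda: np.unique(
        [*itertools.chain.from_iterable(df["Genre"])]
    ),
    "set + sum": lambda: set(df["Genre"].sum()),
}

results = {}
for name, func in candidates.items():
    # Keep each result as a set so the variants can be compared for equality.
    results[name] = set(func())
    # Best of 3 repeats, 5 calls per repeat, reported per call.
    seconds = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

A likely reason the sum() variants are so much slower is that summing lists concatenates them pairwise, copying the accumulated list at every step, whereas chain.from_iterable walks each inner list once.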