Counting duplicate values in Pandas DataFrame
Question
There must be an easy way to do this, but I was unable to find an elegant solution on SO or work it out myself.
I'm trying to count the number of duplicate values based on a set of columns in a DataFrame.
Example:
print(df)
     Month  LSOA code  Longitude   Latitude             Crime type
0  2015-01  E01000916  -0.106453  51.518207          Bicycle theft
1  2015-01  E01000914  -0.111497  51.518226               Burglary
2  2015-01  E01000914  -0.111497  51.518226               Burglary
3  2015-01  E01000914  -0.111497  51.518226            Other theft
4  2015-01  E01000914  -0.113767  51.517372  Theft from the person
My workaround:
counts = dict()
for i, row in df.iterrows():
    # Key each row by the columns that define a duplicate
    key = (
        row['Longitude'],
        row['Latitude'],
        row['Crime type']
    )
    # Tally how many times each key has been seen
    if key in counts:
        counts[key] = counts[key] + 1
    else:
        counts[key] = 1
This gives me the counts:
{(-0.11376700000000001, 51.517371999999995, 'Theft from the person'): 1,
 (-0.111497, 51.518226, 'Burglary'): 2,
 (-0.111497, 51.518226, 'Other theft'): 1,
 (-0.10645299999999999, 51.518207000000004, 'Bicycle theft'): 1}
Aside from the fact that this code could be improved as well (feel free to comment on how), what would be the way to do this with Pandas?
For those interested, I'm working on a dataset from https://data.police.uk/
Recommended answer
You can use groupby with the size function. Then reset the index, renaming column 0 to count.
print(df)
     Month  LSOA code  Longitude   Latitude             Crime type
0  2015-01  E01000916  -0.106453  51.518207          Bicycle theft
1  2015-01  E01000914  -0.111497  51.518226               Burglary
2  2015-01  E01000914  -0.111497  51.518226               Burglary
3  2015-01  E01000914  -0.111497  51.518226            Other theft
4  2015-01  E01000914  -0.113767  51.517372  Theft from the person
df = df.groupby(['Longitude', 'Latitude', 'Crime type']).size().reset_index(name='count')
print(df)
   Longitude   Latitude             Crime type  count
0  -0.113767  51.517372  Theft from the person      1
1  -0.111497  51.518226               Burglary      2
2  -0.111497  51.518226            Other theft      1
3  -0.106453  51.518207          Bicycle theft      1
print(df['count'])
0    1
1    2
2    1
3    1
Name: count, dtype: int64
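For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the five sample rows from the question, repeats the groupby/size approach, and also touches on the side question about improving the loop. The variable names (cols, loop_counts, grouped, tallies, repeated) are only illustrative, and the Counter, value_counts and duplicated variants are alternatives I am adding here, not part of the original answer. DataFrame.value_counts assumes pandas 1.1 or later.

```python
from collections import Counter

import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Month": ["2015-01"] * 5,
    "LSOA code": ["E01000916", "E01000914", "E01000914", "E01000914", "E01000914"],
    "Longitude": [-0.106453, -0.111497, -0.111497, -0.111497, -0.113767],
    "Latitude": [51.518207, 51.518226, 51.518226, 51.518226, 51.517372],
    "Crime type": ["Bicycle theft", "Burglary", "Burglary", "Other theft",
                   "Theft from the person"],
})

# Columns that together define a "duplicate" (illustrative name).
cols = ["Longitude", "Latitude", "Crime type"]

# Plain-Python take on the loop in the question: Counter tallies each key tuple.
loop_counts = Counter(zip(df["Longitude"], df["Latitude"], df["Crime type"]))

# groupby + size, as in the answer; reset_index(name=...) names the count column.
grouped = df.groupby(cols).size().reset_index(name="count")

# Alternative (pandas >= 1.1): value_counts returns a Series indexed by the
# column combination, sorted by count in descending order.
tallies = df.value_counts(subset=cols)

# Rows whose combination occurs more than once (keep=False marks every copy).
repeated = df[df.duplicated(subset=cols, keep=False)]

print(grouped)
print(tallies)
print(repeated)
```

If you only need the counts, groupby/size or value_counts is enough; the duplicated mask is handy when you want to look at the repeated rows themselves.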