Sorting entire csv by frequency of occurrence in one column
Question
I have a large csv file, which is a log of caller data.
A short snippet of my file:
CompanyName High Priority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
I want to sort the entire list by how often each customer occurs, so that it looks like this:
CompanyName High Priority QualityIssue
Customer3 No Equipment
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer1 Yes User
Customer1 Yes User
Customer1 No Neither
Customer2 No User
Customer4 No User
I've tried groupby, but that only prints the company name and the frequency, not the other columns. I also tried
df['Totals']= [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]
and
df = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]
But these give me errors: ValueError: Wrong number of items passed 1, indices imply 24
I've seen something like this:
for key, value in sorted(mydict.iteritems(), key=lambda (k,v): (v,k)):
    print "%s: %s" % (key, value)
But this only prints out two columns, and I want to sort my entire csv. My output should be my entire csv, sorted by the frequency of the values in the first column.
Thanks for your help!
Recommended answer
This seems to do what you want: add a count column by performing a groupby and a transform with 'count', and then sort on that column. On current pandas the sort is sort_values (the older df.sort has been removed), and rows with equal counts may come out in a different order than shown:
In [22]:
df['count'] = df.groupby('CompanyName')['CompanyName'].transform('count')
df.sort_values('count', ascending=False)
Out[22]:
CompanyName HighPriority QualityIssue count
5 Customer3 No User 4
3 Customer3 No Equipment 4
7 Customer3 Yes Equipment 4
6 Customer3 Yes User 4
0 Customer1 Yes User 3
4 Customer1 No Neither 3
1 Customer1 Yes User 3
8 Customer4 No User 1
2 Customer2 No User 1
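In the listing above, the order of rows inside each company is not the file order. If rows should keep their original order within each company, as in the desired output in the question, one possible variant is sketched below. It is an illustration, not the answerer's code: it assumes the file is comma-separated with a HighPriority header (the question only shows a whitespace rendering), and it swaps the groupby/transform step for a `value_counts` lookup via `map`, followed by a stable argsort on the negated counts.

```python
import io

import numpy as np
import pandas as pd

# Sample rows copied from the question; the comma-separated layout
# and the HighPriority spelling are assumptions for this sketch.
CSV_TEXT = """CompanyName,HighPriority,QualityIssue
Customer1,Yes,User
Customer1,Yes,User
Customer2,No,User
Customer3,No,Equipment
Customer1,No,Neither
Customer3,No,User
Customer3,Yes,User
Customer3,Yes,Equipment
Customer4,No,User
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Look up each row's company in the frequency table (one count per row).
freq = df["CompanyName"].map(df["CompanyName"].value_counts())

# Stable argsort on the negated counts: most frequent company first,
# rows with equal counts keep their original file order.
order = np.argsort(-freq.to_numpy(), kind="stable")
result = df.iloc[order]

print(result.to_string(index=False))
```

Because the counts live in a separate Series rather than in a new column, there is nothing to drop afterwards.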
You can drop the extraneous column using df.drop:
In [24]:
df.drop('count', axis=1)
Out[24]:
CompanyName HighPriority QualityIssue
5 Customer3 No User
3 Customer3 No Equipment
7 Customer3 Yes Equipment
6 Customer3 Yes User
0 Customer1 Yes User
4 Customer1 No Neither
1 Customer1 Yes User
8 Customer4 No User
2 Customer2 No User
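Since the question asks for the whole csv to be sorted, the steps above (count, sort, drop) can be wrapped into one read-sort-write helper. This is a minimal sketch under assumptions: the function name `sort_csv_by_frequency`, the helper column name `_count`, and the comma-separated sample data are all made up for illustration, and the order of rows with equal counts is left unspecified.

```python
import io

import pandas as pd


def sort_csv_by_frequency(src, dst, column):
    """Read a CSV from src, order rows by how often column's value occurs, write to dst.

    Hypothetical helper: src and dst may be file paths or file-like objects.
    """
    df = pd.read_csv(src)
    # Helper column: number of rows sharing this row's value in `column`.
    df["_count"] = df.groupby(column)[column].transform("count")
    # Most frequent values first.
    df = df.sort_values("_count", ascending=False)
    # Remove the helper column before writing the result.
    df.drop(columns="_count").to_csv(dst, index=False)


# Usage with in-memory buffers standing in for real files.
src = io.StringIO(
    "CompanyName,HighPriority,QualityIssue\n"
    "Customer1,Yes,User\n"
    "Customer1,Yes,User\n"
    "Customer2,No,User\n"
    "Customer3,No,Equipment\n"
    "Customer1,No,Neither\n"
    "Customer3,No,User\n"
    "Customer3,Yes,User\n"
    "Customer3,Yes,Equipment\n"
    "Customer4,No,User\n"
)
dst = io.StringIO()
sort_csv_by_frequency(src, dst, "CompanyName")
print(dst.getvalue())
```

With real files, the two buffers would be replaced by paths, since both `pd.read_csv` and `DataFrame.to_csv` accept either.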