Sorting an entire CSV by frequency of occurrence in one column

Question

I have a large CSV file, which is a log of caller data.

A short snippet of my file:

CompanyName    High Priority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
Customer2         No              User
Customer3         No              Equipment
Customer1         No              Neither
Customer3         No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User

I want to sort the entire list by how often each customer occurs, so it will look like this:

CompanyName    High Priority     QualityIssue
Customer3         No               Equipment
Customer3         No               User
Customer3         Yes              User
Customer3         Yes              Equipment
Customer1         Yes              User
Customer1         Yes              User
Customer1         No               Neither
Customer2         No               User
Customer4         No               User

I've tried groupby, but that only prints the company name and the frequency, not the other columns. I also tried:

df['Totals']= [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]

df = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]

But these give me an error: ValueError: Wrong number of items passed 1, indices imply 24

I've seen something like this:

for key, value in sorted(mydict.iteritems(), key=lambda (k,v): (v,k)):
    print "%s: %s" % (key, value)

but this only prints two columns, and I want to sort my entire CSV. My output should be the entire CSV, sorted by the frequency of the values in the first column.

Thanks for your help!

Recommended answer

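This seems to do what you want: add a count column by performing a groupby and a transform with 'count', then sort on that column. The order of rows that share the same count may differ from the output shown here, depending on the pandas version:

In [22]:

# Number of rows per company, repeated on every row of that company
df['count'] = df.groupby('CompanyName')['CompanyName'].transform('count')
# sort_values replaces df.sort, which was removed in pandas 0.20
df.sort_values('count', ascending=False)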
Out[22]:
  CompanyName HighPriority QualityIssue count
5   Customer3           No         User     4
3   Customer3           No    Equipment     4
7   Customer3          Yes    Equipment     4
6   Customer3          Yes         User     4
0   Customer1          Yes         User     3
4   Customer1           No      Neither     3
1   Customer1          Yes         User     3
8   Customer4           No         User     1
2   Customer2           No         User     1
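
The steps above can be put together into a self-contained sketch. It assumes the file is comma-separated and that the header is spelled `HighPriority` as in the output above; the sample text stands in for the real file, and the names `csv_text`, `counts`, `result` and `sorted_csv` are illustrative. Sorting on the company name as a second key is an addition to the answer: it keeps the rows of two customers with the same count from interleaving.

```python
import io

import pandas as pd

# Sample data standing in for the real file (assumed to be comma-separated).
csv_text = """CompanyName,HighPriority,QualityIssue
Customer1,Yes,User
Customer1,Yes,User
Customer2,No,User
Customer3,No,Equipment
Customer1,No,Neither
Customer3,No,User
Customer3,Yes,User
Customer3,Yes,Equipment
Customer4,No,User
"""

df = pd.read_csv(io.StringIO(csv_text))

# Per-row frequency of the row's company.
counts = df.groupby("CompanyName")["CompanyName"].transform("count")

result = (
    df.assign(count=counts)
    # Most frequent company first; ties between companies broken by name.
    .sort_values(["count", "CompanyName"], ascending=[False, True])
    # The helper column is no longer needed once the rows are ordered.
    .drop(columns="count")
)

# Serialize back to CSV text; pass a file path instead to write to disk.
sorted_csv = result.to_csv(index=False)
print(result)
```

For a real file, `pd.read_csv("calls.csv")` would replace the `io.StringIO` call, where `calls.csv` is a hypothetical file name.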

You can drop the extraneous column using df.drop:

In [24]:
df.drop('count', axis=1)

Out[24]:
  CompanyName HighPriority QualityIssue
5   Customer3           No         User
3   Customer3           No    Equipment
7   Customer3          Yes    Equipment
6   Customer3          Yes         User
0   Customer1          Yes         User
4   Customer1           No      Neither
1   Customer1          Yes         User
8   Customer4           No         User
2   Customer2           No         User
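
If pandas is not available, the same ordering can be sketched with the standard library alone. This is a swapped-in technique rather than the answer's method: it counts the companies with `collections.Counter` and sorts whole rows, not just key/value pairs as in the dictionary loop from the question. It assumes a comma-separated file with a `CompanyName` header; `csv_text`, `freq` and `sorted_rows` are illustrative names.

```python
import csv
import io
from collections import Counter

# Sample data standing in for the real file (assumed to be comma-separated).
csv_text = """CompanyName,HighPriority,QualityIssue
Customer1,Yes,User
Customer1,Yes,User
Customer2,No,User
Customer3,No,Equipment
Customer1,No,Neither
Customer3,No,User
Customer3,Yes,User
Customer3,Yes,Equipment
Customer4,No,User
"""

# Read every row as a dict keyed by the header names.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# How many times each company appears in the log.
freq = Counter(row["CompanyName"] for row in rows)

# Highest frequency first, then company name to keep equal-count groups apart.
# sorted() is stable, so rows of one company keep their original order.
sorted_rows = sorted(
    rows,
    key=lambda row: (-freq[row["CompanyName"]], row["CompanyName"]),
)

# Write the complete rows back out as CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(sorted_rows)
print(out.getvalue())
```

Because every row is kept as a whole dictionary, all columns travel with the company name through the sort.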
