Counting particular occurrences in a CSV file with Python

Question

I have a CSV file with four columns: {Tag, User, Quality, Cluster_id}. Using Python, I would like to do the following: for every cluster_id (from 1 to 500), I want to see, for each user, the number of good and bad tags (obtained from the Quality column). There are more than 6000 users. I can only read the CSV file row by row, so I am not sure how this can be done.

For example:

Columns of csv = [Tag User Quality Cluster]
Row1 = [bag    u1 good 1]
Row2 = [ground u2 bad  2]
Row3 = [xxx    u1 bad  1]
Row4 = [bbb    u2 good 3]

So far I have only managed to read each row of the CSV file.

I can only access one row at a time, so I cannot use two nested for loops. The pseudocode of the algorithm I want to implement is:

for cluster in clusters:  
    for user in users:  
        if eval == good:  
            good_num = good_num +1  
        else:  
            bad_num = bad_num + 1

Recommended answer

Since someone has already posted a defaultdict solution, I'm going to give a pandas one, just for variety. pandas is a very handy library for data processing. Among other nice features, it can handle this counting problem in one line, depending on what kind of output is required. Really:

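import pandas as pd
df = pd.read_csv("cluster.csv")
# Number of rows for each (Cluster_id, User, Quality) combination
counted = df.groupby(["Cluster_id", "User", "Quality"]).size()
# Write the counts themselves, not the original frame
counted.to_csv("counted.csv")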
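
To make the shape of that result concrete, here is a minimal sketch on a tiny in-memory frame rather than the real cluster.csv. The sample rows and the `count_quality` helper name are made up for illustration, and `unstack(fill_value=0)` is just one possible way to turn the Quality values into separate good/bad columns.

```python
import pandas as pd


def count_quality(df):
    """Return one row per (Cluster_id, User) with one count column per Quality value."""
    counts = df.groupby(["Cluster_id", "User", "Quality"]).size()
    # Move the Quality level into columns; combinations that never occur become 0.
    return counts.unstack("Quality", fill_value=0)


# Hypothetical sample rows, mirroring the example in the question.
sample = pd.DataFrame(
    {
        "Tag": ["bag", "ground", "xxx", "bbb"],
        "User": ["u1", "u2", "u1", "u2"],
        "Quality": ["good", "bad", "bad", "good"],
        "Cluster_id": [1, 2, 1, 3],
    }
)

table = count_quality(sample)
print(table)
```

With this layout, a single lookup such as `table.loc[(1, "u1"), "good"]` gives the number of good tags for user u1 in cluster 1.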

-

Just to give a preview of what pandas makes easy, we can load the file. The main data storage object in pandas is called a "DataFrame":

>>> import pandas as pd
>>> df = pd.read_csv("cluster.csv")
>>> df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 499999
Data columns:
Tag           500000  non-null values
User          500000  non-null values
Quality       500000  non-null values
Cluster_id    500000  non-null values
dtypes: int64(1), object(3)

We can check that the first few rows look okay:

>>> df[:5]
   Tag  User Quality  Cluster_id
0  bbb  u001     bad          39
1  bbb  u002     bad          36
2  bag  u003    good          11
3  bag  u004    good           9
4  bag  u005     bad          26

Then we can group by Cluster_id and User, and do work on each group:

>>> for name, group in df.groupby(["Cluster_id", "User"]):
...     print('group name:', name)
...     print('group rows:')
...     print(group)
...     print('counts of Quality values:')
...     print(group["Quality"].value_counts())
...     input()  # pause until Enter is pressed
...
group name: (1, 'u003')
group rows:
        Tag  User Quality  Cluster_id
372002  xxx  u003     bad           1
counts of Quality values:
bad    1

group name: (1, 'u004')
group rows:
           Tag  User Quality  Cluster_id
126003  ground  u004     bad           1
348003  ground  u004    good           1
counts of Quality values:
good    1
bad     1

group name: (1, 'u005')
group rows:
           Tag  User Quality  Cluster_id
42004   ground  u005     bad           1
258004  ground  u005     bad           1
390004  ground  u005     bad           1
counts of Quality values:
bad    3
[etc.]

If you're going to be doing a lot of processing of CSV files, pandas is definitely worth a look.
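
For comparison, the standard-library route mentioned at the start of this answer can be sketched roughly as follows. This is not the other poster's code, only an illustration of the idea: it reads one row at a time, as the question requires, and keeps a running tally per (Cluster_id, User) pair. The function name `count_by_cluster_and_user` and the in-memory sample data are assumptions made for the example.

```python
import csv
import io
from collections import Counter, defaultdict


def count_by_cluster_and_user(rows):
    """Tally Quality values per (Cluster_id, User) in a single pass over the rows."""
    counts = defaultdict(Counter)
    for row in rows:
        key = (int(row["Cluster_id"]), row["User"])
        counts[key][row["Quality"]] += 1
    return counts


# Hypothetical sample data standing in for cluster.csv.
sample_csv = io.StringIO(
    "Tag,User,Quality,Cluster_id\n"
    "bag,u1,good,1\n"
    "ground,u2,bad,2\n"
    "xxx,u1,bad,1\n"
    "bbb,u2,good,3\n"
)

counts = count_by_cluster_and_user(csv.DictReader(sample_csv))
for (cluster_id, user), tally in sorted(counts.items()):
    # A Counter returns 0 for a Quality value that was never seen.
    print(cluster_id, user, tally["good"], tally["bad"])
```

To run this against a real file, the `io.StringIO` object could be swapped for an open file handle, for example `open("cluster.csv", newline="")`. Only one row is held in memory at a time, plus the dictionary of tallies.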
