Python: How best to parse csv and count values for only a subset


Question

I have a CSV file with 3 columns and 11 rows, the first row being a header. I created it myself to have a simple file to learn from. Each row after the header is a single order of fruit.

OrderNo      Fruit     Origin
1           Apple        NY
2           Orange       FL      
3           Banana       CA
4           Pear         NJ
5           Grapes       VA
6           Grapes       VA
7           Grapes       MD
8           Grapes       MA
9           Pineapple    HI
10          Grapes       GA

I am trying to parse this data in Python to do the following:

(1) determine which state generates the most orders for each type of fruit, (2) determine the highest number of orders from any single state for each fruit, and (3) output the result in alphabetical order, like so:

Apple NY 1
Banana CA 1
Grapes VA 2
Orange FL 1
Pear NJ 1
Pineapple HI 1

After reading the CSV file with csv.reader, I tried to do the counting with Counter and a for loop:

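import csv
from collections import Counter

cnt = Counter()
# The with block closes the file even if parsing fails.
with open("/test.csv", newline="") as f:
    reader = csv.reader(f, delimiter=",")
    header = next(reader)  # skip the header row via the reader, not the raw file

    for row in reader:
        cnt[row[2]] += 1  # row[2] is Origin, so this counts orders per state only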

But is there a better way?
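
For comparison with the Counter attempt above, counting (fruit, state) pairs rather than states alone gets closer to the stated goal. The following is a minimal standard-library sketch, not a definitive solution. It assumes the file is comma-separated with the header names shown in the sample; the inline SAMPLE string and the helper name top_states are illustrative and do not come from the original post.

```python
import csv
import io
from collections import Counter

# Inline copy of the sample data, assumed to be comma-separated (illustrative).
SAMPLE = """OrderNo,Fruit,Origin
1,Apple,NY
2,Orange,FL
3,Banana,CA
4,Pear,NJ
5,Grapes,VA
6,Grapes,VA
7,Grapes,MD
8,Grapes,MA
9,Pineapple,HI
10,Grapes,GA
"""


def top_states(lines):
    """Return sorted (fruit, state, count) rows for each fruit's busiest state(s)."""
    reader = csv.DictReader(lines)
    # Count every (fruit, state) pair in a single pass.
    pair_counts = Counter((row["Fruit"], row["Origin"]) for row in reader)

    # Highest count seen for each fruit.
    best = {}
    for (fruit, _state), n in pair_counts.items():
        best[fruit] = max(best.get(fruit, 0), n)

    # Keep pairs that reach their fruit's maximum (ties are all kept);
    # sorting the tuples gives alphabetical order by fruit, then state.
    return sorted(
        (fruit, state, n)
        for (fruit, state), n in pair_counts.items()
        if n == best[fruit]
    )


for fruit, state, n in top_states(io.StringIO(SAMPLE)):
    print(fruit, state, n)
```

To run this against a real file, pass an open file object instead of the io.StringIO wrapper, for example top_states(open("/test.csv", newline="")).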

Recommended answer

I'd actually use pandas, which works like a combination of list, dictionary, spreadsheet, and database. It is specifically designed for manipulating data in this way.

import pandas as pd
from collections import defaultdict

path_to_file = "/test.csv"
df = pd.read_csv(path_to_file)

groups = df.groupby(['Fruit', 'Origin'])

# First pass through the groups: store the maximum count for each fruit,
# so that ties can be handled in the second pass.
max_for_fruit = defaultdict(int)
for (fruit, state), group in groups:
    max_for_fruit[fruit] = max(max_for_fruit[fruit], len(group))

# Second pass: print every state whose count equals the fruit's maximum.
for (fruit, state), group in groups:
    count = len(group)
    if count == max_for_fruit[fruit]:
        print("{} {} {}".format(fruit, state, count))

This is the output.

Apple NY 1
Banana CA 1
Grapes VA 2
Orange FL 1
Pear NJ 1
Pineapple HI 1
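
The two passes over the groups can also be expressed with pandas operations alone. The sketch below is one possible variant, not the answerer's code: it counts rows per (Fruit, Origin) pair with size(), then uses transform("max") to compare each count with its fruit's maximum. The inline SAMPLE data and the column name Orders are assumptions made for illustration.

```python
import io

import pandas as pd

# Inline copy of the sample data, assumed to be comma-separated (illustrative).
SAMPLE = """OrderNo,Fruit,Origin
1,Apple,NY
2,Orange,FL
3,Banana,CA
4,Pear,NJ
5,Grapes,VA
6,Grapes,VA
7,Grapes,MD
8,Grapes,MA
9,Pineapple,HI
10,Grapes,GA
"""

df = pd.read_csv(io.StringIO(SAMPLE))

# One row per (Fruit, Origin) pair with its order count.
# "Orders" is a hypothetical column name chosen for this sketch.
counts = df.groupby(["Fruit", "Origin"]).size().reset_index(name="Orders")

# Broadcast each fruit's maximum back onto its rows, then keep the rows
# that match it. Tied states for a fruit are all kept.
is_top = counts["Orders"] == counts.groupby("Fruit")["Orders"].transform("max")
result = counts[is_top].sort_values(["Fruit", "Origin"])

print(result.to_string(index=False, header=False))
```

Keeping the result as a DataFrame, rather than printing inside a loop, makes it easy to write it back out with to_csv or to filter it further.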

http://pandas.pydata.org/pandas-docs/stable/groupby.html

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

http://pandas.pydata.org/pandas-docs/stable/tutorials.html
