Python CSV - Need to sum up values in a column grouped by value in another column
Problem description
I have data in a csv that needs to be parsed. It looks like:
Date, Name, Subject, SId, Mark
2/2/2013, Andy Cole, History, 216351, 98
2/2/2013, Andy Cole, Maths, 216351, 87
2/2/2013, Andy Cole, Science, 217387, 21
2/2/2013, Bryan Carr, Maths, 216757, 89
2/2/2013, Carl Jon, Botany, 218382, 78
2/2/2013, Bryan Carr, Biology, 216757, 27
I need to use SId as the key and sum up all the values in the Mark column for each key. The output would be something like:
Sid Mark
216351 185
217387 21
216757 116
218382 78
I do not have to write the output to a file; I just need it when I run the Python file. This is a similar question. How should that be changed to skip the columns in between?
Recommended answer
This is the concept of a histogram. Use a defaultdict(int) from collections and iterate through your rows. Use the 'SId' value as the key for the dict and add the 'Mark' value to the current value.
Using int as the default factory ensures that a key that does not exist yet has its value initialized to 0.
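To see that default initialization in isolation, here is a minimal sketch that compares defaultdict(int) with a plain dict. The names totals and plain are made up for illustration; the numbers are the two marks for SId 216351 from the question.

```python
from collections import defaultdict

totals = defaultdict(int)   # missing keys start at int() == 0
totals[216351] += 98        # key absent so far: 0 + 98
totals[216351] += 87        # key present: 98 + 87

plain = {}
try:
    plain[216351] += 98     # a plain dict has no default value
except KeyError:
    plain[216351] = 98      # so the first write has to be handled by hand

print(totals[216351])
print(plain[216351])
```

With a plain dict, the same loop would need an explicit membership check or dict.get(key, 0) on every update.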
from collections import defaultdict

d = defaultdict(int)
with open("data.txt") as f:
    for line in f:
        tokens = [t.strip() for t in line.split(",")]
        try:
            sid = int(tokens[3])   # SId column
            mark = int(tokens[4])  # Mark column
        except (ValueError, IndexError):
            continue  # skip the header row and blank or malformed lines
        d[sid] += mark
print(d)
defaultdict(<class 'int'>, {216351: 185, 217387: 21, 216757: 116, 218382: 78})
You can change the parsing part to anything else (e.g. use csv.reader or perform other validations). The key point here is to use the defaultdict(int) and to update it like so:
d[sid] += mark
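As one possible way to swap the parsing part, the sketch below uses csv.DictReader from the standard library, so the columns are picked by header name and the columns in between are simply ignored. It is an illustration and not part of the original answer: the function name sum_marks_by_sid and the SAMPLE string are assumptions, and the sample rows are copied from the question so that the block runs without a file.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the question. A real script would pass
# open("data.txt", newline="") instead of io.StringIO(SAMPLE).
SAMPLE = """Date, Name, Subject, SId, Mark
2/2/2013, Andy Cole, History, 216351, 98
2/2/2013, Andy Cole, Maths, 216351, 87
2/2/2013, Andy Cole, Science, 217387, 21
2/2/2013, Bryan Carr, Maths, 216757, 89
2/2/2013, Carl Jon, Botany, 218382, 78
2/2/2013, Bryan Carr, Biology, 216757, 27
"""


def sum_marks_by_sid(lines):
    """Return {SId: total Mark} for an iterable of CSV lines (hypothetical helper)."""
    totals = defaultdict(int)
    # skipinitialspace=True drops the blank that follows each comma,
    # so the header names come out as "SId" and "Mark" without padding.
    reader = csv.DictReader(lines, skipinitialspace=True)
    for row in reader:
        try:
            sid = int(row["SId"])
            mark = int(row["Mark"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric fields
        totals[sid] += mark
    return dict(totals)


totals = sum_marks_by_sid(io.StringIO(SAMPLE))
print("Sid Mark")
for sid, mark in totals.items():
    print(sid, mark)
```

Parsing both fields before the update keeps a bad Mark value from leaving an empty 0 entry in the defaultdict for that SId.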