Parse CSV file and aggregate values, multiple columns

Question

I would like to adapt the post here (Parse CSV file and aggregate the values) to sum multiple columns instead of just one.

For this data:

CITY,AMOUNT,AMOUNT2,AMOUNTn
London,20,21,22
Tokyo,45,46,47
London,55,56,57
New York,25,26,27

How can I get:

CITY,AMOUNT,AMOUNT2,AMOUNTn
London,75,77,79
Tokyo,45,46,47
New York,25,26,27

I will eventually have several thousand columns, and unfortunately I cannot use the pandas package for this task. Here is the code I have so far. It just aggregates all three AMOUNT columns into one, which is not what I am after:

from __future__ import division
import csv
from collections import defaultdict

def default_factory():
    return [0, None, None, 0]

reader = csv.DictReader(open('test_in.txt'))
cities = defaultdict(default_factory)
for row in reader:
    headers = [r for r in row.keys()]
    headers.remove('CITY')
    for i in headers:
        amount = int(row[i])
        cities[row["CITY"]][0] += amount
        max = cities[row["CITY"]][1]
        cities[row["CITY"]][1] = amount if max is None else amount if amount > max else max
        min = cities[row["CITY"]][2]
        cities[row["CITY"]][2] = amount if min is None else amount if amount < min else min
        cities[row["CITY"]][3] += 1


for city in cities:
    cities[city][3] = cities[city][0]/cities[city][3] # calculate mean

with open('test_out.txt', 'wb') as myfile:
    writer = csv.writer(myfile, delimiter="\t")
    writer.writerow(["CITY", "AMOUNT", "AMOUNT2", "AMOUNTn", "max", "min", "mean"])
    writer.writerows([city] + cities[city] for city in cities)

Thank you very much for your help.

Recommended answer

Here is one way to do it using itertools.groupby:

import StringIO
import csv
import itertools

data = """CITY,AMOUNT,AMOUNT2,AMOUNTn
London,20,21,22
Tokyo,45,46,47
London,55,56,57
New York,25,26,27"""

# I use StringIO to create a file like object for demo purposes
f = StringIO.StringIO(data) 
fieldnames = f.readline().strip().split(',')
key = lambda x: x[0] # the first column will be a grouping key
# rows must be sorted by city before passing to itertools.groupby
rows_sorted = sorted(csv.reader(f), key=key)
outfile = StringIO.StringIO('')
writer = csv.DictWriter(outfile, fieldnames=fieldnames, lineterminator='\n')
writer.writeheader()
for city, rows in itertools.groupby(rows_sorted, key=key):
    # remove city column for aggregation, convert to ints
    rows = [[int(x) for x in row[1:]] for row in rows] 
    agg = [sum(column) for column in zip(*rows)]
    writer.writerow(dict(zip(fieldnames, [city] + agg)))

print outfile.getvalue()

# CITY,AMOUNT,AMOUNT2,AMOUNTn
# London,75,77,79
# New York,25,26,27
# Tokyo,45,46,47
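
If sorting the whole file is undesirable, or the output should keep the cities in the order they first appear (as in the question's expected output), a plain dict of running per-column sums may also work. The following is only a sketch under some assumptions: it targets Python 3.7+ (where dicts preserve insertion order), every non-key column holds an integer, and the helper name sum_columns_by_key and its key_field parameter are made up for illustration.

```python
import csv
import io


def sum_columns_by_key(infile, outfile, key_field="CITY"):
    """Sum every non-key column per distinct key, streaming row by row.

    Hypothetical helper for illustration; assumes all non-key cells are integers.
    """
    reader = csv.reader(infile)
    header = next(reader)
    key_index = header.index(key_field)

    totals = {}  # key -> list of running sums, in first-seen order (Python 3.7+)
    for row in reader:
        if not row:  # skip blank lines
            continue
        key = row[key_index]
        values = [int(v) for i, v in enumerate(row) if i != key_index]
        if key in totals:
            # add the new row to the running sums, column by column
            totals[key] = [a + b for a, b in zip(totals[key], values)]
        else:
            totals[key] = values

    writer = csv.writer(outfile, lineterminator="\n")
    writer.writerow(header)
    for key, sums in totals.items():
        out_row = list(sums)
        out_row.insert(key_index, key)  # put the key back in its original column
        writer.writerow(out_row)


# Demo with in-memory data; real code would pass open file objects instead.
DATA = """CITY,AMOUNT,AMOUNT2,AMOUNTn
London,20,21,22
Tokyo,45,46,47
London,55,56,57
New York,25,26,27
"""

result = io.StringIO()
sum_columns_by_key(io.StringIO(DATA), result)
print(result.getvalue())

# CITY,AMOUNT,AMOUNT2,AMOUNTn
# London,75,77,79
# Tokyo,45,46,47
# New York,25,26,27
```

Only one list of sums per distinct city is kept in memory, so the number of columns affects the width of each list but not the number of rows held at once.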
