Python - reading a csv and grouping data by a column

Question

I am working with a csv file with 3 columns that looks like this:

timeStamp, value, label
15:22:57, 849, CPU pid=26298:percent
15:22:57, 461000, JMX MB
15:22:58, 28683, Disks I/O
15:22:58, 3369078, Memory pid=26298:unit=mb:resident
15:22:58, 0, JMX 31690:gc-time
15:22:58, 0, CPU pid=26298:percent
15:22:58, 503000, JMX MB

The label column contains a small set of distinct values (say a total of 5), which include spaces, colons and other special characters.

What I am trying to achieve is to plot each metric against time (either on the same plot or on separate ones). I can do this with matplotlib, but I first need to group the [timeStamp, value] pairs according to the 'label'.

I looked into csv.DictReader to get the labels and itertools.groupby to group by the 'label', but I am struggling to do this in a proper 'pythonic' way.

Any suggestions?

Recommended answer

You don't need groupby; you want to use collections.defaultdict to collect series of [timestamp, value] pairs keyed by label:

from collections import defaultdict
import csv

# Maps each label to its list of [timestamp, value] pairs.
per_label = defaultdict(list)

# Python 3: open in text mode with newline='' (Python 2 used 'rb' here).
with open(inputfilename, newline='') as inputfile:
    reader = csv.reader(inputfile)
    next(reader, None)  # skip the header row

    for timestamp, value, label in reader:
        per_label[label.strip()].append([timestamp.strip(), float(value)])

Now per_label is a dictionary with labels as keys, and a list of [timestamp, value] pairs as values; I've stripped off whitespace (your input sample has a lot of extra whitespace) and turned the value column into floats.

For your (limited) input sample that results in:

{'CPU pid=26298:percent': [['15:22:57', 849.0], ['15:22:58', 0.0]],
 'Disks I/O': [['15:22:58', 28683.0]],
 'JMX 31690:gc-time': [['15:22:58', 0.0]],
 'JMX MB': [['15:22:57', 461000.0], ['15:22:58', 503000.0]],
 'Memory pid=26298:unit=mb:resident': [['15:22:58', 3369078.0]]}
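
To get from that dictionary to something plottable, each list of pairs still has to be split into an x series and a y series. The following is only a sketch of one way to do that, not part of the original answer: the sample rows are inlined through io.StringIO so it runs without a file, and the names SAMPLE, group_by_label and to_series are made up for illustration. It also assumes the timestamps are always HH:MM:SS with no date.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# The sample rows from the question, inlined so the sketch runs without a file.
SAMPLE = """\
timeStamp, value, label
15:22:57, 849, CPU pid=26298:percent
15:22:57, 461000, JMX MB
15:22:58, 28683, Disks I/O
15:22:58, 3369078, Memory pid=26298:unit=mb:resident
15:22:58, 0, JMX 31690:gc-time
15:22:58, 0, CPU pid=26298:percent
15:22:58, 503000, JMX MB
"""


def group_by_label(lines):
    """Collect [timestamp, value] pairs keyed by the stripped label."""
    per_label = defaultdict(list)
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    for timestamp, value, label in reader:
        per_label[label.strip()].append([timestamp.strip(), float(value)])
    return per_label


def to_series(pairs, fmt="%H:%M:%S"):
    """Split pairs into parallel lists of datetimes and values.

    Assumption: timestamps carry no date, so strptime fills in its
    default date (1900-01-01) for every point.
    """
    times = [datetime.strptime(ts, fmt) for ts, _ in pairs]
    values = [value for _, value in pairs]
    return times, values


per_label = group_by_label(io.StringIO(SAMPLE))

# One (times, values) tuple per label, ready to hand to a plotting call.
series = {label: to_series(pairs) for label, pairs in per_label.items()}
```

Each entry in series could then be drawn with something like matplotlib's plt.plot(times, values, label=label), either on one set of axes or on one per label. Parsing to datetime rather than keeping the raw strings means the x axis is spaced by actual time instead of by row order.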
