pandas groupby for multiple data frames/files at once


Problem description


I have multiple huge tsv files that I'm trying to process using pandas. I want to group by 'col3' and 'col5'. I've tried this:

import pandas as pd
df = pd.read_csv('filename.txt', sep = "\t")
g2 = df.drop_duplicates(['col3', 'col5'])
g3 = g2.groupby(['col3', 'col5']).size().sum(level=0)
print(g3)

This prints the following output:

yes 2
no  2


I'd like to be able to aggregate the output from multiple files, i.e., to be able to group by these two columns in all the files at once and print one common output with total number of occurrences of 'yes' or 'no' or whatever that attribute could be. In other words, I'd now like to use groupby on multiple files at once. And if a file doesn't have one of these columns, it should be skipped and should go to the next file.
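A minimal pandas-only sketch of this requirement (no extra libraries), keeping the per-file `drop_duplicates` from the snippet above. The sample filenames and tiny sample frames below are illustrative stand-ins for the real tsv files:

```python
import glob
import pandas as pd

# Tiny sample TSV files stand in for the real data; in practice the
# files would already exist on disk. sample3.txt lacks 'col5' and
# should therefore be skipped.
pd.DataFrame({'col3': ['yes', 'yes', 'no'],
              'col5': ['a', 'b', 'c']}).to_csv('sample1.txt', sep='\t', index=False)
pd.DataFrame({'col3': ['no'], 'col5': ['d'],
              'other': [1]}).to_csv('sample2.txt', sep='\t', index=False)
pd.DataFrame({'col3': ['yes']}).to_csv('sample3.txt', sep='\t', index=False)

frames = []
for path in glob.glob('sample*.txt'):
    df = pd.read_csv(path, sep='\t')
    # Skip any file that is missing either grouping column.
    if not {'col3', 'col5'}.issubset(df.columns):
        continue
    frames.append(df[['col3', 'col5']].drop_duplicates())

combined = pd.concat(frames, ignore_index=True)
# Count each ('col3', 'col5') pair, then total per 'col3' value --
# the same result as .size().sum(level=0), which was removed in pandas 2.0.
counts = combined.groupby(['col3', 'col5']).size().groupby(level=0).sum()
print(counts)
```

This reads every file eagerly, so it suits files that fit in memory; the answer below shows an out-of-core alternative.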

Recommended answer


This is a nice use case for blaze.


Here's an example using a couple of reduced files from the nyctaxi dataset. I've purposely split a single large file into two files of 1,000,000 lines each:

In [16]: from blaze import Data, compute, by

In [17]: ls
trip10.csv  trip11.csv

In [18]: d = Data('*.csv')

In [19]: expr = by(d[['passenger_count', 'medallion']], avg_time=d.trip_time_in_secs.mean())

In [20]: %time result = compute(expr)
CPU times: user 3.22 s, sys: 393 ms, total: 3.61 s
Wall time: 3.6 s

In [21]: !du -h *
194M    trip10.csv
192M    trip11.csv

In [22]: len(d)
Out[22]: 2000000

In [23]: result.head()
Out[23]:
   passenger_count                         medallion  avg_time
0                0  08538606A68B9A44756733917323CE4B         0
1                0  0BB9A21E40969D85C11E68A12FAD8DDA        15
2                0  9280082BB6EC79247F47EB181181D1A4         0
3                0  9F4C63E44A6C97DE0EF88E537954FC33         0
4                0  B9182BF4BE3E50250D3EAB3FD790D1C9        14
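For comparison, the same computation can be sketched in plain pandas without blaze, by expanding the glob pattern manually and concatenating before grouping. The column names follow the nyctaxi example above; two tiny sample CSVs stand in for trip10.csv and trip11.csv:

```python
import glob
import pandas as pd

# Sample data standing in for the real trip files.
pd.DataFrame({'passenger_count': [1, 1, 2],
              'medallion': ['A', 'A', 'B'],
              'trip_time_in_secs': [10, 20, 30]}).to_csv('trip_a.csv', index=False)
pd.DataFrame({'passenger_count': [2],
              'medallion': ['B'],
              'trip_time_in_secs': [50]}).to_csv('trip_b.csv', index=False)

# Read every matching file, concatenate, and average per group --
# the computation blaze's by() expresses, done eagerly in memory.
d = pd.concat((pd.read_csv(p) for p in glob.glob('trip_*.csv')),
              ignore_index=True)
result = (d.groupby(['passenger_count', 'medallion'], as_index=False)
            .agg(avg_time=('trip_time_in_secs', 'mean')))
print(result)
```

Unlike blaze, this holds the full concatenated frame in memory, which is why the note below recommends chunked reading or binary formats for larger inputs.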


Note: This will perform the computation with pandas, using pandas' own chunked CSV reader. If your files are in the GB range you're better off converting to a format such as bcolz or PyTables, as these are binary formats designed for data analysis on huge files. CSVs are just blobs of text with conventions.
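The chunked reading the note refers to can be sketched directly with `pd.read_csv(..., chunksize=...)`: accumulate per-chunk partial sums and counts, then combine them into per-group means, so a file larger than memory can still be aggregated. The filename, column names, and tiny chunk size are placeholders for illustration:

```python
import pandas as pd

# A tiny sample file stands in for a file too large to load at once.
pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
              'value': [1, 3, 5, 7]}).to_csv('big.csv', index=False)

# Aggregate partial (sum, count) per chunk; only one chunk is in
# memory at a time.
partials = []
for chunk in pd.read_csv('big.csv', chunksize=2):
    partials.append(chunk.groupby('key')['value'].agg(['sum', 'count']))

# Combine the per-chunk partials into a global mean per key.
totals = pd.concat(partials).groupby(level=0).sum()
means = totals['sum'] / totals['count']
print(means)
```

Mean is computed from sums and counts rather than averaging per-chunk means, so groups split across chunk boundaries are still weighted correctly.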

