Split csv file thousands of times based on groupby


Problem description

(An adaptation of David Erickson's question.)

Given a CSV file with columns a, b, and c and some values:

echo 'a,b,c' > file.csv
head -c 10000000 /dev/urandom | od -d | awk 'BEGIN{OFS = ","}{print $2, $3, $4}' | head -n 10000 >> file.csv

We would like to sort by columns a and b:

sort -t ',' -k1,1n -k2,2n file.csv > file_.csv
head -n 3 file_.csv
>a,b,c
3,50240,18792
7,54871,39438

And then, for every unique pair (a, b), create a new CSV titled '{a}_Invoice_{b}.csv'.

The main challenge seems to be the I/O overhead of writing thousands of files. I started trying with awk but ran into the error awk: 17 makes too many open files.

Is there a quicker way to do this, in awk, Python, or some other scripting language?

Additional info:

  • I know I can do this in Pandas. I'm looking for a faster way using text processing.
  • Though I used urandom to generate the sample data, the real data has runs of recurring values: for example, a few rows where a=3, b=7. Such rows should be saved together in one file. (The idea is to replicate Pandas' groupby -> to_csv.)

Recommended answer

In Python:

import pandas as pd

df = pd.read_csv("file.csv")
for (a, b), gb in df.groupby(['a', 'b']):
    gb.to_csv(f"{a}_Invoice_{b}.csv", header=True, index=False)
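
If Pandas is not wanted, the sorted file from the question (file_.csv) could also be split with the standard library alone. The following is only a sketch under stated assumptions: the header is the first line, the rows are already sorted by (a, b), there are no blank lines, and the function name split_sorted_csv and its out_dir parameter are made up for illustration. Only one output file is open at a time, so the open-file limit is not an issue.

```python
import csv
import os
from itertools import groupby
from operator import itemgetter


def split_sorted_csv(path, out_dir="."):
    """Write one '{a}_Invoice_{b}.csv' per run of equal (a, b) values.

    Hypothetical helper: assumes the input is sorted by its first two columns.
    """
    written = []
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)  # kept so it can be repeated in every output file
        # groupby only merges adjacent rows, hence the sorted-input assumption
        for (a, b), rows in groupby(reader, key=itemgetter(0, 1)):
            out_path = os.path.join(out_dir, f"{a}_Invoice_{b}.csv")
            with open(out_path, "w", newline="") as dst:
                # "\n" line endings to match what awk would write
                writer = csv.writer(dst, lineterminator="\n")
                writer.writerow(header)
                writer.writerows(rows)
            written.append(out_path)
    return written
```

Calling split_sorted_csv("file_.csv") would write the output files into the current directory and return their paths. On unsorted input the same key can appear in several runs, and a later run would overwrite the file written by an earlier one.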


In awk you can split like so; you will need to put the header back on each resulting file:

awk -F',' 'NR>1 { out=$1"_Invoice_"$2".csv"; print >> out; close(out) }' file.csv

To add the header row back:

awk -F',' 'NR==1 { hdr=$0; next } { out=$1"_Invoice_"$2".csv"; if (!seen[out]++) {print hdr > out} print >> out; close(out); }' file.csv

The benefit of this last example is that the input file.csv doesn't need to be sorted and is processed in a single pass.
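
The same single-pass idea can be sketched in plain Python. Instead of reopening an output file for every row, as the awk one-liner does, this version buffers the lines of each (a, b) group in memory and opens each output file exactly once. It is an illustration rather than a benchmarked solution, and it assumes that the whole input fits in memory, that fields contain no quoted commas (the same assumption as awk -F','), and that every row has at least two columns. The name split_unsorted_csv and the out_dir parameter are hypothetical.

```python
import os
from collections import defaultdict


def split_unsorted_csv(path, out_dir="."):
    """Group lines by their first two comma-separated fields, then write each group once.

    Hypothetical helper: the input does not need to be sorted.
    """
    groups = defaultdict(list)
    with open(path) as src:
        header = next(src)  # first line is assumed to be the header
        for line in src:
            if not line.endswith("\n"):
                line += "\n"  # normalise a final line that lacks a newline
            a, b = line.rstrip("\n").split(",", 2)[:2]
            groups[(a, b)].append(line)

    # Each output file is opened a single time, so few handles are open at once.
    for (a, b), lines in groups.items():
        out_path = os.path.join(out_dir, f"{a}_Invoice_{b}.csv")
        with open(out_path, "w") as dst:
            dst.write(header)
            dst.writelines(lines)
    return len(groups)
```

Calling split_unsorted_csv("file.csv") returns the number of files written. If the data does not fit in memory, sorting first and streaming one group at a time is the more conservative option.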
