Splitting a dataframe into separate CSV files

Question

I have a fairly large csv, looking like this:

+---------+---------+
| Column1 | Column2 |
+---------+---------+
|       1 |   93644 |
|       2 |   63246 |
|       3 |   47790 |
|       3 |   39644 |
|       3 |   32585 |
|       1 |   19593 |
|       1 |   12707 |
|       2 |   53480 |
+---------+---------+

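My intent is to:

  1. Add a new column
  2. Insert a specific value, "NewColumnValue", into that column on each row of the csv
  3. Sort the file based on the value in Column1
  4. Split the original CSV into new files based on the contents of "Column1", removing the header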

For example, I want to end up with multiple files that look like:

+---+-------+----------------+
| 1 | 19593 | NewColumnValue |
| 1 | 93644 | NewColumnValue |
| 1 | 12707 | NewColumnValue |
+---+-------+----------------+

+---+-------+----------------+
| 2 | 63246 | NewColumnValue |
| 2 | 53480 | NewColumnValue |
+---+-------+----------------+

+---+-------+----------------+
| 3 | 47790 | NewColumnValue |
| 3 | 39644 | NewColumnValue |
| 3 | 32585 | NewColumnValue |
+---+-------+----------------+

I have managed to do this using separate .py files:

Step 1

# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
df.to_csv('ready.csv', index=False, header=False)

Step 2

import csv
from itertools import groupby
for key, rows in groupby(csv.reader(open("ready.csv")),
                         lambda row: row[0]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")

But I'd really like to learn how to accomplish everything in a single .py file. I tried this:

# -*- coding: utf-8 -*-
#This processes a large CSV file.  
#It will add a new column, populate the new column with a uniform piece of data for each row, sort the CSV, and remove headers
#Then it will split the single large CSV into multiple CSVs based on the value in column 0 
import pandas as pd
import csv
from itertools import groupby
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
for key, rows in groupby(csv.reader((df)),
                         lambda row: row[0]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")

But instead of working as intended, it's giving me multiple CSVs named after each column header.

Is that happening because I removed the header row when I used separate .py files and I'm not doing it here? I'm not really certain what operation I need to do when splitting the files to remove the header.

Recommended answer

Why not just groupby Column1 and save each group?

df = df.sort_values('Column1').assign(NewColumn='NewColumnValue')
print(df)

   Column1  Column2       NewColumn
0        1    93644  NewColumnValue
5        1    19593  NewColumnValue
6        1    12707  NewColumnValue
1        2    63246  NewColumnValue
7        2    53480  NewColumnValue
2        3    47790  NewColumnValue
3        3    39644  NewColumnValue
4        3    32585  NewColumnValue

for i, g in df.groupby('Column1'):
    g.to_csv('{}.csv'.format(i), header=False, index=False)

Thanks to Unatiel for the improvement. header=False omits the header row, and index=False omits the index column. (index_label=False only suppresses the label of the index column; the index values themselves would still be written.)
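
Putting the pieces together, the whole job can live in a single script. What follows is only a minimal sketch, not the answerer's exact code: the helper name split_by_column and its out_dir parameter are made up for illustration, it uses index=False to drop the index column, and the demo builds a small in-memory frame and writes to a temporary directory instead of reading source.csv.

```python
import os
import tempfile

import pandas as pd


def split_by_column(df, column, out_dir, new_value="NewColumnValue"):
    """Write one headerless CSV per distinct value of `column`; return the paths.

    Hypothetical helper, for illustration only.
    """
    # Add the constant column; assign() returns a new frame, so the
    # caller's df is left untouched.
    df = df.assign(NewColumn=new_value)
    paths = []
    # groupby yields the groups in sorted key order and keeps the
    # original row order inside each group.
    for key, group in df.groupby(column):
        path = os.path.join(out_dir, "{}.csv".format(key))
        # header=False omits the header row, index=False omits the index column.
        group.to_csv(path, header=False, index=False)
        paths.append(path)
    return paths


# Demo data copied from the question; a real script would instead use
# df = pd.read_csv('source.csv')
df = pd.DataFrame({
    "Column1": [1, 2, 3, 3, 3, 1, 1, 2],
    "Column2": [93644, 63246, 47790, 39644, 32585, 19593, 12707, 53480],
})

out_dir = tempfile.mkdtemp()
paths = split_by_column(df, "Column1", out_dir)

# Show the contents of the first file that was written (1.csv).
with open(paths[0]) as f:
    first_file = f.read()
print(first_file)
```

Since groupby already returns the groups in key order, the explicit sort_values call should not be needed for the split itself; it only matters if the sorted frame is used elsewhere.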

This creates 3 files:

1.csv
2.csv
3.csv

Each file contains the rows of the corresponding Column1 group.
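
As for why the single-file attempt produced files named after the column headers: it does not appear to be related to removing the header row. csv.reader expects an iterable of text lines, but iterating over a DataFrame yields its column labels, so each label is parsed as a one-field "row" and ends up as a groupby key. A small sketch, using a shortened version of the question's data, illustrates this:

```python
import csv
from itertools import groupby

import pandas as pd

df = pd.DataFrame({
    "Column1": [1, 2, 3],
    "Column2": [93644, 63246, 47790],
})
df["NewColumn"] = "NewColumnValue"

# Iterating over a DataFrame gives the column labels, not the rows.
labels = list(df)

# csv.reader treats every label as one line of CSV text.
parsed = list(csv.reader(df))

# The groupby keys are therefore the column names, which is where the
# unexpected file names came from.
keys = [key for key, _ in groupby(parsed, lambda row: row[0])]
print(keys)
```

Grouping with df.groupby, as in the answer, avoids the problem because it works on the row values directly rather than on text lines.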
