Splitting a dataframe into separate CSV files
Question
I have a fairly large csv, looking like this:
+---------+---------+
| Column1 | Column2 |
+---------+---------+
| 1 | 93644 |
| 2 | 63246 |
| 3 | 47790 |
| 3 | 39644 |
| 3 | 32585 |
| 1 | 19593 |
| 1 | 12707 |
| 2 | 53480 |
+---------+---------+
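My intention is to:
- add a new column
- insert a specific value, "NewColumnValue", into that column on every row of the csv
- sort the file based on the value in Column1
- split the original CSV into new files based on the contents of "Column1", removing the header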
For example, I want to end up with multiple files that look like:
+---+-------+----------------+
| 1 | 19593 | NewColumnValue |
| 1 | 93644 | NewColumnValue |
| 1 | 12707 | NewColumnValue |
+---+-------+----------------+
+---+-------+----------------+
| 2 | 63246 | NewColumnValue |
| 2 | 53480 | NewColumnValue |
+---+-------+----------------+
+---+-------+----------------+
| 3 | 47790 | NewColumnValue |
| 3 | 39644 | NewColumnValue |
| 3 | 32585 | NewColumnValue |
+---+-------+----------------+
I have managed to do this using separate .py files:
Step 1
# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
df.to_csv('ready.csv', index=False, header=False)
Step 2
import csv
from itertools import groupby
for key, rows in groupby(csv.reader(open("ready.csv")),
                         lambda row: row[0]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
But I'd really like to learn how to accomplish everything in a single .py file. I tried this:
# -*- coding: utf-8 -*-
# This processes a large CSV file.
# It will add a new column, populate the new column with a uniform piece of data for each row, sort the CSV, and remove headers.
# Then it will split the single large CSV into multiple CSVs based on the value in column 0.
import pandas as pd
import csv
from itertools import groupby
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
for key, rows in groupby(csv.reader((df)),
                         lambda row: row[0]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
But instead of working as intended, it gives me multiple CSVs named after each column header.
Is that happening because I removed the header row when I used separate .py files and I'm not doing it here? I'm not really certain what operation I need to do when splitting the files to remove the header.
Answer
Why not just group by Column1 and save each group?
df = df.sort_values('Column1').assign(NewColumn='NewColumnValue')
print(df)
   Column1  Column2       NewColumn
0        1    93644  NewColumnValue
5        1    19593  NewColumnValue
6        1    12707  NewColumnValue
1        2    63246  NewColumnValue
7        2    53480  NewColumnValue
2        3    47790  NewColumnValue
3        3    39644  NewColumnValue
4        3    32585  NewColumnValue
for i, g in df.groupby('Column1'):
    g.to_csv('{}.csv'.format(i), header=False, index=False)
Thanks to Unatiel for the improvement. header=False will not write the header row, and index=False will not write the index column. (index_label=False only suppresses the label of the index column; the index values themselves are still written.)
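Putting the steps together, a single-file version might look like the sketch below. The function name split_csv_by_column and its parameters are hypothetical, chosen only for illustration; the column names and the one-file-per-key naming follow the question. It passes index=False, which is the to_csv argument that leaves out the index column.

```python
import os

import pandas as pd


def split_csv_by_column(source_path, out_dir, column="Column1",
                        new_column="NewColumn", new_value="NewColumnValue"):
    """Write one headerless CSV per distinct value of `column`.

    Hypothetical helper for illustration; returns the paths it wrote.
    """
    df = pd.read_csv(source_path)
    # Add the constant column to every row.
    df = df.assign(**{new_column: new_value})
    written = []
    # groupby iterates over the groups in sorted key order by default,
    # so an explicit sort_values is not needed just for splitting.
    for key, group in df.groupby(column):
        path = os.path.join(out_dir, "{}.csv".format(key))
        # header=False omits the header row; index=False omits the index column.
        group.to_csv(path, header=False, index=False)
        written.append(path)
    return written

# Example call (assumes source.csv exists in the current directory):
# split_csv_by_column("source.csv", ".")
```

Within each output file the rows keep the order they had in the source file.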
This creates 3 files:
1.csv
2.csv
3.csv
Each contains the rows for the corresponding Column1 group.
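On the question of why the single-file attempt produced files named after the column headers: the missing header removal is not the real cause. Iterating over a DataFrame yields its column labels rather than its rows, so csv.reader(df) treats each label as one line of text, and groupby then creates one group per label. A small sketch with made-up in-memory data (a stand-in for source.csv, not the real file) shows this:

```python
import csv
from itertools import groupby

import pandas as pd

# Small in-memory stand-in for source.csv (illustrative data only).
df = pd.DataFrame({"Column1": [1, 2, 3], "Column2": [93644, 63246, 47790]})
df["NewColumn"] = "NewColumnValue"

# Iterating a DataFrame yields its column labels, not its rows.
labels = list(df)

# csv.reader therefore parses each label as a one-field "row" ...
parsed = list(csv.reader(df))

# ... and grouping on row[0] produces one key per column label,
# which is where the files named after the headers come from.
keys = [key for key, _ in groupby(csv.reader(df), lambda row: row[0])]

print(labels)
print(parsed)
print(keys)
```

Grouping the DataFrame itself with df.groupby, as in the answer, avoids the round trip through the csv module.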