Splitting a dataframe into separate CSV files
I have a fairly large csv, looking like this:
+---------+---------+
| Column1 | Column2 |
+---------+---------+
| 1 | 93644 |
| 2 | 63246 |
| 3 | 47790 |
| 3 | 39644 |
| 3 | 32585 |
| 1 | 19593 |
| 1 | 12707 |
| 2 | 53480 |
+---------+---------+
My intent is to
- Add a new column
- Insert a specific value into that column, 'NewColumnValue', on each row of the csv
- Sort the file based on the value in Column1
- Split the original CSV into new files based on the contents of 'Column1', removing the header
For example, I want to end up with multiple files that look like:
+---+-------+----------------+
| 1 | 19593 | NewColumnValue |
| 1 | 93644 | NewColumnValue |
| 1 | 12707 | NewColumnValue |
+---+-------+----------------+
+---+-------+-----------------+
| 2 | 63246 | NewColumnValue |
| 2 | 53480 | NewColumnValue |
+---+-------+-----------------+
+---+-------+-----------------+
| 3 | 47790 | NewColumnValue |
| 3 | 39644 | NewColumnValue |
| 3 | 32585 | NewColumnValue |
+---+-------+-----------------+
I have managed to do this using separate .py files:
Step1
# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
df.to_csv('ready.csv', index=False, header=False)
Step2
import csv
from itertools import groupby
for key, rows in groupby(csv.reader(open("ready.csv")),
                         lambda row: row[0]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
But I'd really like to learn how to accomplish everything in a single .py file. I tried this:
# -*- coding: utf-8 -*-
# This processes a large CSV file.
# It will add a new column, populate the new column with a uniform piece of data for each row, sort the CSV, and remove headers
# Then it will split the single large CSV into multiple CSVs based on the value in column 0
import pandas as pd
import csv
from itertools import groupby
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
for key, rows in groupby(csv.reader((df)),
                         lambda row: row[0]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
but instead of working as intended, it's giving me multiple CSVs named after each column header.
Is that happening because I removed the header row when I used separate .py files and I'm not doing it here? I'm not really certain what operation I need to do when splitting the files to remove the header.
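A likely explanation for the symptom above, offered as a sketch rather than a full diagnosis: iterating over a pandas DataFrame yields its column labels, not its rows. When csv.reader is handed the DataFrame, it treats each label as one line of CSV text, so groupby sees one "row" per column, and each key is a column name. The small frame below is a stand-in for source.csv built from the sample values in the question.

```python
import csv

import pandas as pd

# Stand-in for source.csv (values taken from the question's sample data).
df = pd.DataFrame({"Column1": [1, 2, 3], "Column2": [93644, 63246, 47790]})
df["NewColumn"] = "NewColumnValue"

# Iterating over a DataFrame yields its column labels, not its rows.
labels = list(df)

# csv.reader parses each yielded string as one line of CSV text, so every
# "row" it produces is a one-element list holding a column label.
parsed = list(csv.reader(df))

print(labels)
print(parsed)
```

So the header row itself is not the problem; the data rows never reach csv.reader at all in that attempt.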
Why not just groupby Column1 and save each group?
df = df.sort_values('Column1').assign(NewColumn='NewColumnValue')
print(df)
   Column1  Column2       NewColumn
0        1    93644  NewColumnValue
5        1    19593  NewColumnValue
6        1    12707  NewColumnValue
1        2    63246  NewColumnValue
7        2    53480  NewColumnValue
2        3    47790  NewColumnValue
3        3    39644  NewColumnValue
4        3    32585  NewColumnValue
for i, g in df.groupby('Column1'):
    g.to_csv('{}.csv'.format(i), header=False, index=False)
Thanks to Unatiel for the improvement. header=False will not write headers and index=False will not write the index column. (index_label=False only suppresses the label of the index column; the index values themselves would still be written.)
This creates 3 files:
1.csv
2.csv
3.csv
Each contains the rows for its Column1 group.
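Putting the pieces together, the whole task can be sketched as a single script. This is a minimal sketch under some assumptions: the function name split_csv_by_column and its parameters are hypothetical, the sort uses kind="mergesort" only to keep the original row order within each group, and index=False is passed so that the row index is left out of the output files.

```python
import os
import tempfile

import pandas as pd


def split_csv_by_column(source_path, out_dir, column="Column1",
                        new_value="NewColumnValue"):
    """Hypothetical helper: add a constant column, sort, and write one
    headerless CSV per distinct value of `column`. Returns the paths written."""
    df = pd.read_csv(source_path)
    # mergesort is stable, so rows keep their original order within a group.
    df = df.sort_values(column, kind="mergesort").assign(NewColumn=new_value)

    written = []
    for key, group in df.groupby(column):
        path = os.path.join(out_dir, "{}.csv".format(key))
        # header=False drops the header row; index=False drops the row index.
        group.to_csv(path, header=False, index=False)
        written.append(path)
    return written


if __name__ == "__main__":
    # Demo in a temporary directory, using the sample data from the question.
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "source.csv")
        with open(source, "w") as f:
            f.write("Column1,Column2\n1,93644\n2,63246\n3,47790\n"
                    "3,39644\n3,32585\n1,19593\n1,12707\n2,53480\n")
        for path in split_csv_by_column(source, tmp):
            print(os.path.basename(path))
```

In real use, source_path would point at the actual source.csv and out_dir at the folder where the per-group files should go.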