Splitting a dataframe into separate CSV files

Views: 158 | Published: 2018/5/30 13:52:36 | Tags: python pandas dataframe group-by pandas-groupby


Question

I have a fairly large csv, looking like this:

    +---------+---------+
    | Column1 | Column2 |
    +---------+---------+
    | 1       | 93644   |
    | 2       | 63246   |
    | 3       | 47790   |
    | 3       | 39644   |
    | 3       | 32585   |
    | 1       | 19593   |
    | 1       | 12707   |
    | 2       | 53480   |
    +---------+---------+

My intent is to

Add a new column

Insert a specific value into that column, 'NewColumnValue', on each row of the csv

Sort the file based on the value in Column1

Split the original CSV into new files based on the contents of 'Column1', removing the header

For example, I want to end up with multiple files that look like:

    +---+-------+----------------+
    | 1 | 19593 | NewColumnValue |
    | 1 | 93644 | NewColumnValue |
    | 1 | 12707 | NewColumnValue |
    +---+-------+----------------+

    +---+-------+----------------+
    | 2 | 63246 | NewColumnValue |
    | 2 | 53480 | NewColumnValue |
    +---+-------+----------------+

    +---+-------+----------------+
    | 3 | 47790 | NewColumnValue |
    | 3 | 39644 | NewColumnValue |
    | 3 | 32585 | NewColumnValue |
    +---+-------+----------------+

I have managed to do this using separate .py files:

Step 1

    # -*- coding: utf-8 -*-
    import pandas as pd

    df = pd.read_csv('source.csv')
    df = df.sort_values('Column1')
    df['NewColumn'] = 'NewColumnValue'
    df.to_csv('ready.csv', index=False, header=False)

Step 2

    import csv
    from itertools import groupby

    for key, rows in groupby(csv.reader(open("ready.csv")), lambda row: row[0]):
        with open("%s.csv" % key, "w") as output:
            for row in rows:
                output.write(",".join(row) + "\n")

But I'd really like to learn how to accomplish everything in a single .py file. I tried this:

    # -*- coding: utf-8 -*-
    # This processes a large CSV file.
    # It will add a new column, populate the new column with a uniform piece
    # of data for each row, sort the CSV, and remove headers.
    # Then it will split the single large CSV into multiple CSVs based on the
    # value in column 0.
    import pandas as pd
    import csv
    from itertools import groupby

    df = pd.read_csv('source.csv')
    df = df.sort_values('Column1')
    df['NewColumn'] = 'NewColumnValue'
    for key, rows in groupby(csv.reader((df)), lambda row: row[0]):
        with open("%s.csv" % key, "w") as output:
            for row in rows:
                output.write(",".join(row) + "\n")

but instead of working as intended, it's giving me multiple CSVs named after each column header.

Is that happening because I removed the header row when I used separate .py files and I'm not doing it here? I'm not really certain what operation I need to do when splitting the files to remove the header.

Solution

Why not just groupby Column1 and save each group?

    df = df.sort_values('Column1').assign(NewColumn='NewColumnValue')
    print(df)

       Column1  Column2       NewColumn
    0        1    93644  NewColumnValue
    5        1    19593  NewColumnValue
    6        1    12707  NewColumnValue
    1        2    63246  NewColumnValue
    7        2    53480  NewColumnValue
    2        3    47790  NewColumnValue
    3        3    39644  NewColumnValue
    4        3    32585  NewColumnValue

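    for i, g in df.groupby('Column1'):
        g.to_csv('{}.csv'.format(i), header=False, index=False)

Thanks to Unatiel for the improvement. header=False will not write headers and index=False will not write the index column. Note that index_label=False is not enough for this: it only suppresses the label of the index column, and the index values themselves would still be written as the first field of every row.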
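
For readers who want everything in one file, the approach above can be sketched as a single script. This is a minimal sketch rather than the answerer's exact code: the function name `split_csv_by_column`, its parameters, and the temporary directory used in the demo are assumptions made for illustration. It passes `index=False` to `to_csv`, which is the argument that leaves the row index out of the output, and uses `kind="mergesort"` so that the sort is stable.

```python
import os
import tempfile

import pandas as pd


def split_csv_by_column(source_path, out_dir, column="Column1",
                        new_value="NewColumnValue"):
    """Add a constant column, sort by `column`, write one headerless CSV per group.

    Returns the list of file paths written. The function name and its
    parameters are hypothetical, chosen only for this illustration.
    """
    df = pd.read_csv(source_path)
    # mergesort is stable, so rows keep their original relative order within a group
    df = df.sort_values(column, kind="mergesort").assign(NewColumn=new_value)

    written = []
    for key, group in df.groupby(column):
        path = os.path.join(out_dir, "{}.csv".format(key))
        # header=False drops the column names; index=False drops the row index
        group.to_csv(path, header=False, index=False)
        written.append(path)
    return written


if __name__ == "__main__":
    # Small demo on the sample data from the question, written to a temp directory
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "source.csv")
        with open(source, "w") as f:
            f.write("Column1,Column2\n1,93644\n2,63246\n3,47790\n3,39644\n"
                    "3,32585\n1,19593\n1,12707\n2,53480\n")
        for path in split_csv_by_column(source, tmp):
            print(os.path.basename(path))
            with open(path) as f:
                print(f.read())
```

Sorting is not strictly required for the split itself, since `groupby` already collects the rows belonging to each key; it is kept here to mirror the steps listed in the question.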

This creates 3 files:

    1.csv
    2.csv
    3.csv

Each file holds the rows for one Column1 group.
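
As for why the single-file attempt in the question produced files named after the column headers: the missing header removal does not seem to be the cause. Iterating over a DataFrame yields its column labels, and `csv.reader` accepts any iterable of strings, so `csv.reader(df)` parses each label as a one-field "row". Grouping on `row[0]` then gives one key per column. The short sketch below illustrates this on a two-row frame; the sample frame is made up for the demonstration.

```python
import csv
from itertools import groupby

import pandas as pd

# A tiny stand-in for the question's data (values chosen for illustration)
df = pd.DataFrame({"Column1": [1, 2], "Column2": [93644, 63246]})
df["NewColumn"] = "NewColumnValue"

# Iterating over a DataFrame yields its column labels, not its rows
labels = list(df)

# csv.reader treats each label string as a line of CSV text with a single field
parsed = list(csv.reader(df))

# Grouping on row[0] therefore produces one key per column label
keys = [key for key, _ in groupby(csv.reader(df), lambda row: row[0])]

print(labels)
print(parsed)
print(keys)
```

Each key ends up as a file name in the original loop, which matches the behaviour described in the question. Grouping the DataFrame directly with `df.groupby('Column1')` avoids the round trip through `csv.reader` altogether.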
