Splitting a dataframe into separate CSV files


Problem description

I have a fairly large csv, looking like this:

+---------+---------+
| Column1 | Column2 |
+---------+---------+
|       1 |   93644 |
|       2 |   63246 |
|       3 |   47790 |
|       3 |   39644 |
|       3 |   32585 |
|       1 |   19593 |
|       1 |   12707 |
|       2 |   53480 |
+---------+---------+

My intent is to

  1. Add a new column
  2. Insert a specific value into that column, 'NewColumnValue', on each row of the csv
  3. Sort the file based on the value in Column1
  4. Split the original CSV into new files based on the contents of 'Column1', removing the header

For example, I want to end up with multiple files that look like:

+---+-------+----------------+
| 1 | 19593 | NewColumnValue |
| 1 | 93644 | NewColumnValue |
| 1 | 12707 | NewColumnValue |
+---+-------+----------------+

+---+-------+----------------+
| 2 | 63246 | NewColumnValue |
| 2 | 53480 | NewColumnValue |
+---+-------+----------------+

+---+-------+----------------+
| 3 | 47790 | NewColumnValue |
| 3 | 39644 | NewColumnValue |
| 3 | 32585 | NewColumnValue |
+---+-------+----------------+

I have managed to do this using separate .py files:

Step 1

# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
df.to_csv('ready.csv', index=False, header=False)

Step 2

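import csv
from itertools import groupby

# ready.csv is already sorted by its first column, so groupby sees each key once.
with open("ready.csv", newline="") as source:
    for key, rows in groupby(csv.reader(source), lambda row: row[0]):
        # One output file per distinct value in the first column.
        with open("%s.csv" % key, "w") as output:
            for row in rows:
                output.write(",".join(row) + "\n")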

But I'd really like to learn how to accomplish everything in a single .py file. I tried this:

# -*- coding: utf-8 -*-
# This processes a large CSV file.
# It will add a new column, populate the new column with a uniform piece of data for each row, sort the CSV, and remove headers.
# Then it will split the single large CSV into multiple CSVs based on the value in column 0.
import pandas as pd
import csv
from itertools import groupby
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
for key, rows in groupby(csv.reader((df)),
                         lambda row: row[0]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")

but instead of working as intended, it's giving me multiple CSVs named after each column header.

Is that happening because I removed the header row when I used separate .py files and I'm not doing it here? I'm not really certain what operation I need to do when splitting the files to remove the header.

Solution
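
As for why the single-file attempt produced files named after the column headers: csv.reader accepts any iterable of strings, and iterating over a DataFrame yields its column labels rather than its rows. So csv.reader(df) parses each label as a one-field line, and groupby then keys on those labels. The header row removed in Step 1 is not the cause. A minimal sketch on a small made-up frame (the three rows are illustrative, not the full data):

```python
import csv
from itertools import groupby

import pandas as pd

# Illustrative three-row frame with the same column names as the question.
df = pd.DataFrame({"Column1": [1, 2, 3], "Column2": [93644, 63246, 47790]})
df["NewColumn"] = "NewColumnValue"

# Iterating over a DataFrame yields its column labels, not its rows.
labels = list(df)

# csv.reader treats each label as one line of CSV text with a single field.
parsed = list(csv.reader(df))

# Grouping on row[0] therefore produces one key per column label,
# which is what ends up in the "%s.csv" file names.
keys = [key for key, _ in groupby(csv.reader(df), lambda row: row[0])]

print(labels)
print(parsed)
print(keys)
```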

Why not just groupby Column1 and save each group?

df = df.sort_values('Column1').assign(NewColumn='NewColumnValue')
print(df)

   Column1  Column2       NewColumn
0        1    93644  NewColumnValue
5        1    19593  NewColumnValue
6        1    12707  NewColumnValue
1        2    63246  NewColumnValue
7        2    53480  NewColumnValue
2        3    47790  NewColumnValue
3        3    39644  NewColumnValue
4        3    32585  NewColumnValue


for i, g in df.groupby('Column1'):
    g.to_csv('{}.csv'.format(i), header=False, index=False)

Thanks to Unatiel for the improvement. header=False omits the header row, and index=False omits the index column. (index_label=False would only suppress the index label in the header; the index values would still be written.)
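
The effect of these to_csv options can be checked without writing any files, since to_csv returns the CSV text as a string when no path is given. The sketch below uses a two-row frame made up for illustration, with index values 0 and 5 as in the printed output above. It also includes index_label=False for comparison: in pandas that option only controls the label written for the index in the header, so the index values still appear.

```python
import pandas as pd

# Two illustrative rows; the index values 0 and 5 mimic the sorted frame.
df = pd.DataFrame(
    {"Column1": [1, 1], "Column2": [93644, 19593]}, index=[0, 5]
).assign(NewColumn="NewColumnValue")

# No path argument: to_csv returns the text instead of writing a file.
with_index = df.to_csv(header=False).splitlines()
label_off = df.to_csv(header=False, index_label=False).splitlines()
no_index = df.to_csv(header=False, index=False).splitlines()

print(with_index[0])  # 0,1,93644,NewColumnValue
print(label_off[0])   # 0,1,93644,NewColumnValue (index still written)
print(no_index[0])    # 1,93644,NewColumnValue
```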

This creates 3 files:

1.csv
2.csv
3.csv

Each file contains the rows of one Column1 group.
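
Putting the pieces together in a single .py file might look like the sketch below. The function name split_csv_by_column, its out_dir and key parameters, and the choice of a stable mergesort are assumptions made for illustration; they are not part of the original answer. It passes index=False so that the row index is not written to the output files.

```python
from pathlib import Path

import pandas as pd


def split_csv_by_column(source, out_dir, key="Column1"):
    """Add NewColumn, sort by `key`, and write one headerless CSV per key value.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    df = pd.read_csv(source)
    # mergesort is stable, so rows with equal keys keep their original order.
    df = df.sort_values(key, kind="mergesort").assign(NewColumn="NewColumnValue")

    written = []
    for value, group in df.groupby(key):
        path = out_dir / "{}.csv".format(value)
        # header=False drops the header row; index=False drops the row index.
        group.to_csv(path, header=False, index=False)
        written.append(path)
    return written


# Example call, assuming source.csv sits in the current directory:
# split_csv_by_column("source.csv", ".")
```

Because DataFrame.groupby collects every row for a key regardless of where it appears, the sort here only affects the row order inside each file; unlike itertools.groupby, it does not depend on the input being sorted first.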
