Use Python to split a CSV file with multiple headers
Problem description
I have a CSV file that is constantly being appended to. It has multiple header rows, and the only thing the headers have in common is that the first column is always "NAME".
How do I split the single CSV file into separate CSV files, one for each header row?
Here is a sample file:
"NAME","AGE","SEX","WEIGHT","CITY"
"Bob",20,"M",120,"New York"
"Peter",33,"M",220,"Toronto"
"Mary",43,"F",130,"Miami"
"NAME","COUNTRY","SPORT","NUMBER","SPORT","NUMBER"
"Larry","USA","Football",14,"Baseball",22
"Jenny","UK","Rugby",5,"Field Hockey",11
"Jacques","Canada","Hockey",19,"Volleyball",4
"NAME","DRINK","QTY"
"Jesse","Beer",6
"Wendel","Juice",1
"Angela","Milk",3
Recommended answer
If the CSV file is not huge, so that all of it fits in memory at once, just use read() to read the file into a string and then apply a regex to that string:
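import re

# ur_csv is the path to the input CSV file
with open(ur_csv) as f:
    data = f.read()

# Each chunk runs from a line starting with "NAME" up to the next such line (or the end of the data)
chunks = re.finditer(r'(^"NAME".*?)(?=^"NAME"|\Z)', data, re.S | re.M)
for i, chunk in enumerate(chunks, 1):
    with open('/path/{}.csv'.format(i), 'w') as fout:
        fout.write(chunk.group(1))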
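To see what the pattern captures, here is a small sketch that applies the same regex to an in-memory string instead of a file. The helper name split_chunks and the shortened sample data are assumptions made for illustration. The lazy `.*?` combined with the lookahead stops each match just before the next line that begins with "NAME", or at the end of the input.

```python
import re

# Same pattern as in the answer: a lazy match that starts at a line beginning
# with "NAME" and stops just before the next such line or the end of input.
HEADER_CHUNK = re.compile(r'(^"NAME".*?)(?=^"NAME"|\Z)', re.S | re.M)


def split_chunks(text):
    """Return the header-plus-rows blocks found in text (hypothetical helper)."""
    return [m.group(1) for m in HEADER_CHUNK.finditer(text)]


# Shortened sample data, made up for this illustration.
sample = (
    '"NAME","AGE"\n'
    '"Bob",20\n'
    '"NAME","DRINK","QTY"\n'
    '"Jesse","Beer",6\n'
    '"Wendel","Juice",1\n'
)

chunks = split_chunks(sample)
for number, chunk in enumerate(chunks, 1):
    print(number, repr(chunk))
```

Joining the chunks back together reproduces the input exactly, so no rows are dropped or duplicated as long as the data starts with a header line.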
If the size of the file is a concern, you can use mmap to create something that behaves like one big string (a bytes-like object in Python 3) but is not all in memory at the same time.
Then use the mmap object with a regex to separate the CSV chunks like so:
import mmap
import re

# Open in binary mode: an mmap is bytes-like, so the pattern and the output must be bytes too
with open(ur_csv, 'rb') as f:
    mf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunks = re.finditer(rb'(^"NAME".*?)(?=^"NAME"|\Z)', mf, re.S | re.M)
    for i, chunk in enumerate(chunks, 1):
        with open('/path/{}.csv'.format(i), 'wb') as fout:
            fout.write(chunk.group(1))
In either case, this writes each chunk to its own file, named 1.csv, 2.csv, and so on.
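As a complement that is not part of the original answer, the same split can be sketched without a regex by streaming the file line by line and starting a new output file whenever a line begins with "NAME". This keeps only one line in memory at a time. The function name split_by_header and its parameters are hypothetical, and like the regex it assumes that any line starting with "NAME" is a header row.

```python
import os
import tempfile


def split_by_header(src_path, out_dir, marker='"NAME"'):
    """Stream src_path and start a new numbered file at each header line.

    Returns the list of paths written. Names here are illustrative only.
    """
    written = []
    fout = None
    try:
        # newline='' leaves the original line endings untouched.
        with open(src_path, newline='') as fin:
            for line in fin:
                if line.startswith(marker):
                    if fout is not None:
                        fout.close()
                    path = os.path.join(out_dir, '{}.csv'.format(len(written) + 1))
                    fout = open(path, 'w', newline='')
                    written.append(path)
                if fout is not None:  # ignore anything before the first header
                    fout.write(line)
    finally:
        if fout is not None:
            fout.close()
    return written


# Small demonstration in a temporary directory with made-up data.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'input.csv')
    with open(src, 'w', newline='') as f:
        f.write('"NAME","AGE"\n"Bob",20\n"NAME","DRINK","QTY"\n"Jesse","Beer",6\n')
    names = [os.path.basename(p) for p in split_by_header(src, tmp)]
    print(names)
```

Because it never holds more than one line, this variant may suit a file that keeps growing, at the cost of being a different technique from the regex shown above.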