Concatenating multiple csv files into a single csv with the same header - Python
Problem description
I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).
import glob
import pandas as pd

#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None)
    list_.append(df)
    stockstats_data = pd.concat(list_)
    print(file_ + " has been imported.")
This code works fine, but it is slow. It can take up to 2 days to process.
I was given a single line script for Terminal command line that does the same (but with no headers). This script takes 20 seconds.
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing it into a DataFrame and just concatenating the CSVs, but I cannot figure it out.
Thanks.
Recommended answer
If you don't need the CSV data in memory and are just copying from input to output, it is a lot cheaper to avoid parsing altogether and to copy without building anything up in memory:
import shutil
import glob

#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")
That's it; shutil.copyfileobj handles copying the data efficiently, dramatically reducing the Python-level work of parsing and reserializing.
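To see why the header skip works, here is a minimal sketch using in-memory streams instead of real files. The helper name append_without_header and the sample rows are made up for illustration. The point is that readline() moves the read position past the first line, and shutil.copyfileobj then copies only from that position to the end.

```python
import io
import shutil


def append_without_header(infile, outfile, skip_header):
    """Copy infile to outfile, optionally dropping its first line (hypothetical helper)."""
    if skip_header:
        infile.readline()  # advances the read position past the header line
    # copyfileobj starts at the current position, so only the remainder is copied
    shutil.copyfileobj(infile, outfile)


# Two tiny in-memory "files" with the same header (sample data, not from the question)
first = io.BytesIO(b"date,close\n2020-01-01,10\n")
second = io.BytesIO(b"date,close\n2020-01-02,11\n")
merged = io.BytesIO()

append_without_header(first, merged, skip_header=False)  # keep the header once
append_without_header(second, merged, skip_header=True)  # drop the repeated header

print(merged.getvalue().decode())
```

The same helper would work unchanged with files opened in 'rb' and 'wb' mode, since copyfileobj only needs objects with read and write methods.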
This assumes all the CSV files have the same format, encoding, line endings, etc., and the header doesn't contain embedded newlines, but if that's the case, it's a lot faster than the alternatives.
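If you are not sure those assumptions hold, a slightly more defensive variant might look like the sketch below. This is not part of the original answer: the function name concat_csv_checked is hypothetical, and it makes two extra assumptions, namely that a mismatched header should abort the merge, and that a file missing its final newline should get a plain "\n" appended so its last row does not run into the next file's first row. It still copies bytes without parsing, so it does not detect differing encodings or embedded newlines in the header.

```python
import os
import shutil
import tempfile


def concat_csv_checked(paths, out_path):
    """Concatenate CSV files byte-for-byte, keeping only the first header.

    Hypothetical helper: raises ValueError if a later file's header differs
    from the first one, and appends "\n" after a file that lacks a final newline.
    """
    expected_header = None
    with open(out_path, "wb") as outfile:
        for fname in sorted(paths):
            with open(fname, "rb") as infile:
                header = infile.readline()
                if expected_header is None:
                    # First file: remember its header and write it once
                    expected_header = header.rstrip(b"\r\n")
                    outfile.write(header)
                    copied_from = 0
                elif header.rstrip(b"\r\n") != expected_header:
                    raise ValueError("header mismatch in " + fname)
                else:
                    copied_from = infile.tell()
                # Block copy the rest of the file without parsing
                shutil.copyfileobj(infile, outfile)
                end = infile.tell()
                if end > copied_from:
                    # Something from this file was written; check its last byte
                    infile.seek(end - 1)
                    if infile.read(1) != b"\n":
                        outfile.write(b"\n")


# Usage example with two throwaway files (sample data, not from the question)
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.csv")
    b = os.path.join(tmp, "b.csv")
    with open(a, "wb") as f:
        f.write(b"date,close\n2020-01-01,10")  # note: no trailing newline
    with open(b, "wb") as f:
        f.write(b"date,close\n2020-01-02,11\n")
    out = os.path.join(tmp, "merged_out.csv")
    concat_csv_checked([a, b], out)
    with open(out, "rb") as f:
        result = f.read()

print(result.decode())
```

The extra readline and one-byte read per file add very little overhead compared with the block copy, so this should stay close to the speed of the plain version.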