Concatenating multiple csv files into a single csv with the same header - Python

Question

I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).

import glob

import pandas as pd

# import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []

for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None)
    list_.append(df)
    stockstats_data = pd.concat(list_)
    print(file_ + " has been imported.")

This code works fine, but it is slow. It can take up to 2 days to process.

I was given a single line script for Terminal command line that does the same (but with no headers). This script takes 20 seconds.

 for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done 

Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing it into a DataFrame and just concatenating the CSVs, but I cannot figure it out.

Thanks.

Recommended answer

If you don't need the CSV in memory and are just copying from input to output, it's much cheaper to avoid parsing altogether and to copy without accumulating anything in memory:

import shutil
import glob


#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")

That's it; shutil.copyfileobj copies the data efficiently, eliminating nearly all of the Python-level work of parsing and reserializing.
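
To make the "no parsing" point concrete, here is a rough sketch of the kind of chunked copy that shutil.copyfileobj performs. The helper name copy_in_chunks and the chunk size are hypothetical and chosen for illustration; this is not the library's actual source, and in-memory buffers stand in for real files.

```python
import io


def copy_in_chunks(src, dst, chunk_size=64 * 1024):
    """Copy bytes from src to dst one fixed-size chunk at a time.

    Hypothetical helper: it mirrors the general idea behind
    shutil.copyfileobj, not its exact implementation.
    """
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of file
            break
        dst.write(chunk)


# In-memory stand-ins for an input CSV and the merged output file.
source = io.BytesIO(b"a,b\n1,2\n3,4\n")
target = io.BytesIO()

source.readline()  # discard the header line, as done for every file but the first
copy_in_chunks(source, target, chunk_size=4)  # tiny chunk size just to exercise the loop
```

Memory use is bounded by the chunk size rather than the file size, and no row is ever split into fields, which is where the pandas version spends its time.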

This assumes all the CSV files have the same format, encoding, line endings, etc., and the header doesn't contain embedded newlines, but if that's the case, it's a lot faster than the alternatives.
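
If you would rather check some of those assumptions than trust them, the following is a minimal sketch of a more defensive variant. The names merge_csv_files and ends_with_newline are hypothetical, and the behaviour (raising on a mismatched header, appending a newline when a file lacks a final one) is one possible design choice, not part of the original answer. It still assumes a single-line header and "\n" or "\r\n" line endings.

```python
import os
import shutil
import tempfile


def ends_with_newline(f):
    """Return True if the binary file f is empty or its last byte is a newline."""
    f.seek(0, os.SEEK_END)
    if f.tell() == 0:
        return True
    f.seek(-1, os.SEEK_END)
    return f.read(1) == b"\n"


def merge_csv_files(paths, out_path):
    """Concatenate CSV files byte-for-byte, keeping only the first header.

    Hypothetical helper: raises ValueError if a later header differs from
    the first one, and adds a newline after any file that lacks a final one.
    """
    expected = None
    with open(out_path, "wb") as outfile:
        for fname in paths:
            with open(fname, "rb") as infile:
                start = outfile.tell()
                header = infile.readline()
                if expected is None:
                    expected = header.rstrip(b"\r\n")
                    outfile.write(header)  # keep the header of the first file only
                elif header.rstrip(b"\r\n") != expected:
                    raise ValueError(f"{fname}: header differs from the first file")
                shutil.copyfileobj(infile, outfile)  # block copy, no parsing
                # Avoid gluing this file's last row onto the next file's first row.
                if outfile.tell() > start and not ends_with_newline(infile):
                    outfile.write(b"\n")


# Small demonstration with two temporary files; the first lacks a final newline.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.csv")
    second = os.path.join(tmp, "b.csv")
    out = os.path.join(tmp, "merged.csv")
    with open(first, "wb") as f:
        f.write(b"x,y\n1,2\n3,4")
    with open(second, "wb") as f:
        f.write(b"x,y\n5,6\n")
    merge_csv_files([first, second], out)
    with open(out, "rb") as f:
        merged = f.read()
```

The header comparison reads only one line per file, so the extra checks add very little to the cost of the raw copy.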
