Concatenating multiple csv files into a single csv with the same header - Python

Question

I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).

import glob
import pandas as pd

# import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []

for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None,)
    list_.append(df)
    stockstats_data = pd.concat(list_)
    print(file_ + " has been imported.")

This code works fine, but it is slow. It can take up to 2 days to process.

I was given a single line script for Terminal command line that does the same (but with no headers). This script takes 20 seconds.

 for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done 

Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing it into a DataFrame and just concatenating the CSVs, but I cannot figure it out.

Thanks.

Recommended answer

If you don't need the CSV in memory and are just copying from input to output, it's much cheaper to skip parsing entirely and copy the data without building it up in memory:

import shutil
import glob


#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")

That's it; shutil.copyfileobj copies the data efficiently, dramatically reducing the Python-level work of parsing and reserializing.
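To make "block copy" concrete, here is a rough sketch of what a chunked copy loop looks like. It is an illustration only, not the actual source of shutil.copyfileobj; the helper name copy_in_chunks and the 64 KiB default chunk size are assumptions made up for this example.

```python
import io


def copy_in_chunks(src, dst, chunk_size=64 * 1024):
    """Copy bytes from src to dst in fixed-size chunks (hypothetical helper)."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of file
            break
        dst.write(chunk)


# Small in-memory demonstration: discard the header line, then copy the rest.
source = io.BytesIO(b"a,b\n1,2\n3,4\n")
target = io.BytesIO()
source.readline()  # throw away the header, as the answer does for later files
copy_in_chunks(source, target, chunk_size=4)
```

The bytes are never split into fields or rows, which is where the saving over a CSV parser comes from.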

This approach assumes that all the CSV files have the same format, encoding, line endings, etc., and that the header doesn't contain embedded newlines; when those assumptions hold, it's a lot faster than the alternatives.
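If you would rather check the "same header" assumption than trust it, a variant along the following lines might help. This is a minimal sketch, not part of the original answer: the function name merge_csvs and its parameters are hypothetical, and it compares only the raw first line of each file, ignoring the line ending.

```python
import glob
import os
import shutil


def merge_csvs(pattern, out_path):
    """Concatenate CSVs matching pattern into out_path, keeping one header.

    Hypothetical helper: raises ValueError if a file's first line differs
    from the first file's header. Returns the number of files merged.
    """
    out_abs = os.path.abspath(out_path)
    # Sort for a stable order, and skip the output file if it matches the pattern.
    files = sorted(f for f in glob.glob(pattern) if os.path.abspath(f) != out_abs)
    header = None
    with open(out_path, "wb") as outfile:
        for fname in files:
            with open(fname, "rb") as infile:
                first_line = infile.readline()
                if header is None:
                    header = first_line
                    outfile.write(header)  # keep the header exactly once
                elif first_line.rstrip(b"\r\n") != header.rstrip(b"\r\n"):
                    raise ValueError(f"{fname}: header differs from first file")
                # Block copy the remaining rows without parsing them
                shutil.copyfileobj(infile, outfile)
    return len(files)
```

With the paths from the question, a call might look like merge_csvs(r'data/US/market/merged_data/*.csv', 'someoutputfile.csv'). The sketch still does nothing about a file whose last row lacks a trailing newline; in that case the first row of the next file would be joined onto it.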
