Concatenating multiple csv files into a single csv with the same header - Python

Views: 32 | Published: 2021/12/28 10:19:22 | Tags: python, csv, pandas, terminal, concatenation


Question

I am currently using the code below to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).

import glob
import pandas as pd

# import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []

for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None)
    list_.append(df)
    stockstats_data = pd.concat(list_)
    print(file_ + " has been imported.")

This code works fine, but it is slow. It can take up to 2 days to process.

I was given a one-line script for the terminal command line that does the same thing (but with no headers). This script takes 20 seconds.

 for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done

Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing the files into a DataFrame and just concatenating the CSVs, but I cannot figure out how.

Thanks.

Recommended answer

If you don't need the CSV data in memory and are only copying from input to output, it is much cheaper to skip parsing entirely and copy without building anything up in memory:

import shutil
import glob


#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")

That's it; shutil.copyfileobj copies the data efficiently, dramatically reducing the Python-level work of parsing and reserializing.
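To make the cost difference concrete, here is a simplified sketch of the kind of loop shutil.copyfileobj performs: it reads fixed-size chunks of bytes and writes them straight back out, without ever splitting rows or fields. The helper name `copy_in_chunks` and the 64 KiB chunk size are illustrative assumptions, not the library's actual source.

```python
import io

def copy_in_chunks(src, dst, chunk_size=64 * 1024):
    """Copy bytes from src to dst in fixed-size chunks (hypothetical helper)."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of file
            break
        dst.write(chunk)

# In-memory files stand in for real CSV files here
source = io.BytesIO(b"a,b\n1,2\n3,4\n")
target = io.BytesIO()
source.readline()               # discard the header, as in the answer above
copy_in_chunks(source, target)  # target now holds only the data rows
```

Because the loop never inspects the bytes it moves, its cost grows with file size only, not with the number of rows or columns.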

This assumes all the CSV files have the same format, encoding, line endings, etc., and that the header doesn't contain embedded newlines. If those assumptions hold, it is a lot faster than the alternatives.
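If you are not sure those assumptions hold across thousands of files, a cheap byte-level check can catch the most common problems without giving up the speed. The sketch below is one possible variation, not part of the original answer: the function name `merge_csv_files` is hypothetical, it compares each file's header line against the first file's, and it adds a newline after any file whose last row lacks one so that rows from adjacent files are not glued together.

```python
import os
import shutil

def merge_csv_files(paths, out_path):
    """Merge CSV files byte-for-byte, keeping only the first header.

    Hypothetical helper: raises ValueError when a header differs from
    the first file's header (compared without line endings).
    """
    first_header = None
    with open(out_path, 'wb') as outfile:
        for fname in paths:
            with open(fname, 'rb') as infile:
                header = infile.readline()
                is_first = first_header is None
                if is_first:
                    first_header = header.rstrip(b'\r\n')
                    outfile.write(header)
                elif header.rstrip(b'\r\n') != first_header:
                    raise ValueError(fname + " has a different header")
                # Block copy the remaining rows without parsing them
                shutil.copyfileobj(infile, outfile)
                # Bytes this file contributed to the output
                emitted = infile.tell() if is_first else infile.tell() - len(header)
                if emitted > 0:
                    infile.seek(-1, os.SEEK_END)
                    if infile.read(1) != b'\n':
                        outfile.write(b'\n')  # file lacked a final newline
```

It could be called as `merge_csv_files(sorted(glob.glob(path + "/*.csv")), 'someoutputfile.csv')`. Note that on a header mismatch the output file is left partially written, so treat it as invalid if the function raises.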
