Concatenating multiple csv files into a single csv with the same header - Python

Views: 92 | Published: 2020/5/23 22:49:19 | Tags: python csv pandas terminal concatenation

Question

I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).

import glob
import pandas as pd

# import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []

for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None)
    list_.append(df)
    stockstats_data = pd.concat(list_)  # re-concatenates every frame read so far on each pass
    print(file_ + " has been imported.")

This code works fine, but it is slow. It can take up to 2 days to process.

I was given a single line script for Terminal command line that does the same (but with no headers). This script takes 20 seconds.

 for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done

Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing it into a DataFrame and just concatenating the CSVs, but I cannot figure it out.

Thanks.

Recommended answer

If you don't need the CSV data in memory and are only copying from input to output, it is much cheaper to avoid parsing entirely and to copy without accumulating anything in memory:

import shutil
import glob


#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")

That's it; shutil.copyfileobj copies the data efficiently, dramatically reducing the Python-level work of parsing and reserializing.
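
The block copy skips pandas entirely. If the combined data is needed in a DataFrame after all, a likely cause of the slowness in the question's loop is that pd.concat runs on the whole list at every iteration, so the data read so far is copied again for each new file. A minimal sketch of the usual alternative follows: read every file first, then concatenate once. The function name load_all_csvs is hypothetical, and the sketch assumes all files fit in memory together.

```python
import glob
import os

import pandas as pd


def load_all_csvs(folder):
    """Read every CSV in `folder` and combine them with a single concat."""
    # Sort so the row order of the result is deterministic
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not paths:
        raise FileNotFoundError(f"no CSV files found in {folder!r}")
    frames = [pd.read_csv(p, index_col=None) for p in paths]
    # Concatenate once, after all files are read;
    # ignore_index=True gives the result a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

With the folder from the question this would be called as stockstats_data = load_all_csvs(r'data/US/market/merged_data'), and stockstats_data.to_csv('merged.csv', index=False) would then write the single-header file. It still parses every file, so it should be expected to be slower than the block copy.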

This assumes all the CSV files have the same format, encoding, line endings, etc., and the header doesn't contain embedded newlines, but if that's the case, it's a lot faster than the alternatives.
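
Two of these assumptions can be checked cheaply while still copying in blocks. The sketch below is one possible variant, not part of the original answer: it compares each file's header line with the first one and adds a newline when a file does not end with one, since otherwise the next file's first row would be glued onto the previous file's last row. The name merge_csvs_checked is hypothetical, and the sketch assumes "\n" or "\r\n" line endings and a header with no embedded newlines.

```python
import os
import shutil


def merge_csvs_checked(paths, out_path):
    """Block-copy CSV files into out_path, keeping only the first header.

    Raises ValueError if a file's header line differs from the first file's.
    If that happens, out_path is left partially written.
    """
    expected_header = None
    with open(out_path, "wb") as outfile:
        for path in paths:
            with open(path, "rb") as infile:
                header = infile.readline()
                if not header:
                    continue  # skip completely empty files
                if expected_header is None:
                    expected_header = header.rstrip(b"\r\n")
                    outfile.write(header)  # keep the header of the first file only
                elif header.rstrip(b"\r\n") != expected_header:
                    raise ValueError(f"header mismatch in {path!r}")
                # Copy the remaining bytes without parsing them
                shutil.copyfileobj(infile, outfile)
                # Look at the file's last byte; add a newline if it is missing
                if infile.tell() > 0:
                    infile.seek(-1, os.SEEK_END)
                    if infile.read(1) != b"\n":
                        outfile.write(b"\n")
```

It could be called with the sorted allFiles list from the answer, for example merge_csvs_checked(allFiles, 'someoutputfile.csv'). The header comparison is byte-for-byte, so files that differ only in encoding or in quoting of the header would be reported as mismatches.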
