What is the fastest way to combine 100 CSV files with headers into one?

Question

What is the fastest way to combine 100 CSV files with headers into one, with the following setup:

  1. The total size of the files is 200 MB (the size was reduced to keep the computation times visible).
  2. The files are on an SSD with a maximum speed of 240 MB/s.
  3. The CPU has 4 cores, so multithreading and multiple processes are allowed.
  4. Only a single node exists (important for Spark).
  5. The available memory is 15 GB, so the files easily fit into memory.
  6. The OS is Linux (Debian Jessie).
  7. The machine is actually an n1-standard-4 instance on Google Cloud.

(The detailed setup was included to make the scope of the question more specific. The changes were made according to the feedback here.)

File 1.csv:

a,b
1,2

File 2.csv:

a,b
3,4

Final out.csv:

a,b
1,2
3,4

According to my benchmarks, the fastest of all the proposed methods is pure Python. Is there any faster method?

Benchmarks (updated with methods from the comments and posts):

Method                      Time
pure python                  0.298s
sed                          1.9s
awk                          2.5s
R data.table                 4.4s
R data.table with colClasses 4.4s
Spark 2                     40.2s
python pandas          1min 11.0s

Tool versions:

sed 4.2.2
awk: mawk 1.3.3 Nov 1996
Python 3.6.1
Pandas 0.20.1
R 3.4.0
data.table 1.10.4
Spark 2.1.1

The code in Jupyter notebook cells:

sed:

%%time
# take only the header row from the first file, then append all data rows
!head -n 1 temp/in/1.csv > temp/merged_sed.csv
!sed 1d temp/in/*.csv >> temp/merged_sed.csv

Pure Python, all binary read-write, using the undocumented behavior of next():

%%time
with open("temp/merged_pure_python2.csv","wb") as fout:
    # first file:
    with open("temp/in/1.csv", "rb") as f:
        fout.write(f.read())
    # now the rest:    
    for num in range(2,101):
        with open("temp/in/"+str(num)+".csv", "rb") as f:
            next(f) # skip the header
            fout.write(f.read())
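
For contrast, a minimal sketch of a chunked variant (not part of the original benchmark; the output filename is made up here): it streams each file with shutil.copyfileobj instead of reading it fully into memory, and skips headers with the documented readline() instead of next().

import shutil

# Sketch (not benchmarked above): stream each file in fixed-size blocks
# instead of loading it fully into memory. The output name
# "temp/merged_chunked.csv" is invented for this example.
with open("temp/merged_chunked.csv", "wb") as fout:
    # first file: copy as-is, header included
    with open("temp/in/1.csv", "rb") as f:
        shutil.copyfileobj(f, fout)
    # remaining files: consume the header line, then stream the rest
    for num in range(2, 101):
        with open("temp/in/" + str(num) + ".csv", "rb") as f:
            f.readline()  # documented way to skip the header
            shutil.copyfileobj(f, fout)

With 200 MB of input on a 15 GB machine, reading whole files is fine; the chunked form only matters when the inputs outgrow memory.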

awk:

%%time
!awk 'NR==1; FNR==1{{next}} 1' temp/in/*.csv > temp/merged_awk.csv

(NR==1 prints the very first input line, i.e. the header of the first file; FNR==1{next} then skips the header line of every file; the trailing 1 prints all remaining lines. The doubled braces escape IPython's {} interpolation inside ! shell commands.)

R data.table:

%%time
%%R
library(data.table)  # provides fread, rbindlist and fwrite
filenames <- paste0("temp/in/",list.files(path="temp/in/",pattern="*.csv"))
files <- lapply(filenames, fread)
merged_data <- rbindlist(files, use.names=F)
fwrite(merged_data, file="temp/merged_R_fwrite.csv", row.names=FALSE)

R data.table with colClasses:

%%time
%%R
library(data.table)
filenames <- paste0("temp/in/",list.files(path="temp/in/",pattern="*.csv"))
files <- lapply(filenames, fread,colClasses=c(
    V1="integer",
    V2="integer",
    V3="integer",
    V4="integer",
    V5="integer",
    V6="integer",
    V7="integer",
    V8="integer",
    V9="integer",
    V10="integer"))
merged_data <- rbindlist(files, use.names=F)
fwrite(merged_data, file="temp/merged_R_fwrite.csv", row.names=FALSE)

Spark (pyspark):

%%time
df = spark.read.format("csv").option("header", "true").load("temp/in/*.csv")
df.coalesce(1).write.option("header", "true").csv("temp/merged_pyspark.csv")
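
One caveat: coalesce(1).write.csv("temp/merged_pyspark.csv") creates a directory of that name holding a single part file, not a plain CSV file. A hedged sketch for pulling the part file out afterwards (the part-* glob and the target filename are assumptions; part-file naming differs across Spark versions):

import glob
import shutil

# Sketch: move the single CSV part file out of the Spark output directory.
# The "part-*.csv" pattern and the target name are assumptions.
part_file = glob.glob("temp/merged_pyspark.csv/part-*.csv")[0]
shutil.move(part_file, "temp/merged_pyspark_single.csv")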

Python pandas:

%%time
import glob
import pandas as pd

interesting_files = glob.glob("temp/in/*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)

full_df.to_csv("temp/merged_pandas.csv", index=False)

Data was generated by:

%%R
library(data.table)  # for the data.table() constructor
df=data.table(replicate(10,sample(0:9,100000,rep=TRUE)))
for (i in 1:100){
    write.csv(df,paste0("temp/in/",i,".csv"), row.names=FALSE)
}
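
For readers without R, a roughly equivalent generator in Python (a sketch, not the code used for the benchmark; like the R loop it writes one and the same table to all 100 files, with write.csv-style quoted V1..V10 headers):

import csv
import random

# Sketch: mimic the R generator above -- one table of 100,000 rows x 10
# integer columns with values 0..9, written to temp/in/1.csv ... 100.csv.
rows = [[random.randint(0, 9) for _ in range(10)] for _ in range(100000)]
header = ["V" + str(i) for i in range(1, 11)]  # write.csv's default names
for i in range(1, 101):
    with open("temp/in/" + str(i) + ".csv", "w", newline="") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
        writer.writerow(header)   # header strings get quoted, as in write.csv
        writer.writerows(rows)    # integers are written unquoted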

Answer

According to the benchmarks in the question, the fastest method is pure Python exploiting the undocumented behavior of next() with binary files. The method was proposed by Stefan Pochmann.

Benchmarks (updated with methods from the comments and posts):

Method                      Time
pure python                  0.298s
sed                          1.9s
awk                          2.5s
R data.table                 4.4s
R data.table with colClasses 4.4s
Spark 2                     40.2s
python pandas          1min 11.0s

Tool versions:

sed 4.2.2
awk: mawk 1.3.3 Nov 1996
Python 3.6.1
Pandas 0.20.1
R 3.4.0
data.table 1.10.4
Spark 2.1.1

Pure Python code:

with open("temp/merged_pure_python2.csv","wb") as fout:
    # first file:
    with open("temp/in/1.csv", "rb") as f:
        fout.write(f.read())
    # now the rest:    
    for num in range(2,101):
        with open("temp/in/"+str(num)+".csv", "rb") as f:
            next(f) # skip the header
            fout.write(f.read())
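
A quick sanity check of the merged output (a sketch; the expected count assumes the 100 files of 100,000 rows generated above):

# Each input has 100,000 data rows, so the merged file should contain
# 100 * 100,000 data rows plus exactly one header line.
with open("temp/merged_pure_python2.csv", "rb") as f:
    line_count = sum(1 for _ in f)
assert line_count == 100 * 100000 + 1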
