What is the fastest way to combine 100 CSV files with headers into one?

Question

What is the fastest way to combine 100 CSV files with headers into one, with the following setup:

  1. The total size of the files is 200 MB (the size was reduced to keep the computation times visible).
  2. The files are on an SSD with a maximum speed of 240 MB/s.
  3. The CPU has 4 cores, so multithreading and multiple processes are allowed.
  4. Only a single node exists (important for Spark).
  5. The available memory is 15 GB, so the files easily fit into memory.
  6. The OS is Linux (Debian Jessie).
  7. The machine is actually an n1-standard-4 instance on Google Cloud.

(The detailed setup was included to make the scope of the question more specific. The changes were made according to the feedback here.)

File 1.csv:

a,b
1,2

File 2.csv:

a,b
3,4

Final out.csv:

a,b
1,2
3,4

According to my benchmarks, the fastest of all the proposed methods is pure Python. Is there any faster method?

Benchmarks (updated with methods from the comments and posts):

Method                      Time
pure python                  0.298s
sed                          1.9s
awk                          2.5s
R data.table                 4.4s
R data.table with colClasses 4.4s
Spark 2                     40.2s
python pandas          1min 11.0s

Tool versions:

sed 4.2.2
awk: mawk 1.3.3 Nov 1996
Python 3.6.1
Pandas 0.20.1
R 3.4.0
data.table 1.10.4
Spark 2.1.1

The code in Jupyter notebook cells:

sed:

%%time
# take only the header row from the first file, then append all data rows
!head -n 1 temp/in/1.csv > temp/merged_sed.csv
!sed 1d temp/in/*.csv >> temp/merged_sed.csv

Pure Python, all binary read-write, using the undocumented behavior of next():

%%time
with open("temp/merged_pure_python2.csv","wb") as fout:
    # first file:
    with open("temp/in/1.csv", "rb") as f:
        fout.write(f.read())
    # now the rest:    
    for num in range(2,101):
        with open("temp/in/"+str(num)+".csv", "rb") as f:
            next(f) # skip the header
            fout.write(f.read())
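
For contrast, a minimal sketch of a chunked variant (not part of the original benchmark; the output filename is made up here): it streams each file with shutil.copyfileobj instead of reading it fully into memory, and skips headers with the documented readline() instead of next().

import shutil

# Sketch (not benchmarked above): stream each file in fixed-size blocks
# instead of loading it fully into memory. The output name
# "temp/merged_chunked.csv" is invented for this example.
with open("temp/merged_chunked.csv", "wb") as fout:
    # first file: copy as-is, header included
    with open("temp/in/1.csv", "rb") as f:
        shutil.copyfileobj(f, fout)
    # remaining files: consume the header line, then stream the rest
    for num in range(2, 101):
        with open("temp/in/" + str(num) + ".csv", "rb") as f:
            f.readline()  # documented way to skip the header
            shutil.copyfileobj(f, fout)

With 200 MB of input on a 15 GB machine, reading whole files is fine; the chunked form only matters when the inputs outgrow memory.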

awk:

%%time
!awk 'NR==1; FNR==1{{next}} 1' temp/in/*.csv > temp/merged_awk.csv

(NR==1 prints the very first input line, i.e. the header of the first file; FNR==1{next} then skips the header line of every file; the trailing 1 prints all remaining lines. The doubled braces escape IPython's {} interpolation inside ! shell commands.)

R data.table:

%%time
%%R
library(data.table)  # provides fread, rbindlist and fwrite
filenames <- paste0("temp/in/",list.files(path="temp/in/",pattern="*.csv"))
files <- lapply(filenames, fread)
merged_data <- rbindlist(files, use.names=F)
fwrite(merged_data, file="temp/merged_R_fwrite.csv", row.names=FALSE)

R data.table with colClasses:

%%time
%%R
library(data.table)
filenames <- paste0("temp/in/",list.files(path="temp/in/",pattern="*.csv"))
files <- lapply(filenames, fread,colClasses=c(
    V1="integer",
    V2="integer",
    V3="integer",
    V4="integer",
    V5="integer",
    V6="integer",
    V7="integer",
    V8="integer",
    V9="integer",
    V10="integer"))
merged_data <- rbindlist(files, use.names=F)
fwrite(merged_data, file="temp/merged_R_fwrite.csv", row.names=FALSE)

Spark (pyspark):

%%time
df = spark.read.format("csv").option("header", "true").load("temp/in/*.csv")
df.coalesce(1).write.option("header", "true").csv("temp/merged_pyspark.csv")
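
One caveat: coalesce(1).write.csv("temp/merged_pyspark.csv") creates a directory of that name holding a single part file, not a plain CSV file. A hedged sketch for pulling the part file out afterwards (the part-* glob and the target filename are assumptions; part-file naming differs across Spark versions):

import glob
import shutil

# Sketch: move the single CSV part file out of the Spark output directory.
# The "part-*.csv" pattern and the target name are assumptions.
part_file = glob.glob("temp/merged_pyspark.csv/part-*.csv")[0]
shutil.move(part_file, "temp/merged_pyspark_single.csv")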

Python pandas:

%%time
import glob
import pandas as pd

interesting_files = glob.glob("temp/in/*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)

full_df.to_csv("temp/merged_pandas.csv", index=False)

Data was generated by:

%%R
library(data.table)  # for the data.table() constructor
df=data.table(replicate(10,sample(0:9,100000,rep=TRUE)))
for (i in 1:100){
    write.csv(df,paste0("temp/in/",i,".csv"), row.names=FALSE)
}
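
For readers without R, a roughly equivalent generator in Python (a sketch, not the code used for the benchmark; like the R loop it writes one and the same table to all 100 files, with write.csv-style quoted V1..V10 headers):

import csv
import random

# Sketch: mimic the R generator above -- one table of 100,000 rows x 10
# integer columns with values 0..9, written to temp/in/1.csv ... 100.csv.
rows = [[random.randint(0, 9) for _ in range(10)] for _ in range(100000)]
header = ["V" + str(i) for i in range(1, 11)]  # write.csv's default names
for i in range(1, 101):
    with open("temp/in/" + str(i) + ".csv", "w", newline="") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
        writer.writerow(header)   # header strings get quoted, as in write.csv
        writer.writerows(rows)    # integers are written unquoted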

Answer

According to the benchmarks in the question, the fastest method is pure Python exploiting the undocumented behavior of next() with binary files. The method was proposed by Stefan Pochmann.

Benchmarks (updated with methods from the comments and posts):

Method                      Time
pure python                  0.298s
sed                          1.9s
awk                          2.5s
R data.table                 4.4s
R data.table with colClasses 4.4s
Spark 2                     40.2s
python pandas          1min 11.0s

Tool versions:

sed 4.2.2
awk: mawk 1.3.3 Nov 1996
Python 3.6.1
Pandas 0.20.1
R 3.4.0
data.table 1.10.4
Spark 2.1.1

Pure Python code:

with open("temp/merged_pure_python2.csv","wb") as fout:
    # first file:
    with open("temp/in/1.csv", "rb") as f:
        fout.write(f.read())
    # now the rest:    
    for num in range(2,101):
        with open("temp/in/"+str(num)+".csv", "rb") as f:
            next(f) # skip the header
            fout.write(f.read())
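
A quick sanity check of the merged output (a sketch; the expected count assumes the 100 files of 100,000 rows generated above):

# Each input has 100,000 data rows, so the merged file should contain
# 100 * 100,000 data rows plus exactly one header line.
with open("temp/merged_pure_python2.csv", "rb") as f:
    line_count = sum(1 for _ in f)
assert line_count == 100 * 100000 + 1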
