Speed up the process of importing multiple CSV files into a Python DataFrame


Question

I would like to read multiple CSV files (hundreds of files, each with hundreds of lines but the same number of columns) from a target directory into a single pandas DataFrame.

The code I wrote below works, but it is too slow: it takes minutes to read 30 files, so loading all of my files would take far longer. What can I change to make it faster?

Besides, in the replace call, I want to replace a "_" (I don't know its encoding, but it is not a normal one) with a "-" (normal UTF-8). How can I do that? I use coding=latin-1 because the files contain French accents.

#coding=latin-1

import pandas as pd
import glob

pd.set_option('expand_frame_repr', False)

path = r'D:\Python27\mypfe\data_test'
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col = None, header = 0, sep = ';', dayfirst = True, 
                     parse_dates=['HeurePrevue','HeureDebutTrajet','HeureArriveeSurSite','HeureEffective'])
    df.drop(labels=['aPaye','MethodePaiement','ArgentPercu'],axis=1,inplace=True)
    df['Sens'].replace("\n", "-", inplace=True,regex=True)
    list_.append(df)

    print "fichier lu:",file_

frame = pd.concat(list_)

print frame

Recommended answer

You may try the following: read only the columns you really need, build the per-file DataFrames with a list comprehension, and call pd.concat([ ... ], ignore_index=True) only once, because it is pretty slow:

import pandas as pd

# allFiles is the list built in the question: glob.glob(path + "/*.csv")
# There is no point in reading columns that you don't need.
# List the columns to keep (EXCLUDING 'aPaye', 'MethodePaiement', 'ArgentPercu');
# the names below are placeholders, and the list must include the date columns.
cols = ['col1', 'col2', 'etc.']
date_cols = ['HeurePrevue', 'HeureDebutTrajet', 'HeureArriveeSurSite', 'HeureEffective']

df = pd.concat(
    [pd.read_csv(f, sep=';', dayfirst=True, usecols=cols, parse_dates=date_cols)
     for f in allFiles],
    ignore_index=True
)

This should work if you have enough memory to hold two copies of the data, since the list of per-file DataFrames and the concatenated result exist at the same time.
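
To try the approach end to end without the original files, here is a minimal sketch that writes a few small sample CSVs to a temporary directory and loads them with a single pd.concat call. The sample data, the helper names write_sample_files and load_all, and passing encoding="latin-1" to read_csv are assumptions made for illustration; only 'HeurePrevue', 'Sens' and 'aPaye' are column names taken from the question. The "\n" to "-" replacement from the question is applied once to the combined frame rather than once per file.

```python
import glob
import os
import tempfile

import pandas as pd


def write_sample_files(directory, n_files=3):
    """Write small ';'-separated CSV files (hypothetical sample data)."""
    for i in range(n_files):
        sample = pd.DataFrame({
            "HeurePrevue": ["01/02/2016 08:00", "03/02/2016 09:30"],
            "Sens": ["Aller\nRetour", "Aller"],
            "aPaye": [1, 0],  # unwanted column, skipped via usecols below
        })
        target = os.path.join(directory, "part_%d.csv" % i)
        sample.to_csv(target, sep=";", index=False, encoding="latin-1")


def load_all(directory):
    """Read every CSV in `directory` and concatenate the results once."""
    cols = ["HeurePrevue", "Sens"]  # only the columns we want to keep
    files = sorted(glob.glob(os.path.join(directory, "*.csv")))
    frames = [
        pd.read_csv(f, sep=";", usecols=cols, parse_dates=["HeurePrevue"],
                    dayfirst=True, encoding="latin-1")
        for f in files
    ]
    df = pd.concat(frames, ignore_index=True)
    # Clean the text column once on the combined frame, not once per file.
    df["Sens"] = df["Sens"].str.replace("\n", "-", regex=False)
    return df


with tempfile.TemporaryDirectory() as tmp:
    write_sample_files(tmp)
    df = load_all(tmp)

print(df.shape)
```

Assigning the result of str.replace back to the column avoids relying on inplace=True on a column selection, which is not guaranteed to modify the parent DataFrame.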
