Python: efficient way to append all worksheets in multiple Excel files into a pandas DataFrame

Problem description

I have 20 or more xlsx files, each of which may contain a different number of worksheets. Fortunately, the columns are the same across all worksheets and all files. From the post linked here, I got some ideas. I have been trying a few ways to import and append all Excel files (all worksheets) into a single dataframe (around 4 million rows).

Note: I also checked here, but that only covers the file level; my case covers both the file level and the worksheet level.

# import all necessary packages
import pandas as pd
from pathlib import Path
import glob
import sys

# set source path
source_dataset_path = "C:/Users/aaa/Desktop/Sample_dataset/"
source_dataset_list = glob.iglob(source_dataset_path + "Sales transaction *")

out_df = pd.DataFrame()  # create the empty output dataframe once, before the loop

for file in source_dataset_list:
    # show which file is being processed
    sys.stdout.write(str(file))
    sys.stdout.flush()
    xls = pd.ExcelFile(file)

    for sheet in xls.sheet_names:
        # show which sheet is being processed
        sys.stdout.write(str(sheet))
        sys.stdout.flush()
        df = pd.read_excel(xls, sheet_name=sheet)
        # add this sheet's rows to the output (DataFrame.append was removed in pandas 2.0)
        out_df = pd.concat([out_df, df])

Question:

My approach is to first read each Excel file and get the list of sheets inside it, then load each sheet and append them all. The looping does not seem very efficient, especially since the data size increases with every append.

Recommended answer

Use sheet_name=None in read_excel to get a dict of DataFrames, one per sheet and keyed by sheet name. Join them with concat, then add the result to the final DataFrame:

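out_df = pd.DataFrame()
for f in source_dataset_list:
    # sheet_name=None returns a dict of DataFrames, one per sheet
    df = pd.read_excel(f, sheet_name=None)
    cdf = pd.concat(df.values())
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    out_df = pd.concat([out_df, cdf], ignore_index=True)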
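If you also want to know which sheet each row came from, one option is to pass the whole dict to concat instead of only its values. The sketch below uses a small hand-built dict as a stand-in for what read_excel returns with sheet_name=None; the sheet names and the columns `item` and `amount` are made up for illustration.

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None): a dict keyed by sheet name.
# The sheet names and column names here are invented for the example.
sheets = {
    "Jan": pd.DataFrame({"item": ["a", "b"], "amount": [10, 20]}),
    "Feb": pd.DataFrame({"item": ["c"], "amount": [30]}),
}

# Passing the dict itself (not .values()) turns its keys into an outer index level.
combined = pd.concat(sheets, names=["sheet"])

# Move the sheet level into an ordinary column, then renumber the rows.
combined = combined.reset_index(level="sheet").reset_index(drop=True)

print(combined)
```

The result has a `sheet` column next to the original columns, which can help when tracing a suspicious row back to its worksheet.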

Another solution:

cdf = [pd.read_excel(excel_names, sheet_name=None).values() 
            for excel_names in source_dataset_list]

out_df = pd.concat([pd.concat(x) for x in cdf], ignore_index=True)
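The same idea can be wrapped in a small helper that gathers every sheet from every file into one flat list and calls concat a single time, so no intermediate DataFrame is copied repeatedly. This is only a sketch: the function name `combine_workbooks`, the `reader` parameter, and the `source_file` / `source_sheet` columns are hypothetical additions, not part of the original answer.

```python
import pandas as pd


def combine_workbooks(paths, reader=pd.read_excel):
    """Read every sheet of every workbook and concatenate them in one pass.

    `reader` is a hypothetical hook (defaulting to pd.read_excel) so the
    function can be exercised without real Excel files.
    """
    frames = []
    for path in paths:
        # sheet_name=None asks for all sheets as {sheet name: DataFrame}.
        sheets = reader(path, sheet_name=None)
        for sheet_name, frame in sheets.items():
            # Record where each row came from (illustrative column names).
            frames.append(
                frame.assign(source_file=str(path), source_sheet=sheet_name)
            )
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return pd.DataFrame()
    # A single concat over the flat list avoids growing a DataFrame step by step.
    return pd.concat(frames, ignore_index=True)
```

With the variables from the question, a call might look like combine_workbooks(glob.glob(source_dataset_path + "Sales transaction *")). Using glob.glob rather than glob.iglob gives a list that can be iterated more than once; an iglob iterator is exhausted after the first pass.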
