Looping through .xlsx files using pandas only processes the first file

Problem description

My ultimate goal is to merge the contents of a folder full of .xlsx files into one big file.

I thought the code below would suffice, but it only processes the first file, and I can't figure out why it stops there. The files are small (~6 KB), so it shouldn't be a matter of waiting. If I print f_list, it shows the complete list of files. So where am I going wrong? To be clear, no error is returned; the for loop just doesn't seem to run over every file. I feel like there should be a simple fix, but being new to Python and coding, I'm having trouble seeing it.

I'm doing this with Anaconda on Windows 8.

import pandas as pd
import glob
f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx")  # creates my file list
all_data = pd.DataFrame()             # creates my DataFrame

for f in f_list:                      # basic for loop to go through file list but doesn't
    df = pd.read_excel(f)             # reads .xlsx file
    all_data = all_data.append(df)    # appends file contents to DataFrame
all_data.to_excel("output.xlsx")      # creates new .xlsx

Edit with new information:

After trying some of the suggested changes, I noticed that the output reports the files as empty, except for one that is slightly larger than the others. If I read them into a DataFrame, it reports that the DataFrame is empty. If I read them into a dict, it reports that there are no associated values. Could this have something to do with the file size? Many, if not most, of these files have 3-5 rows and 5 columns. The one file it does read has 12 rows.

Recommended answer

I strongly recommend reading the DataFrames into a dict:

sheets = {f: pd.read_excel(f) for f in f_list}

For one thing, this is very easy to debug: just inspect the dict in the REPL.

For another, you can then concatenate them into a single DataFrame efficiently, in one pass:

pd.concat(sheets.values())

Note: this is much faster than append, which has to allocate a temporary DataFrame on every append call.
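As a minimal sketch of the dict-then-concat approach above: the helper names `read_all`, `report_shapes`, and `combine`, and the `reader` parameter, are hypothetical and chosen only for illustration. Printing each frame's shape is one way to spot files that come back empty. Note also that `DataFrame.append` was removed in pandas 2.0, so on current versions `pd.concat` is the supported route anyway.

```python
import pandas as pd


def read_all(paths, reader=pd.read_excel):
    """Read each path into its own DataFrame, keyed by the path."""
    # `reader` is a hypothetical hook; by default it is pd.read_excel.
    return {path: reader(path) for path in paths}


def report_shapes(frames):
    """Map each path to (rows, columns) so empty reads stand out."""
    return {path: df.shape for path, df in frames.items()}


def combine(frames):
    """Concatenate all frames in one pass, labelling rows by source file."""
    if not frames:
        return pd.DataFrame()
    # Passing a dict makes its keys the outer level of the resulting index.
    return pd.concat(frames, names=["source_file", "row"])


# Usage with the file list from the question:
#   sheets = read_all(f_list)
#   print(report_shapes(sheets))
#   combine(sheets).to_excel("output.xlsx")
```

Keeping the source path in the index makes it easy to see afterwards which rows came from which workbook.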

Another possible issue is that your glob may not be picking up all the files; you should check that it does by printing f_list.
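One way to check the glob, sketched under the assumption that all the workbooks sit directly in one folder: list what the pattern matches alongside what it skips. The function name `split_by_match` is hypothetical.

```python
from pathlib import Path


def split_by_match(folder, pattern="*.xlsx"):
    """Return (matched, skipped) file names in `folder` for a glob pattern."""
    folder = Path(folder)
    matched = sorted(p.name for p in folder.glob(pattern) if p.is_file())
    everything = sorted(p.name for p in folder.iterdir() if p.is_file())
    skipped = [name for name in everything if name not in matched]
    return matched, skipped


# Usage with the folder from the question:
#   matched, skipped = split_by_match("C:\\Users\\me\\dt\\xx")
#   print(len(matched), "matched;", "skipped:", skipped)
```

If a file you expected shows up under `skipped`, the pattern (for example, a different extension such as .xls) is the likely culprit rather than the loop.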
