Looping through .xlsx files using pandas, only does first file

Views: 111. Posted: 2021/6/26 20:45:15. Tags: python, pandas, python-3.4

Problem description

My ultimate goal is to merge the contents of a folder full of .xlsx files into one big file.

I thought the below code would suffice, but it only does the first file, and I can't figure out why it stops there. The files are small (~6 KB), so it shouldn't be a matter of waiting. If I print f_list, it shows the complete list of files. So, where am I going wrong? To be clear, there is no error returned, it just does not do the entire for loop. I feel like there should be a simple fix, but being new to Python and coding, I'm having trouble seeing it.

I'm doing this with Anaconda on Windows 8.

import pandas as pd
import glob
f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx")  # creates my file list
all_data = pd.DataFrame()             # creates my DataFrame

for f in f_list:                      # basic for loop to go through file list but doesn't
    df = pd.read_excel(f)             # reads .xlsx file
    all_data = all_data.append(df)    # appends file contents to DataFrame
all_data.to_excel("output.xlsx")      # creates new .xlsx

Edit with new information

After trying some of the suggested changes, I noticed the output claiming the files are empty, except for 1 of them which is slightly larger than the others. If I put them into the DataFrame, it claims the DataFrame is empty. If I put it into the dict, it claims there are no values associated. Could this have something to do with the file size? Many, if not most, of these files have 3-5 rows with 5 columns. The one it does see has 12 rows.

Recommended answer

I strongly recommend reading the DataFrames into a dict:

sheets = {f: pd.read_excel(f) for f in f_list}

For one thing this is very easy to debug: just inspect the dict in the REPL.
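
As a rough sketch of what that inspection could look like, the helpers below report the shape of every loaded DataFrame and list the ones pandas considers empty. The names summarize_sheets and empty_sheets are hypothetical, and the demo dict uses in-memory stand-ins rather than real workbooks; with real files, sheets would come from the dict comprehension above.

```python
import pandas as pd


def summarize_sheets(sheets):
    """Map each file name to the (rows, columns) shape of its DataFrame."""
    return {name: df.shape for name, df in sheets.items()}


def empty_sheets(sheets):
    """List the file names whose DataFrame has no rows or no columns."""
    return [name for name, df in sheets.items() if df.empty]


# Stand-in data for illustration only (not the asker's files).
demo = {
    "first.xlsx": pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}),
    "second.xlsx": pd.DataFrame(),
}

for name, shape in summarize_sheets(demo).items():
    print(name, shape)
print("empty:", empty_sheets(demo))
```

If a file that clearly has rows in Excel shows up as empty here, the data may sit somewhere read_excel is not looking by default, for example on a sheet other than the first one.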

Another is that you can then concat these into one DataFrame efficiently in one pass:

pd.concat(sheets.values())

Note: this is much faster than append, which has to allocate a temporary DataFrame on every append call.
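
A minimal sketch of the single-pass concat, assuming the dict maps file names to DataFrames. The helper name combine_sheets and the source_file column are illustrative choices, not part of the original answer. Passing the dict itself to pd.concat keeps the keys, so every row can be traced back to its file. This also avoids DataFrame.append, which was removed in pandas 2.0.

```python
import pandas as pd


def combine_sheets(sheets):
    """Concatenate a {file name: DataFrame} dict into one DataFrame.

    The dict keys become the outer index level, which is then moved
    into a 'source_file' column so each row records where it came from.
    """
    combined = pd.concat(sheets, names=["source_file", "row"])
    return combined.reset_index(level="source_file").reset_index(drop=True)


# Stand-in data for illustration only (not the asker's files).
demo = {
    "first.xlsx": pd.DataFrame({"value": [1, 2]}),
    "second.xlsx": pd.DataFrame({"value": [3]}),
}
print(combine_sheets(demo))
```

With real files, the result could then be written out with something like combine_sheets(sheets).to_excel("output.xlsx", index=False).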

Another possible issue is that your glob may not be picking up all the files; you should check that it is by printing f_list.
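
One way to make that check less error-prone is to build the list explicitly and look at it before reading anything. The sketch below is an assumption-laden alternative to the glob call, using pathlib from the standard library. The helper name list_xlsx is hypothetical. It matches the extension case-insensitively and skips the "~$" lock files that Excel leaves next to open workbooks.

```python
from pathlib import Path


def list_xlsx(folder):
    """Return the .xlsx files directly inside folder, sorted by path."""
    return sorted(
        p
        for p in Path(folder).iterdir()
        if p.is_file()
        and p.suffix.lower() == ".xlsx"    # also matches .XLSX
        and not p.name.startswith("~$")    # skip Excel lock files
    )
```

Printing len(list_xlsx(folder)) and comparing it with the number of files visible in the folder is a quick sanity check before the read loop runs.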
