Looping through .xlsx files using pandas, only does first file
Problem description
My ultimate goal is to merge the contents of a folder full of .xlsx files into one big file.
I thought the below code would suffice, but it only does the first file, and I can't figure out why it stops there. The files are small (~6 KB), so it shouldn't be a matter of waiting. If I print f_list, it shows the complete list of files. So, where am I going wrong? To be clear, there is no error returned, it just does not do the entire for loop. I feel like there should be a simple fix, but being new to Python and coding, I'm having trouble seeing it.
I'm doing this with Anaconda on Windows 8.
import glob
import pandas as pd
f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx")  # creates my file list
all_data = pd.DataFrame()  # creates my DataFrame
for f in f_list:  # basic for loop over the file list, but it only handles the first file
    df = pd.read_excel(f)  # reads .xlsx file
    # appends file contents to DataFrame (note: DataFrame.append was removed in pandas 2.0)
    all_data = all_data.append(df)
all_data.to_excel("output.xlsx")  # creates new .xlsx
Edit with new information
After trying some of the suggested changes, I noticed the output claiming the files are empty, except for 1 of them which is slightly larger than the others. If I put them into the DataFrame, it claims the DataFrame is empty. If I put it into the dict, it claims there are no values associated. Could this have something to do with the file size? Many, if not most, of these files have 3-5 rows with 5 columns. The one it does see has 12 rows.
Recommended answer
I strongly recommend reading the DataFrames into a dict:
sheets = {f: pd.read_excel(f) for f in f_list}
For one thing this is very easy to debug: just inspect the dict in the REPL.
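As a rough sketch of what that inspection could look like, the helper below builds a one-row-per-file summary so that files read as empty stand out. The name summarize_sheets and the stand-in DataFrames are hypothetical; in practice the dict would come from the read_excel comprehension above.

```python
import pandas as pd


def summarize_sheets(sheets):
    """Return one row per file with its shape, so empty reads stand out."""
    rows = []
    for name, df in sheets.items():
        rows.append(
            {"file": name, "rows": df.shape[0], "cols": df.shape[1], "empty": df.empty}
        )
    return pd.DataFrame(rows)


# Stand-in data for illustration only; with real files this would be
# sheets = {f: pd.read_excel(f) for f in f_list}
sheets = {
    "a.xlsx": pd.DataFrame({"x": [1, 2, 3]}),
    "b.xlsx": pd.DataFrame(),  # simulates a file that was read as empty
}
summary = summarize_sheets(sheets)
print(summary)
```

Any file flagged as empty here is worth opening by hand to see where its data actually sits.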
Another is that you can then concat these into one DataFrame efficiently in one pass:
pd.concat(sheets.values())
Note: this is much faster than append, which has to allocate a temporary DataFrame on each append call.
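A minimal sketch of the single-pass merge, assuming you also want to remember which file each row came from. The helper name merge_sheets, the source_file column, and the stand-in DataFrames are assumptions for illustration, not part of the original answer. Passing the dict itself to pd.concat uses its keys as an outer index level.

```python
import pandas as pd


def merge_sheets(sheets):
    """Concatenate a dict of DataFrames once, keeping the file name as a column."""
    # The dict keys become an outer index level named "source_file".
    combined = pd.concat(sheets, names=["source_file"])
    # Move that level into a regular column and renumber the rows.
    return combined.reset_index(level="source_file").reset_index(drop=True)


# Stand-in data for illustration only; with real files this would be
# sheets = {f: pd.read_excel(f) for f in f_list}
sheets = {
    "a.xlsx": pd.DataFrame({"x": [1, 2]}),
    "b.xlsx": pd.DataFrame({"x": [3, 4, 5]}),
}
all_data = merge_sheets(sheets)
print(all_data)
```

From there, all_data.to_excel("output.xlsx") writes the merged result, as in the question.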
An alternative issue is that your glob may not be picking up all the files; you should check that it is by printing f_list.
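One way to make that check explicit is sketched below. The helper list_xlsx is hypothetical, and the filter for names starting with "~$" is an added assumption: Excel creates such lock files next to a workbook that is currently open, and they match *.xlsx without being readable workbooks. The demo uses a temporary folder with empty placeholder files rather than real spreadsheets.

```python
import glob
import os
import tempfile


def list_xlsx(folder):
    """Return sorted .xlsx paths in folder, skipping Excel's '~$' lock files."""
    pattern = os.path.join(folder, "*.xlsx")
    return sorted(
        f for f in glob.glob(pattern) if not os.path.basename(f).startswith("~$")
    )


# Demo in a throwaway folder with empty placeholder files
with tempfile.TemporaryDirectory() as folder:
    for name in ["a.xlsx", "b.xlsx", "~$a.xlsx", "notes.txt"]:
        open(os.path.join(folder, name), "w").close()
    found = [os.path.basename(f) for f in list_xlsx(folder)]

print(len(found), found)
```

Comparing the printed count against the number of files visible in the folder shows quickly whether the pattern is missing anything.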