How to increase processing speed using read_excel in pandas?
Question
I need to use pd.read_excel to process every sheet in an Excel file.
In most cases I do not know the sheet names in advance,
so I use the following to count how many sheets the file contains:
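import pandas as pd

i_sheet_count = 0
while True:
    try:
        # read the sheet at this index (sheet_name was sheetname in older pandas)
        pd.read_excel('/tmp/1.xlsx', sheet_name=i_sheet_count)
    except (IndexError, ValueError):
        # no sheet at this index, so every sheet has been counted
        break
    i_sheet_count += 1
print(i_sheet_count)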
This turned out to be quite slow.
Can read_excel read only a limited number of rows to improve speed?
I tried nrows, but it did not help; reading was still slow.
Recommended answer
Read all worksheets without guessing
Pass sheet_name=None (sheetname in older pandas releases) to pd.read_excel. This reads all worksheets into a dictionary of DataFrames keyed by sheet name. For example:
import pandas as pd

# sheet_name=None returns a dict mapping each sheet name to its DataFrame
dfs = pd.read_excel('file.xlsx', sheet_name=None)
# access the 'Sheet1' worksheet
res = dfs['Sheet1']
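Since the result is an ordinary dictionary, the number of sheets and the size of each one can be read off directly, with no need to probe sheet indexes one at a time. The following is a minimal sketch: summarize_sheets is a hypothetical helper name, and the dictionary is built by hand as a stand-in for the result of pd.read_excel so that the example runs without an Excel file.

```python
import pandas as pd


def summarize_sheets(dfs):
    """Return {sheet name: (rows, columns)} for a dict of DataFrames."""
    return {name: df.shape for name, df in dfs.items()}


# Stand-in for: dfs = pd.read_excel('file.xlsx', sheet_name=None)
dfs = {
    'Sheet1': pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}),
    'Sheet2': pd.DataFrame({'x': [10, 20]}),
}

sheet_count = len(dfs)          # number of worksheets
shapes = summarize_sheets(dfs)  # size of each worksheet
print(sheet_count, shapes)
```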
Limit number of rows or columns
You can use the usecols and skipfooter arguments (parse_cols and skip_footer in older pandas releases) to limit the number of columns and/or rows. This can reduce read time, and it also works with sheet_name=None.
For example, the following reads only the first 3 columns and, if your worksheet has 100 rows, only the first 20 of them.
dfs = pd.read_excel('file.xlsx', sheet_name=None, usecols='A:C', skipfooter=80)
If you wish to apply worksheet-specific logic, you can do so by extracting the sheet names:
# Open the workbook once and reuse it for every sheet
xls = pd.ExcelFile('file.xlsx')
dfs = {}
for sheet in xls.sheet_names:
    # apply worksheet-specific options here
    dfs[sheet] = pd.read_excel(xls, sheet_name=sheet)
Improving performance
Reading Excel files into pandas is inherently slower than reading other formats (CSV, Pickle, HDF5). If you wish to improve performance, I strongly suggest you consider one of these formats instead.
One option, for example, is to use a VBA script to convert your Excel worksheets to CSV files, then read them with pd.read_csv.
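The same convert-once idea can also be sketched in Python rather than VBA: pay the slow Excel read a single time, write each sheet to CSV, and read the CSV files on every later run. This is only an illustration under stated assumptions. write_sheets_to_csv is a hypothetical helper, the temporary output directory is arbitrary, and the hand-built dictionary stands in for the result of pd.read_excel so that the example runs without an Excel file. It also assumes sheet names are valid file names.

```python
import os
import tempfile

import pandas as pd


def write_sheets_to_csv(dfs, out_dir):
    """Write each DataFrame to <out_dir>/<sheet name>.csv and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for name, df in dfs.items():
        path = os.path.join(out_dir, f'{name}.csv')
        df.to_csv(path, index=False)  # drop the index so the CSV round-trips cleanly
        paths[name] = path
    return paths


# One-off slow step (commented out so the sketch runs without an Excel file):
# dfs = pd.read_excel('file.xlsx', sheet_name=None)
dfs = {'Sheet1': pd.DataFrame({'a': [1, 2], 'b': [3, 4]})}  # stand-in data

out_dir = tempfile.mkdtemp()
paths = write_sheets_to_csv(dfs, out_dir)

# Fast step, repeated on every later run:
reloaded = pd.read_csv(paths['Sheet1'])
```

CSV does not store column types, so dates and other non-numeric columns may need parse_dates or dtype arguments when they are read back.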