将 Excel 文件读取到 Pandas 数据框的更快方法 [英] Faster way to read Excel files to pandas dataframe
问题描述
我有一个 14MB Excel 文件,其中包含五个工作表我正在读入 Pandas 数据帧,虽然下面的代码有效,但需要 9 分钟!
I have a 14MB Excel file with five worksheets that I'm reading into a Pandas dataframe, and although the code below works, it takes 9 minutes!
有没有人有加快速度的建议?
Does anyone have suggestions for speeding it up?
import pandas as pd
def OTT_read(xl,site_name):
df = pd.read_excel(xl.io,site_name,skiprows=2,parse_dates=0,index_col=0,
usecols=[0,1,2],header=None,
names=['date_time','%s_depth'%site_name,'%s_temp'%site_name])
return df
def make_OTT_df(FILEDIR,OTT_FILE):
xl = pd.ExcelFile(FILEDIR + OTT_FILE)
site_names = xl.sheet_names
df_list = [OTT_read(xl,site_name) for site_name in site_names]
return site_names,df_list
FILEDIR='c:/downloads/'
OTT_FILE='OTT_Data_All_stations.xlsx'
site_names_OTT,df_list_OTT = make_OTT_df(FILEDIR,OTT_FILE)
推荐答案
正如其他人所建议的,csv 读取速度更快.因此,如果您在 Windows 上并拥有 Excel,则可以调用 vbscript 将 Excel 转换为 csv,然后读取 csv.我尝试了下面的脚本,大约花了 30 秒.
As others have suggested, csv reading is faster. So if you are on windows and have Excel, you could call a vbscript to convert the Excel to csv and then read the csv. I tried the script below and it took about 30 seconds.
# create a list with sheet numbers you want to process
sheets = map(str,range(1,6))
# convert each sheet to csv and then read it using read_csv
df={}
from subprocess import call
excel='C:\Users\rsignell\OTT_Data_All_stations.xlsx'
for sheet in sheets:
csv = 'C:\Users\rsignell\test' + sheet + '.csv'
call(['cscript.exe', 'C:\Users\rsignell\ExcelToCsv.vbs', excel, csv, sheet])
df[sheet]=pd.read_csv(csv)
这是创建 ExcelToCsv.vbs 脚本的一小段 Python 代码:
Here's a little snippet of python to create the ExcelToCsv.vbs script:
#write vbscript to file
vbscript="""if WScript.Arguments.Count < 3 Then
WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
Wscript.Quit
End If
csv_format = 6
Set objFSO = CreateObject("Scripting.FileSystemObject")
src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))
Dim oExcel
Set oExcel = CreateObject("Excel.Application")
Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate
oBook.SaveAs dest_file, csv_format
oBook.Close False
oExcel.Quit
""";
f = open('ExcelToCsv.vbs','w')
f.write(vbscript.encode('utf-8'))
f.close()
这个答案得益于在命令行上将 XLS 转换为 CSV和 csv &xlsx 文件导入到 Pandas 数据框:速度问题
这篇关于将 Excel 文件读取到 Pandas 数据框的更快方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!