Loading Excel file chunk by chunk with Python instead of loading full file into memory


Question

I want to read just 10 lines from Excel files (xlsx) without loading the whole file at once, as it can't be done on one of my machines (low memory).

I tried using:

import xlrd
import pandas as pd
def open_file(path):
    xl = pd.ExcelFile(path)
    reader = xl.parse(chunksize=1000)
    for chunk in reader:
        print(chunk)

It seems like the file is loaded first then divided into parts.

How can I read only the first lines?

Recommended answer

Due to the nature of xlsx files (which are essentially a bunch of xml files zipped together), you can't poke the file at an arbitrary byte and hope for it to be the beginning of the Nth row of the table in the sheet you are interested in.
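To see the "bunch of xml files zipped together" for yourself, the archive can be inspected with the standard library. This is only a minimal sketch; the helper name list_xlsx_parts is made up for illustration, and the member names it reports depend on the workbook.

```python
import zipfile


def list_xlsx_parts(xlsx_file):
    """Return (member name, compressed size, uncompressed size) for each part.

    `xlsx_file` may be a path or a binary file-like object.
    Only the zip directory is read here; no sheet data is decompressed.
    """
    with zipfile.ZipFile(xlsx_file) as archive:
        return [(info.filename, info.compress_size, info.file_size)
                for info in archive.infolist()]
```

Calling list_xlsx_parts('/path/excel.xlsx') on a typical workbook should show the sheet data stored as separate compressed xml members, which is why a byte offset into the xlsx file says nothing about row positions.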

The best you can do is use pandas.read_excel with the skiprows (skips rows from the top of the file) and skipfooter (skips rows from the bottom; spelled skip_footer in older pandas releases) arguments. This, however, will load the whole file into memory first and then parse only the required rows.

import pandas as pd

# if the file contains 300 rows, this will read the middle 100
# header=None keeps the first unskipped row as data instead of consuming it as the header
df = pd.read_excel('/path/excel.xlsx', skiprows=100, skipfooter=100,
                   header=None, names=['col_a', 'col_b'])

Note that you have to set the column names manually with the names argument, because the sheet's real header row is among the skipped rows; by default pandas would otherwise take the column names from the first row that is not skipped.

If you wish to use csv instead then it is a straightforward task since csv files are plain-text files.
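As a rough illustration of why csv is easier, the sketch below reads only a window of rows using the standard library. The function name and its parameters are hypothetical, and it assumes the file uses the platform's default text encoding.

```python
import csv
from itertools import islice


def read_csv_rows(csv_path, n_rows=10, skip=0):
    """Return up to `n_rows` rows (as lists of strings) after skipping `skip` rows."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        # islice pulls rows lazily, so parsing stops after skip + n_rows rows
        # instead of walking the whole file.
        return list(islice(reader, skip, skip + n_rows))
```

For example, read_csv_rows('/path/data.csv', n_rows=10) parses just the first ten lines, however large the file is.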

But, and it's a big but, if you are really desperate you can extract the relevant sheet's xml file from the xlsx archive and parse that. It's not going to be an easy task though.

An example xml file that represents a sheet with a single table of 3 rows by 2 columns (A1:B3). The <v> tags hold the cells' values; for cells marked t="s", the value is an index into the workbook's shared-strings table rather than the text itself.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
    <dimension ref="A1:B3"/>
    <sheetViews>
        <sheetView tabSelected="1" workbookViewId="0">
            <selection activeCell="C10" sqref="C10"/>
        </sheetView>
    </sheetViews>
    <sheetFormatPr defaultColWidth="11" defaultRowHeight="14.25" x14ac:dyDescent="0.2"/>
    <sheetData>
        <row r="1" spans="1:2" ht="15.75" x14ac:dyDescent="0.2">
            <c r="A1" t="s">
                <v>1</v>
            </c><c r="B1" s="1" t="s">
                <v>0</v>
            </c>
        </row>
        <row r="2" spans="1:2" ht="15" x14ac:dyDescent="0.2">
            <c r="A2" s="2">
                <v>1</v>
            </c><c r="B2" s="2">
                <v>4</v>
            </c>
        </row>
        <row r="3" spans="1:2" ht="15" x14ac:dyDescent="0.2">
            <c r="A3" s="2">
                <v>2</v>
            </c><c r="B3" s="2">
                <v>5</v>
            </c>
        </row>
    </sheetData>
    <pageMargins left="0.75" right="0.75" top="1" bottom="1" header="0.5" footer="0.5"/>
</worksheet>
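
For the "desperate" route, a sheet like the one above can be streamed with the standard library so that parsing stops after the first few rows. The following is only a sketch under several assumptions: the function names are invented here, the sheet is assumed to live at xl/worksheets/sheet1.xml and the shared strings at xl/sharedStrings.xml (common, but not guaranteed for every workbook), values are returned as raw strings, and empty cells that are absent from the xml are not padded. Note also that the shared-strings table itself is read in full.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the worksheet and shared-strings parts.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def load_shared_strings(archive, member="xl/sharedStrings.xml"):
    """Return the shared-string table as a list (empty if the part is absent)."""
    if member not in archive.namelist():
        return []
    strings = []
    with archive.open(member) as part:
        for _, elem in ET.iterparse(part):
            if elem.tag == NS + "si":
                # Join every <t> run so rich-text entries become plain text.
                strings.append("".join(t.text or "" for t in elem.iter(NS + "t")))
                elem.clear()
    return strings


def read_first_xlsx_rows(xlsx_file, n_rows=10, sheet="xl/worksheets/sheet1.xml"):
    """Return the first `n_rows` rows of `sheet` as lists of strings (or None)."""
    rows = []
    with zipfile.ZipFile(xlsx_file) as archive:
        shared = load_shared_strings(archive)
        with archive.open(sheet) as part:
            # iterparse yields each element once its closing tag has been read.
            for _, elem in ET.iterparse(part):
                if elem.tag != NS + "row":
                    continue
                values = []
                for cell in elem.iter(NS + "c"):
                    v = cell.find(NS + "v")
                    text = v.text if v is not None else None
                    if cell.get("t") == "s" and text is not None:
                        text = shared[int(text)]  # resolve shared-string index
                    values.append(text)
                rows.append(values)
                elem.clear()  # release the row's children
                if len(rows) >= n_rows:
                    break  # stop decompressing and parsing the rest of the sheet
    return rows
```

Because the loop breaks once enough rows have been collected, the remainder of the sheet's xml is not decompressed or parsed, which is the part that matters on a low-memory machine.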
