如何从具有许多表的Excel工作表中解析数据帧(使用Python，可能是Pandas) [英] How to parse dataframes from an excel sheet with many tables (using Python, possibly Pandas)

查看：101 发布时间：2020/5/24 2:20:24 python excel pandas

本文介绍了如何从具有许多表的Excel工作表中解析数据帧(使用Python，可能是Pandas)的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在处理布局不当的excel表，我试图将其解析并写入数据库.

I'm dealing with badly laid out excel sheets which I'm trying to parse and write into a database.

每个工作表可以有多个表.尽管这些可能表的标题是已知的，但哪些表将位于任何给定的工作表上却不是，但它们在工作表上的确切位置也不是(这些表未以一致的方式对齐).我添加了两个可能的工作表布局图片来说明这一点:此布局具有两个表，而此表具有第一个表的所有表，但不在第一个表中相同的位置，外加一张桌子.

Every sheet can have multiple tables. Though the header for these possible tables are known, which tables are gonna be on any given sheet is not, neither is their exact location on the sheet (the tables don't align in a consistent way). I've added a pic of two possible sheet layout to illustrate this: This layout has two tables, while this one has all the tables of the first, but not in the same location, plus an extra table.

我所知道的:

所有可能的表标题，因此每个单独的表都可以通过其标题进行标识
表由空白单元格分隔.他们不会互相碰触.

我的问题是否可以使用一些Python模块(例如熊猫)来解决此问题?

My question Is there a clean way to deal with this using some Python module such as pandas?

我目前的做法:

我当前正在转换为.csv并解析每一行.我将每一行围绕空白单元格拆分，并处理该行的第一部分(应属于最左侧的表).该行的其余部分排队，然后以相同的方式进行处理.然后，我阅读此first_part并检查它是否为标题行.如果是这样，我将使用它来识别我正在处理的表(该表存储在全局current_df中).随后不是标头行的行将被馈送到该表中(这里我在表中使用pandas.DataFrame.)

I'm currently converting to .csv and parsing each row. I split each row around blank cells, and process the first part of the row (should belong to the leftmost table). The remainder of the row is queued and later processed in the same manner. I then read this first_part and check whether or not it's a header row. If it is, I use it to identify which table I'm dealing with (this is stored in a global current_df). Subsequent rows which are not header rows are fed into this table (here I'm using pandas.DataFrame for my tables).

到目前为止，以下代码(大多数代码不完整且未经测试，但可以传达上面的方法):

Code so far is below (mostly incomplete and untested, but it should convey the approach above):

class DFManager(object): # keeps track of current table and its headers
    current_df = None
    current_headers = []

    def set_current_df(self, df, headers):
        self.current_headers = headers
        self.current_df = df


def split_row(row, separator):
    while row and row[0] == separator:
       row.pop(0)
    while row and row[-1] == separator:
       row.pop()

    if separator in row:
        split_index = row.index(separator)
        return row[:split_index], row[split_index:]
    else:
        return row, []


def process_df_row(row, dfmgr):
    df = df_with_header(row) # returns the dataframe with these headers

    if df is None: # is not a header row, add it to current df
        df = dfmgr.current_df
        add_row_to_df(row, df)
    else:
        dfmgr.set_current_df(df, row)


# this is passed the Excel sheet
def populate_dataframes(xl_sheet):
    dfmgr = DFManager()
    row_queue = Queue()
    for row in xl_sheet:
        row_queue.put(row)

    for row in iter(row_queue.get, None):
        if not row:
            continue

        first_part, remainder = split_row(row)
        row_queue.put(remainder)

        process_df_row(first_part, dfmgr)

如何从具有许多表的Excel工作表中解析数据帧(使用Python，可能是Pandas) [英] How to parse dataframes from an excel sheet with many tables (using Python, possibly Pandas)

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

如何从具有许多表的Excel工作表中解析数据帧(使用Python，可能是Pandas) [英] How to parse dataframes from an excel sheet with many tables (using Python, possibly Pandas)

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭