如何从具有许多表的Excel工作表中解析数据帧(使用Python,可能是Pandas) [英] How to parse dataframes from an excel sheet with many tables (using Python, possibly Pandas)

查看:101
本文介绍了如何从具有许多表的Excel工作表中解析数据帧(使用Python,可能是Pandas)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在处理布局不当的excel表,我试图将其解析并写入数据库.

I'm dealing with badly laid out excel sheets which I'm trying to parse and write into a database.

每个工作表可以有多个表.尽管这些可能表的标题是已知的,但哪些表将位于任何给定的工作表上却不是,但它们在工作表上的确切位置也不是(这些表未以一致的方式对齐).我添加了两个可能的工作表布局图片来说明这一点:此布局具有两个表,而此表具有第一个表的所有表,但不在第一个表中相同的位置,外加一张桌子.

Every sheet can have multiple tables. Though the header for these possible tables are known, which tables are gonna be on any given sheet is not, neither is their exact location on the sheet (the tables don't align in a consistent way). I've added a pic of two possible sheet layout to illustrate this: This layout has two tables, while this one has all the tables of the first, but not in the same location, plus an extra table.

我所知道的:

  1. 所有可能的表标题,因此每个单独的表都可以通过其标题进行标识
  2. 表由空白单元格分隔.他们不会互相碰触.

我的问题是否可以使用一些Python模块(例如熊猫)来解决此问题?

My question Is there a clean way to deal with this using some Python module such as pandas?

我目前的做法:

我当前正在转换为.csv并解析每一行.我将每一行围绕空白单元格拆分,并处理该行的第一部分(应属于最左侧的表).该行的其余部分排队,然后以相同的方式进行处理.然后,我阅读此first_part并检查它是否为标题行.如果是这样,我将使用它来识别我正在处理的表(该表存储在全局current_df中).随后不是标头行的行将被馈送到该表中(这里我在表中使用pandas.DataFrame.)

I'm currently converting to .csv and parsing each row. I split each row around blank cells, and process the first part of the row (should belong to the leftmost table). The remainder of the row is queued and later processed in the same manner. I then read this first_part and check whether or not it's a header row. If it is, I use it to identify which table I'm dealing with (this is stored in a global current_df). Subsequent rows which are not header rows are fed into this table (here I'm using pandas.DataFrame for my tables).

到目前为止,以下代码(大多数代码不完整且未经测试,但可以传达上面的方法):

Code so far is below (mostly incomplete and untested, but it should convey the approach above):

class DFManager(object): # keeps track of current table and its headers
    current_df = None
    current_headers = []

    def set_current_df(self, df, headers):
        self.current_headers = headers
        self.current_df = df


def split_row(row, separator):
    while row and row[0] == separator:
       row.pop(0)
    while row and row[-1] == separator:
       row.pop()

    if separator in row:
        split_index = row.index(separator)
        return row[:split_index], row[split_index:]
    else:
        return row, []


def process_df_row(row, dfmgr):
    df = df_with_header(row) # returns the dataframe with these headers

    if df is None: # is not a header row, add it to current df
        df = dfmgr.current_df
        add_row_to_df(row, df)
    else:
        dfmgr.set_current_df(df, row)


# this is passed the Excel sheet
def populate_dataframes(xl_sheet):
    dfmgr = DFManager()
    row_queue = Queue()
    for row in xl_sheet:
        row_queue.put(row)

    for row in iter(row_queue.get, None):
        if not row:
            continue

        first_part, remainder = split_row(row)
        row_queue.put(remainder)

        process_df_row(first_part, dfmgr)

推荐答案

这种情况非常特殊,以至于没有现成的模块可以做到这一点的干净"方法.

This is such a specific situation that there is likely no "clean" way to do this with a ready-made module.

执行此操作的一种方法可能是使用标题信息来查找每个表的起始索引,类似于此解决方案(

One way to do this might use the header information you already have to find the starting indices of each table, something like this solution (Python Pandas - Read csv file containing multiple tables), but with an offset in the column direction as well.

一旦有了每个表格的起始位置,就需要确定宽度(先验已知,或者通过读取直到下一个空白列来发现),然后将这些列读入数据帧,直到表格结束.

Once you have the starting position of each table, you'll want to determine the widths (either known a priori or discovered by reading until the next blank column) and read those columns into a dataframe until the end of the table.

基于索引的方法而不是基于队列的方法的好处是,您无需重新发现分隔符在每一行中的位置,也无需跟踪哪些行片段属于哪个表.每行是否存在2个以上的表也是不可知的.

The benefit of an index-based method rather than a queue based method is that you do not need to re-discover where the separator is in each row or keep track of which row fragments belong to which table. It is also agnostic to the presence of >2 tables per row.

这篇关于如何从具有许多表的Excel工作表中解析数据帧(使用Python,可能是Pandas)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆