python-docx: extracting a table from a Word .docx file


Question

I know this question has been asked before, but the other answers did not work for me. I have a Word file that contains a single table, and I want that table as the output of my Python program. I'm using Python 3.6 and have installed python-docx. Here is my code for extracting the data:

from docx.api import Document

document = Document('test_word.docx')
table = document.tables[0]

data = []

keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)

    if i == 0:
        keys = tuple(text)
        continue
    row_data = dict(zip(keys, text))
    data.append(row_data)
    print (data)

I want the result to look exactly like the table in the Word .docx file. Thanks in advance.

Recommended answer

Your code works fine for me. How about loading the result into a pandas DataFrame?

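import pandas as pd
from docx import Document

document = Document('test_word.docx')
table = document.tables[0]  # first table in the document

data = []
keys = None
for i, row in enumerate(table.rows):
    # cell.text is always a string, even for numeric-looking cells
    text = (cell.text for cell in row.cells)

    if i == 0:
        # the first row supplies the column names
        keys = tuple(text)
        continue
    row_data = dict(zip(keys, text))
    data.append(row_data)

print(data)  # print once, after all rows have been collected

df = pd.DataFrame(data)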
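
The header-then-rows loop above can also be wrapped in a small helper. This is only a sketch: `table_to_records` and `fake_table` are hypothetical names, not part of python-docx, and the stand-in table built from `SimpleNamespace` exists only so the snippet runs without a .docx file. With a real document you would presumably pass `document.tables[0]` instead.

```python
from types import SimpleNamespace


def table_to_records(table):
    """Turn a table into a list of dicts keyed by the first (header) row.

    `table` only needs a `.rows` sequence whose items expose `.cells`,
    each with a `.text` string, which is how the table is read above.
    """
    records = []
    keys = None
    for i, row in enumerate(table.rows):
        texts = [cell.text for cell in row.cells]
        if i == 0:
            keys = texts  # header row supplies the dict keys
            continue
        records.append(dict(zip(keys, texts)))
    return records


def fake_table(grid):
    """Hypothetical stand-in for a python-docx table, for demonstration only."""
    return SimpleNamespace(
        rows=[
            SimpleNamespace(cells=[SimpleNamespace(text=value) for value in line])
            for line in grid
        ]
    )


sample = fake_table([["1", "2", "3", "4"],
                     ["5", "6", "7", "8"],
                     ["9", "10", "11", "12"]])
records = table_to_records(sample)
print(records)
```

The resulting list of dicts has the same shape as `data` above, so it can be passed to `pd.DataFrame` in the same way.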

To display a particular row or column of that table, you can select rows and columns by position with iloc:

# iloc[row, column]; the values are strings because cell.text returns str
df.iloc[0, :].tolist()   # ['5', '6', '7', '8']    - row index 0
df.iloc[:, 0].tolist()   # ['5', '9', '13', '17']  - column index 0
df.iloc[0, 0]            # '5'                     - cell (0, 0)
df.iloc[1:, 2].tolist()  # ['11', '15', '19']      - column index 2, skipping the first row

And so on.

However, if your columns have names (here the names are the numbers from the header row), you can select a column by name:

# df["name"].tolist()
df["1"].tolist()  # ['5', '9', '13', '17'] - column named "1" (header text is a string)

print(df)

prints the following, which matches the table in my sample document:

    1   2   3   4
0   5   6   7   8
1   9   10  11  12
2   13  14  15  16
3   17  18  19  20
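
If the goal is only to print the table roughly as it appears in the document, without the DataFrame index column, a plain-Python grid may be enough. The sketch below assumes the cell texts have already been collected into a list of lists, for example with `[[cell.text for cell in row.cells] for row in table.rows]`. `format_grid` is a hypothetical helper, not part of python-docx or pandas.

```python
def format_grid(rows):
    """Return `rows` (lists of strings) as left-aligned, space-padded text lines."""
    if not rows:
        return ""
    n_cols = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of columns.
    padded = [list(row) + [""] * (n_cols - len(row)) for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[c]) for row in padded) for c in range(n_cols)]
    lines = [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in padded
    ]
    return "\n".join(lines)


# Sample cell texts standing in for a table read from a document.
rows = [["1", "2", "3", "4"],
        ["5", "6", "7", "8"],
        ["9", "10", "11", "12"]]
grid_text = format_grid(rows)
print(grid_text)
```

This keeps the header row as the first printed line and does not reproduce borders, merged cells, or other Word formatting.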
