Extracting tables from a DOCX Word document in Python

Question

I'm trying to extract the contents of the tables in a DOCX Word document, and I'm new to XML/XPath.

from docx import *
document = opendocx('someFile.docx')
tableList = document.xpath('/w:tbl')

This triggers an "XPathEvalError: Undefined namespace prefix" error. I expect it is only the first of several I will run into while developing the script. Unfortunately, I couldn't find a tutorial for python-docx.

Could you kindly provide an example of table extraction?

Recommended answer

After some back and forth, we found out that a namespace was needed for this to work correctly. The xpath method is the appropriate solution; it just needs to have the document namespace passed in.

The lxml documentation for the xpath method covers the namespace details. Further down that page it describes passing a namespaces dictionary, among other things.

As explained by mgierdal in his comment above:

tblList = document.xpath('//w:tbl', namespaces=document.nsmap) works like a dream. So, as I understand it, w: is a shorthand that has to be expanded to the full namespace name, and the dictionary for that is provided by document.nsmap.
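
To illustrate the prefix-to-namespace mapping described above, here is a minimal sketch that pulls table text out of WordprocessingML. It is not the code from the answer. It uses the standard-library xml.etree.ElementTree instead of lxml and python-docx, and the function names, the NS dictionary and the SAMPLE document are assumptions made up for this example. The idea is the same, though: the w: prefix means nothing until it is mapped to the full namespace URI.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace; "w" is only the conventional prefix.
NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}


def tables_from_xml(document_xml):
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(document_xml)
    tables = []
    # ".//w:tbl" matches tables at any depth (nested tables are listed too).
    for tbl in root.iterfind(".//w:tbl", NS):
        rows = []
        for tr in tbl.iterfind("w:tr", NS):
            cells = []
            for tc in tr.iterfind("w:tc", NS):
                # A cell's text is split across w:t runs; join them as-is.
                cells.append("".join(t.text or "" for t in tc.iterfind(".//w:t", NS)))
            rows.append(cells)
        tables.append(rows)
    return tables


def tables_from_docx(path):
    """Read word/document.xml out of the .docx zip archive and parse it."""
    with zipfile.ZipFile(path) as archive:
        return tables_from_xml(archive.read("word/document.xml"))


# Hypothetical one-table document, used only to exercise the function.
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:tbl>"
    "<w:tr><w:tc><w:p><w:r><w:t>a</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>b</w:t></w:r></w:p></w:tc></w:tr>"
    "</w:tbl></w:body></w:document>"
)
print(tables_from_xml(SAMPLE))
```

For a real file, tables_from_docx('someFile.docx') would apply the same extraction. Note also that the question's '/w:tbl' would only ever match a root element named w:tbl, whereas '//w:tbl' (here './/w:tbl') searches the whole tree.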
