Extracting tables from a DOCX Word document in Python
Question
I'm trying to extract the contents of the tables in a DOCX Word document, and I'm new to XML/XPath.
from docx import *
document = opendocx('someFile.docx')
tableList = document.xpath('/w:tbl')
This raises an "XPathEvalError: Undefined namespace prefix" error. I'm sure it's only the first of the errors to expect while developing the script. Unfortunately, I couldn't find a tutorial for python-docx.
Could you provide an example of table extraction?
Recommended answer
After some back and forth, we found that a namespace was needed for this to work. The xpath method is the right approach; it just needs the document's namespace map passed in.
The lxml documentation for the xpath method covers the namespace details. Further down that page it describes passing a namespaces dictionary, among other things.
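To make the prefix-to-URI idea concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml so it runs without extra installs; the mapping dictionary plays the same role as the namespaces argument. SAMPLE_XML is a hand-written stand-in for a real document body, and the names W_NS and NSMAP are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NSMAP = {"w": W_NS}

# Tiny hand-written stand-in for a real document body (illustration only).
SAMPLE_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
    "</w:body></w:document>"
)

root = ET.fromstring(SAMPLE_XML)

# The parsed tree stores the expanded name "{uri}document", not the "w:" shorthand.
expanded_tag = root.tag

# Supplying the prefix map lets the query use the shorthand again.
tables = root.findall(".//w:tbl", NSMAP)
print(expanded_tag, len(tables))
```

Without the mapping, the query engine has no way to know which URI "w" stands for, which is the situation the "Undefined namespace prefix" message describes.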
As explained by mgierdal in his comment above:
tblList = document.xpath('//w:tbl', namespaces=document.nsmap) works like a dream. So, as I understand it, w: is a shorthand that has to be expanded to the full namespace name, and the dictionary for that is provided by document.nsmap.
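Since the question asked for an extraction example, here is one possible sketch of going from the matched w:tbl elements to cell text. It does not use python-docx: it reads word/document.xml straight out of the .docx archive with the standard library, which is a swapped-in approach rather than the method from the thread. The helper names extract_tables and make_sample_docx are hypothetical, and the sample archive contains only the one part this reader needs, so it is not a file Word itself would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NSMAP = {"w": W_NS}


def extract_tables(docx_file):
    """Return every table as a list of rows, each row a list of cell strings."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    tables = []
    for tbl in root.findall(".//w:tbl", NSMAP):
        rows = []
        for tr in tbl.findall("w:tr", NSMAP):
            row = []
            for tc in tr.findall("w:tc", NSMAP):
                # Join all text runs in the cell (this includes nested tables, if any).
                row.append("".join(t.text or "" for t in tc.iter(f"{{{W_NS}}}t")))
            rows.append(row)
        tables.append(rows)
    return tables


def make_sample_docx():
    """Build an in-memory archive holding only word/document.xml (hypothetical fixture)."""
    def cell(text):
        return f"<w:tc><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"

    data = [["Name", "Qty"], ["Apple", "3"]]
    body = "".join("<w:tr>" + "".join(cell(c) for c in r) + "</w:tr>" for r in data)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body><w:tbl>{body}</w:tbl></w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf


tables = extract_tables(make_sample_docx())
print(tables)
```

For a real file, extract_tables('someFile.docx') should work the same way, since zipfile.ZipFile accepts a path as well as a file-like object.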