How to extract text inserted with track-changes in python-docx


Question

I want to extract text from Word documents that were edited in "Track Changes" mode. I want to extract the inserted text and ignore the deleted text.

Running the code below, I saw that paragraphs inserted in "Track Changes" mode return an empty Paragraph.text:

import docx

# Raw string, so the "\t" in the Windows path is not read as a tab character
doc = docx.Document(r'C:\test track changes.docx')

for para in doc.paragraphs:
    print(para)
    print(para.text)

Is there a way to retrieve the text in revisioned inserts (w:ins elements)?

I'm using python-docx 0.8.6, lxml 3.4.0, Python 3.4, Win7.

Thanks

Recommended answer

Not directly using python-docx; there's no API support yet for tracked changes/revisions.

It's a pretty tricky job, as you'll discover if you search on the element names. Searching for 'open xml w:ins', for a start, brings up this document as the first result: https://msdn.microsoft.com/en-us/library/ee836138(v=office.12).aspx

If I needed to do something like that in a pinch, I'd get the body element using:

body = document._body._body

and then use XPath on that to return the elements I wanted, something vaguely like this (untested) aircode:

from docx.text.paragraph import Paragraph

inserted_ps = body.xpath('./w:ins//w:p')
for p in inserted_ps:
    paragraph = Paragraph(p, None)
    print(paragraph.text)

You'll be on your own for figuring out which XPath expression gets you the paragraphs you want.
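As a starting point for that XPath experimentation, here is a minimal sketch that runs without a .docx file. It swaps in the standard library's xml.etree.ElementTree instead of python-docx, and the sample paragraph XML, SAMPLE_P, and the two helper functions are hypothetical names made up for illustration. It relies on the WordprocessingML convention that inserted runs sit inside w:ins and that deleted text is stored in w:delText rather than w:t.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical paragraph: one untouched run, one tracked insertion,
# and one tracked deletion.
SAMPLE_P = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t xml:space="preserve">Kept </w:t></w:r>'
    '<w:ins w:id="1" w:author="A"><w:r><w:t>added</w:t></w:r></w:ins>'
    '<w:del w:id="2" w:author="A">'
    '<w:r><w:delText>removed</w:delText></w:r>'
    '</w:del>'
    '</w:p>' % W_NS
)


def inserted_text(p):
    """Return only the text of runs wrapped in a w:ins element."""
    return "".join(t.text or "" for t in p.findall(".//w:ins//w:t", NS))


def accepted_text(p):
    """Return the text as if all changes were accepted.

    Every w:t is collected, at any depth; deleted text lives in
    w:delText, so it is skipped without special handling.
    """
    return "".join(t.text or "" for t in p.findall(".//w:t", NS))


p = ET.fromstring(SAMPLE_P)
print(inserted_text(p))
print(accepted_text(p))
```

With python-docx itself, a similar expression could presumably be tried against each paragraph's underlying element, for example paragraph._p.xpath('.//w:ins//w:t'). That relies on private attributes and is untested here, so treat it as an assumption to verify against your own documents.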

opc-diag may be a friend in this, allowing you to quickly scan the XML of the .docx package. http://opc-diag.readthedocs.io/en/latest/index.html
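If opc-diag is not at hand, the XML can also be inspected with the standard library alone, since a .docx package is a ZIP archive whose main part is word/document.xml. The sketch below is only an illustration: read_document_xml is a hypothetical helper name, and the in-memory archive is a stand-in for a real file, not a valid Word document.

```python
import io
import zipfile


def read_document_xml(docx_file):
    """Return the raw XML of the main document part as a string.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as package:
        return package.read("word/document.xml").decode("utf-8")


# Stand-in package built in memory so the sketch runs without a real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("word/document.xml", "<w:document/>")
buffer.seek(0)

xml_text = read_document_xml(buffer)
print(xml_text)
```

For a real document, pass the file path instead of the buffer and search the returned string for w:ins and w:del to see where the tracked changes sit.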
