Extracting Highlighted Words from Word Document (.docx) in Python


Problem description

I am working with a batch of Word documents in which some words are highlighted in different colors (e.g. yellow, blue, gray). I want to extract the highlighted words associated with each color. I am programming in Python. Here is what I have done so far:

I opened the Word document with python-docx and then selected the <w:r> tags, which contain the tokens (words) in the document. I used the following code:

#!/usr/bin/env python2.6
# -*- coding: ascii -*-
from docx import *
document = opendocx('test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)
for word in words:
  print word

Now I am stuck at the next step: for each word, checking whether it has a <w:highlight> tag, extracting the color code from it, and, if the color is yellow, printing the text inside the <w:t> tag. I would really appreciate it if someone could point me towards extracting the words from the parsed file.

Recommended answer

I had never worked with python-docx before, but it helped that I found a snippet online showing what the XML structure of a highlighted piece of text looks like:

<w:r>
  <w:rPr>
    <w:highlight w:val="yellow"/>
  </w:rPr>
  <w:t>text that is highlighted</w:t>
</w:r>

From there, it was relatively straightforward to come up with this:

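from docx import opendocx  # legacy python-docx API, as used in the question

# opendocx returns the parsed word/document.xml as an lxml element
document = opendocx(r'test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)

WPML_URI = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
tag_rPr = WPML_URI + 'rPr'
tag_highlight = WPML_URI + 'highlight'
tag_val = WPML_URI + 'val'
tag_t = WPML_URI + 't'  # needed below to look up the <w:t> element

for word in words:
    for rPr in word.findall(tag_rPr):
        highlight = rPr.find(tag_highlight)
        # Skip runs that have run properties but no <w:highlight> child
        if highlight is not None and highlight.attrib.get(tag_val) == 'yellow':
            text = word.find(tag_t)
            if text is not None:
                print(text.text)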
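
The question asks for the words associated with each color, not only yellow. A minimal sketch of one way to do that, assuming only the run structure shown in the XML snippet above, is to read word/document.xml straight out of the .docx archive with the standard library and group run text by the w:val of its <w:highlight> element. The function names group_highlights and highlights_from_docx are hypothetical helpers introduced here for illustration; they are not part of python-docx.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import defaultdict

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
VAL_ATTR = "{%s}val" % W_NS


def group_highlights(document_xml):
    """Map each highlight color to the list of run texts that use it.

    Hypothetical helper: expects the contents of word/document.xml.
    """
    root = ET.fromstring(document_xml)
    by_color = defaultdict(list)
    for run in root.iter("{%s}r" % W_NS):
        highlight = run.find("w:rPr/w:highlight", NS)
        if highlight is None:
            continue  # this run carries no direct highlight
        color = highlight.get(VAL_ATTR)
        # A run may hold several <w:t> elements; join their text.
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        if color and text:
            by_color[color].append(text)
    return dict(by_color)


def highlights_from_docx(path):
    """Hypothetical helper: a .docx is a zip archive; read its main part."""
    with zipfile.ZipFile(path) as archive:
        return group_highlights(archive.read("word/document.xml"))
```

Calling highlights_from_docx('test.docx') would return a dictionary keyed by color name. This sketch only sees highlighting set directly on a run; Word may also split a single word across several runs, so adjacent entries of the same color may need to be joined afterwards.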
