Python docx - Modify runs to target specific words


Problem description

I'm developing a piece of Python code that searches a docx file for certain terms, for example finding the word "car" and highlighting it with a defined color.

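I'm using the docx module to identify and highlight the text, and I can apply changes at the run level (run.font.highlight_color). However, because MS Word stores the text in an XML file that tracks every change, the word I'm looking for can be split across different runs, or it can sit inside a single run that holds a whole long sentence. Since my final goal is to target one or more defined words, I'm struggling to achieve the expected result.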
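The effect described above can be reproduced without a document. The following is a minimal sketch in which the run texts are hypothetical stand-ins for what `[run.text for run in par.runs]` might return; it shows why a run-by-run search can miss a word that a search over the whole paragraph text finds.

```python
import re

# Hypothetical run texts for a paragraph reading "The car is red";
# here the word "car" has been split across the first two runs.
run_texts = ["The c", "ar is ", "red"]

pattern = re.compile(r"car")

# Searching run by run, as a per-run loop does, finds nothing.
per_run_hits = [i for i, text in enumerate(run_texts) if pattern.search(text)]

# Searching the joined text (the equivalent of paragraph.text) finds the
# word and gives its character offsets within the paragraph.
paragraph_text = "".join(run_texts)
spans = [m.span() for m in pattern.finditer(paragraph_text)]

print(per_run_hits)  # []
print(spans)         # [(4, 7)]
```

The offsets in `spans` are relative to the paragraph, not to any one run, so they still have to be mapped back onto run boundaries before formatting can be applied.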

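My main idea is to run a function that "cleans up" the runs or the XML so that each target word ends up in its own run, which can then be highlighted. I haven't found any documentation on this, though, and I'm worried about losing font properties, styles, and so on.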

This is the code I have so far:

import docx
from docx.enum.text import WD_COLOR_INDEX
import re

doc = docx.Document('demo.docx')

words = {'car': 'RED',
         'bus': 'GREEN',
         'train station': 'BLUE'}

for word, color in words.items():
    w = re.compile(fr'{word}')
    
    for par in doc.paragraphs:
        for run in par.runs:
            s = re.findall(w, run.text)
            if s:
                run.font.highlight_color = getattr(WD_COLOR_INDEX, color)

doc.save('new.docx')

Has anyone run into the same problem or taken a different approach?

Thanks

Recommended answer

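This function can be used to isolate a run within a paragraph, based on the match.start() and match.end() values obtained from a regex match on paragraph.text. From there, you can change the properties of the returned run however you like without affecting the adjacent text: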

import copy
import itertools

from docx.text.run import Run


def isolate_run(paragraph, start, end):
    """Return docx.text.Run object containing only `paragraph.text[start:end]`.

    Runs are split as required to produce a new run at the `start` that ends at `end`.
    Runs are unchanged if the indicated range of text already occupies its own run. The
    resulting run object is returned.

    `start` and `end` are as in Python slice notation. For example, the first three
    characters of the paragraph have (start, end) of (0, 3). `end` is not the index of
    the last character. These correspond to `match.start()` and `match.end()` of a regex
    match object and `s[start:end]` of Python slice notation.
    """
    rs = tuple(paragraph._p.r_lst)

    def advance_to_run_containing_start(start, end):
        """Return (r, r_idx, start, end) indicating start run and adjusted offsets.

        The start run is the run the `start` offset occurs in. The returned `start` and
        `end` values are adjusted to be relative to the start of `r_idx`.
        """
        # --- add 0 at end so `r_ends[-1] == 0` ---
        r_ends = tuple(itertools.accumulate(len(r.text) for r in rs)) + (0,)
        r_idx = 0
        while start >= r_ends[r_idx]:
            r_idx += 1
        skipped_rs_offset = r_ends[r_idx - 1]
        return rs[r_idx], r_idx, start - skipped_rs_offset, end - skipped_rs_offset

    def split_off_prefix(r, start, end):
        """Return adjusted `end` after splitting prefix off into separate run.

        Does nothing if `r` is already the start of the isolated run.
        """
        if start > 0:
            prefix_r = copy.deepcopy(r)
            r.addprevious(prefix_r)
            r.text = r.text[start:]
            prefix_r.text = prefix_r.text[:start]
        return end - start

    def split_off_suffix(r, end):
        """Split `r` at `end` such that suffix is in separate following run."""
        suffix_r = copy.deepcopy(r)
        r.addnext(suffix_r)
        r.text = r.text[:end]
        suffix_r.text = suffix_r.text[end:]

    def lengthen_run(r, r_idx, end):
        """Add prefixes of following runs to `r` until `end` is reached."""
        while len(r.text) < end:
            suffix_len_reqd = end - len(r.text)
            r_idx += 1
            next_r = rs[r_idx]
            if len(next_r.text) <= suffix_len_reqd:
                # --- subsume next run ---
                r.text = r.text + next_r.text
                next_r.getparent().remove(next_r)
                continue
            if len(next_r.text) > suffix_len_reqd:
                # --- take prefix from next run ---
                r.text = r.text + next_r.text[:suffix_len_reqd]
                next_r.text = next_r.text[suffix_len_reqd:]

    r, r_idx, start, end = advance_to_run_containing_start(start, end)
    end = split_off_prefix(r, start, end)

    # --- if run is longer than isolation-range we need to split-off a suffix run ---
    if len(r.text) > end:
        split_off_suffix(r, end)
    # --- if run is shorter than isolation-range we need to lengthen it by taking text
    # --- from subsequent runs
    elif len(r.text) < end:
        lengthen_run(r, r_idx, end)

    return Run(r, paragraph)
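To see what the function does to run boundaries without opening a document, the same idea can be modelled on a plain list of strings. This is only an illustrative sketch: `isolate_span` is a hypothetical helper, not part of python-docx, and it recomputes all boundaries at once instead of editing XML elements in place. It also drops empty runs, which the function above does not do.

```python
from itertools import accumulate


def isolate_span(run_texts, start, end):
    """Model run isolation on plain strings (illustration only).

    Returns (pieces, index) such that pieces[index] is exactly
    "".join(run_texts)[start:end] and joining pieces gives the same text.
    """
    text = "".join(run_texts)
    if not 0 <= start < end <= len(text):
        raise ValueError("span out of range")
    # Offsets at which the existing runs begin or end.
    bounds = {0, *accumulate(len(t) for t in run_texts)}
    # Boundaries strictly inside the span disappear (those runs are merged);
    # the span's own start and end become boundaries (runs are split there).
    bounds = {b for b in bounds if not start < b < end} | {start, end}
    cuts = sorted(bounds)
    pieces = [text[a:b] for a, b in zip(cuts, cuts[1:])]
    return pieces, cuts.index(start)


pieces, idx = isolate_span(["The c", "ar is ", "red"], 4, 7)
print(pieces, idx)  # ['The ', 'car', ' is ', 'red'] 1
```

In this example the first run loses a prefix ("The ") and the isolated run takes a prefix from the following run ("ar"), which corresponds to the prefix-split and lengthening cases in the function above.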

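It's more complicated than one might expect; it was certainly a lot more complicated than I imagined when I started working on it. In any case, it's the kind of thing that comes in handy from time to time.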
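To tie this back to the question, a driver loop might look like the sketch below. `highlight_words` is a hypothetical name, and `isolate_run` is passed in as a parameter so the sketch stays self-contained; it is assumed to be the function defined above. `word_colors` is assumed to map each search string to a highlight value such as a `WD_COLOR_INDEX` member.

```python
import re


def highlight_words(paragraphs, word_colors, isolate_run):
    """Highlight every occurrence of each word; return the number of matches.

    `isolate_run(paragraph, start, end)` must return a run object covering
    exactly paragraph.text[start:end].
    """
    count = 0
    for paragraph in paragraphs:
        for word, color in word_colors.items():
            # re.escape keeps characters such as "." or "(" literal.
            pattern = re.compile(re.escape(word))
            # Offsets come from paragraph.text; isolating a run moves run
            # boundaries but leaves the text itself unchanged, so offsets
            # collected up front remain valid.
            for match in list(pattern.finditer(paragraph.text)):
                run = isolate_run(paragraph, match.start(), match.end())
                run.font.highlight_color = color
                count += 1
    return count
```

With python-docx, this could be called as `highlight_words(doc.paragraphs, {'car': WD_COLOR_INDEX.RED, 'bus': WD_COLOR_INDEX.GREEN}, isolate_run)` followed by `doc.save('new.docx')`. Note that a plain substring pattern also matches inside longer words ("car" in "scar"); wrapping the escaped word in `\b...\b` is one way to restrict it to whole words.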
