使用python-docx突出显示docx文件中的单词会产生错误的结果 [英] highlighting words in an docx file using python-docx gives incorrect results

查看:336
本文介绍了使用python-docx突出显示docx文件中的单词会产生错误的结果的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想突出显示MS Word文档(此处为negativeList)中的特定单词,并保留文档的其余部分.我尝试从此一个采纳,但我无法使其正常运行:

from docx.enum.text import WD_COLOR_INDEX
from docx import Document
import pandas as pd
import copy
import re

doc = Document(docxFileName)

negativList = ["king", "children", "lived", "fire"]  # some examples

for paragraph in doc.paragraphs:
    for target in negativList:
        if target in paragraph.text:  # it is worth checking in detail ...

            currRuns = copy.copy(paragraph.runs)   # deep copy as we delete/clear the object
            paragraph.runs.clear()

            for run in currRuns:
                if target in run.text:
                    words = re.split('(\W)', run.text)  # split into words in order to be able to color only one
                    for word in words:
                        if word == target:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = WD_COLOR_INDEX.PINK
                        else:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = None
                else: # our target is not in it so we add it unchanged
                    paragraph.runs.append(run)

doc.save('output.docx')

例如,我正在使用此文本(在docx文件中):

第1章

几个世纪以前在那里住过-

国王!"我的小读者会立即说.

不,孩子们,你错了.曾几何时 木头.那不是一块昂贵的木头.离得很远.只是一个 普通的木柴块,是放进去的那些厚实的原木之一 在冬天着火,使寒冷的房间变得温暖舒适.

我的代码有多个问题:

1)第一个句子起作用,但是第二个句子出现两次.为什么?

2)在我突出显示的部分中,格式以某种方式丢失了.我可能需要将原始运行的属性复制到新创建的属性中,但是我该怎么做?

3)我松开端子-"

4)在突出显示的最后一段中,舒适和温暖"缺失了...

我需要的是这些问题的解决方案,或者我想得太多,有没有更简单的方法来进行突出显示? (类似于doc.highlight({"king":"pink"},但我在文档中未找到任何内容)?

解决方案

您并没有考虑太多,这是一个具有挑战性的问题.这是搜索替换问题的一种形式.

通过搜索Paragraph.text可以很容易地找到目标文本,但是要在保留其他格式的同时对其进行替换(或者在您的情况下添加格式)需要在Run级别进行访问,这两个都已找到. /p>

尽管有一些复杂性,但这使它具有挑战性:

  • 不能保证您的查找"目标字符串完全位于一次运行中.因此,您将需要查找包含目标字符串的 start 的运行和包含目标字符串的 end 的运行以及两者之间的运行.

    这可以通过使用字符偏移来帮助,例如国王"中的字符偏移3处出现国王". ...",其长度为4,然后确定哪个游程包含字符3,哪个游程包含字符(3 + 4).

  • 与第一个并发症有关,不能保证部分出现目标字符串的所有运行都具有相同的格式.例如,如果您的目标字符串是"粗体单词",则更新版本(添加突出显示后)将至少需要运行 3 ,一个用于"a",一个表示粗体",表示单词"(顺便说一句,运行两个空格字符中的每一个都不会改变它们的显示方式).

    如果您接受目标字符串将始终是一个单词的简化,则可以考虑为替换运行提供找到的目标运行的第一个字符(首次运行)的格式的简化方法.

因此,我想有几种可能的方法,但是一种方法是标准化"包含目标字符串的每个段落的运行,以使目标字符串出现在不同的运行中.然后,您可以将突出显示应用于该运行,就可以得到想要的结果.

要获得更多帮助,您需要缩小问题范围并提供特定的输入和输出.我将从第一个开始(也许会丢失-")(在一个单独的问题中,也许从这里开始链接),然后一个接一个地进行直到一切正常.要求受访者提出自己的测试用例太多了:)

然后您将遇到一个类似的问题:通过此代码,我运行了字符串:'世纪前……-',而尾随的-"消失了……",这对于人们来说要容易得多进行推理.

另一个不错的下一步可能是打印每次运行的文本,以使您了解它们是如何分解的.这可以让您深入了解不起作用的地方.

I would like to highlight specific words in an MS word document (here given as negativeList) and leave the rest of the document as it was before. I have tried to adopt from this one but I can not get it running as it should:

from docx.enum.text import WD_COLOR_INDEX
from docx import Document
import pandas as pd
import copy
import re

doc = Document(docxFileName)

negativList = ["king", "children", "lived", "fire"]  # some examples

for paragraph in doc.paragraphs:
    for target in negativList:
        if target in paragraph.text:  # it is worth checking in detail ...

            currRuns = copy.copy(paragraph.runs)   # deep copy as we delete/clear the object
            paragraph.runs.clear()

            for run in currRuns:
                if target in run.text:
                    words = re.split('(\W)', run.text)  # split into words in order to be able to color only one
                    for word in words:
                        if word == target:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = WD_COLOR_INDEX.PINK
                        else:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = None
                else: # our target is not in it so we add it unchanged
                    paragraph.runs.append(run)

doc.save('output.docx')

As example I am using this text (in a word docx file):

CHAPTER 1

Centuries ago there lived --

"A king!" my little readers will say immediately.

No, children, you are mistaken. Once upon a time there was a piece of wood. It was not an expensive piece of wood. Far from it. Just a common block of firewood, one of those thick, solid logs that are put on the fire in winter to make cold rooms cozy and warm.

There are multiple problems with my code:

1) The first sentence works but the second sentence is in twice. Why?

2) The format gets somehow lost in the part where I highlight. I would possibly need to copy the properties of the original run into the newly created ones but how do I do this?

3) I loose the terminal "--"

4) In the highlighted last paragraph the "cozy and warm" is missing ...

What I would need is a eighter a fix for these problems or maybe I am overthinking it and there is a much easier way to do the highlighting? (something like doc.highlight({"king": "pink"} but I haven't found anything in the documentation)?

解决方案

You're not overthinking it, this is a challenging problem; it is a form of the search-and-replace problem.

The target text can be located fairly easily by searching Paragraph.text, but replacing it (or in your case adding formatting) while retaining other formatting requires access at the Run level, both of which you've discovered.

There are some complications though, which is what makes it challenging:

  • There is no guarantee that your "find" target string is located entirely in a single run. So you will need to find the run containing the start of your target string and the run containing the end of your target string, as well as any in-between.

    This might be aided by using character offsets, like "King" appears at character offset 3 in '"A king!" ...', and has a length of 4, then identifying which run contains character 3 and which contains character (3+4).

  • Related to the first complication, there is no guarantee that all the runs in which the target string partly appears are formatted the same. For example, if your target string was "a bold word", the updated version (after adding highlighting) would require at least three runs, one for "a ", one for "bold", and one for " word" (btw, which run each of the two space characters appear in won't change how they appear).

    If you accept the simplification that the target string will always be a single word, you can consider the simplification of giving the replacement run the formatting of the first character (first run) of the found target runs, which is probably the usual approach.

So I suppose there are a few possible approaches, but one would be to "normalize" the runs of each paragraph containing the target string, such that the target string appeared within a distinct run. Then you could just apply highlighting to that run and you'd get the result you wanted.

To be of more help, you'll need to narrow down the problem areas and provide specific inputs and outputs. I'd start with the first one (perhaps losing the "--") (in a separate question, perhaps linked from here) and then proceed one by one until it all works. It's asking too much for a respondent to produce their own test case :)

Then you'd have a question like: "I run the string: 'Centuries ago ... --' through this code and the trailing "--" disappears ...", which is a lot easier for folks to reason through.

Another good next step might be to print out the text of each run, just so you get a sense of how they're broken up. That may give you insight into where it's not working.

这篇关于使用python-docx突出显示docx文件中的单词会产生错误的结果的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆