Python-docx: identify a page break in paragraph

Views: 161 Published: 2021/7/17 20:03:57 python search python-docx page-break


Problem description

I iterate over the document paragraph by paragraph, then split each paragraph's text into sentences on '. ' (a dot followed by a space). Splitting into sentences makes the text search more effective than searching the whole paragraph text at once.

The code then searches each word of a sentence for errors, which are taken from an error-correction database. A simplified version of the code is shown below:

from docx.enum.text import WD_BREAK

for paragraph in document.paragraphs:
    sentences = paragraph.text.split('. ')
    for sentence in sentences:
        words = sentence.split(' ')
        for word in words:
            for error in error_dictionary:
                if error in word:
                    # (A) make simple replacement
                    word = word.replace(error, correction, 1)
                    # (B) alternative replacement based on runs
                    for run in paragraph.runs:
                        if error in run.text:
                            run.text = run.text.replace(error, correction, 1)
                        # here we may fetch page break attribute and knowing current number
                        # find out at what page the replacement has taken place
                        if run.page_break == WD_BREAK:
                            current_page_number += 1
                    replace_counter += 1
                    # write to a report what paragraph and what page
                    write_report(error, correction, sentence, current_page_number)
                    # for that I need to know a page break
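
The text-only part of the loop above (variant A) can be sketched in a self-contained form. This is only an illustration under stated assumptions: `find_replacements` is a hypothetical helper name, and `error_dictionary` is assumed to map each error string to its correction, since the snippet above leaves `correction` undefined. It works on a plain string and does not touch runs or page numbers.

```python
def find_replacements(paragraph_text, error_dictionary):
    """Return (corrected_text, report) for one paragraph's text.

    Assumption: error_dictionary maps each error string to its correction.
    The report holds one (error, correction, sentence) tuple per replacement.
    """
    report = []
    corrected_sentences = []
    # Split the paragraph into sentences on a dot followed by a space.
    for sentence in paragraph_text.split('. '):
        corrected_words = []
        for word in sentence.split(' '):
            for error, correction in error_dictionary.items():
                if error in word:
                    # Replace only the first occurrence, as in variant (A).
                    word = word.replace(error, correction, 1)
                    report.append((error, correction, sentence))
            corrected_words.append(word)
        corrected_sentences.append(' '.join(corrected_words))
    return '. '.join(corrected_sentences), report
```

Each report entry keeps the original sentence, so a page number could be attached to it once a way of detecting page breaks is available.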

The question is how to tell whether a run (or another paragraph element) contains a page break. Does run.page_break == WD_BREAK work? @scanny has shown how to add a page break, but how can one be identified?

Ideally, it would also be possible to identify line breaks within a paragraph.

I can do this:

for run in paragraph.runs:
    # br_lst holds the <w:br> children of the run's XML element
    if run._element.br_lst:
        for br in run._element.br_lst:
            br_counter += 1
            print(br.type)

However, this code only reports hard breaks, that is, breaks inserted with Ctrl+Enter. Soft page breaks are not detected. (A soft page break is formed when the user keeps typing until the current page runs out and the text flows onto the next page.)
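
One possible way to look for both kinds of break is to inspect the run's underlying XML. In WordprocessingML, a hard page break is a `<w:br w:type="page"/>` element, while Word records the places where it last paginated as `<w:lastRenderedPageBreak/>` elements inside runs. The sketch below is a hedged illustration, not a confirmed answer to the question: `count_page_breaks` is a hypothetical helper, it uses only the standard library, and it assumes the serialized XML of a run is available as a string (with python-docx this is presumably `run._element.xml`). The soft-break markers are only present if the application that last saved the file wrote them, and they can be stale.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by <w:r>, <w:br>, and related elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_page_breaks(run_xml):
    """Return (hard, soft) page-break counts for one serialized <w:r> element.

    hard: <w:br> elements whose w:type attribute is "page"
    soft: <w:lastRenderedPageBreak> markers saved by the word processor
    """
    root = ET.fromstring(run_xml)
    hard = sum(
        1
        for br in root.iter("{%s}br" % W_NS)
        if br.get("{%s}type" % W_NS) == "page"
    )
    soft = sum(1 for _ in root.iter("{%s}lastRenderedPageBreak" % W_NS))
    return hard, soft
```

Under those assumptions, adding `hard + soft` to a running page counter while looping over `paragraph.runs` would give an approximate page number for each replacement; a `<w:br>` without `w:type="page"` is an ordinary line break and is not counted.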

Any hints?
