Python-docx: identify a page break in a paragraph

Question

I iterate over a document paragraph by paragraph, then split each paragraph's text into sentences on ". " (a dot followed by a space). Splitting into sentences makes the text search more effective than searching the whole paragraph text at once.

The code then checks each word of the sentence for errors taken from an error-correction database. A simplified version is shown below:

from docx.enum.text import WD_BREAK

# document, write_report and error_dictionary are defined elsewhere;
# error_dictionary is assumed here to map each error to its correction
current_page_number = 1
replace_counter = 0

for paragraph in document.paragraphs:
    sentences = paragraph.text.split('. ')
    for sentence in sentences:
        words = sentence.split(' ')
        for word in words:
            for error, correction in error_dictionary.items():
                if error in word:
                    # (A) make simple replacement
                    word = word.replace(error, correction, 1)
                    # (B) alternative replacement based on runs
                    for run in paragraph.runs:
                        if error in run.text:
                            run.text = run.text.replace(error, correction, 1)
                        # here we may fetch a page break attribute and, knowing the
                        # current page number, find out on what page the replacement
                        # has taken place (this check is the one asked about below)
                        if run.page_break == WD_BREAK:
                            current_page_number += 1
                    replace_counter += 1
                    # write to a report what paragraph and what page;
                    # for that I need to know about page breaks
                    write_report(error, correction, sentence, current_page_number)

The problem is how to identify whether a run (or another paragraph element) contains a page break. Does run.page_break == WD_BREAK work? @scanny has shown how to add a page break, but how do I identify one?

Ideally, I would like to identify the breaks within a paragraph.

I can do this:

br_counter = 0
for run in paragraph.runs:
    # br_lst holds the <w:br> children of the run's underlying XML element
    for br in run._element.br_lst:
        br_counter += 1
        print(br.type)

Yet this code only shows hard breaks, that is, breaks inserted with Ctrl+Enter. Soft page breaks are not detected. (A soft page break is formed when the user keeps typing until the current page runs out and the text flows onto the next page.)

Any hints?

Recommended answer

For soft and hard page breaks I now use the following:

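for run in paragraph.runs:
    # soft break: Word records where the page ended when it last rendered the file
    if 'lastRenderedPageBreak' in run._element.xml:
        print('soft page break found at run:', run.text[:20])
    # hard break: an explicit <w:br> element whose type is "page"
    if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
        print('hard page break found at run:', run.text[:20])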
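The substring checks above can also match when the run's own text happens to contain those strings, or when a plain line break and an unrelated type="page" attribute sit in the same run. A minimal sketch of a stricter variant follows, assuming the run's XML string (for example run._element.xml) declares the standard WordprocessingML namespace. The function name count_page_breaks is hypothetical and is not part of python-docx; it only uses the standard library to parse the string.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_page_breaks(run_xml):
    """Return (soft, hard) page-break counts for one serialized <w:r> element.

    Hypothetical helper, not a python-docx API: run_xml is assumed to be an
    XML string such as run._element.xml, including its xmlns:w declaration.
    """
    root = ET.fromstring(run_xml)
    # soft breaks: <w:lastRenderedPageBreak/> elements
    soft = sum(1 for _ in root.iter("{%s}lastRenderedPageBreak" % W_NS))
    # hard breaks: <w:br> elements whose w:type attribute is "page"
    hard = sum(
        1
        for br in root.iter("{%s}br" % W_NS)
        if br.get("{%s}type" % W_NS) == "page"
    )
    return soft, hard
```

With python-docx this would be called once per run, for example soft, hard = count_page_breaks(run._element.xml). Note that the lastRenderedPageBreak marker reflects the pagination from the last time Word rendered the file, so it may be missing or stale in documents produced or edited by other tools.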
