Extracting text from highlighted annotations in a PDF file


Problem description

Since yesterday I have been trying to extract the text from some highlighted annotations in a PDF, using python-poppler-qt4.

According to this documentation, it looks like I have to get the text with the Page.text() method, passing it a Rectangle argument from the highlighted annotation, which I get from Annotation.boundary(). But I only get blank text. Can someone help me? I copied my code below and added a link to the PDF I am using. Thanks for any help!

import popplerqt4
import sys
import PyQt4


def main():

    doc = popplerqt4.Poppler.Document.load(sys.argv[1])
    total_annotations = 0
    for i in range(doc.numPages()):
        page = doc.page(i)
        annotations = page.annotations()
        if len(annotations) > 0:
            for annotation in annotations:
                if  isinstance(annotation, popplerqt4.Poppler.Annotation):
                    total_annotations += 1
                    if(isinstance(annotation, popplerqt4.Poppler.HighlightAnnotation)):
                        print str(page.text(annotation.boundary()))
    if total_annotations > 0:
        print str(total_annotations) + " annotation(s) found"
    else:
        print "no annotations found"

if __name__ == "__main__":
    main()

Test PDF: https://www.dropbox.com/s/10plnj67k9xd1ot/test.pdf

Recommended answer

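Looking at the documentation for Annotations, it seems that the boundary property "returns this annotation's boundary rectangle in normalized coordinates". That would explain the blank text: a rectangle whose coordinates all lie between 0 and 1 covers almost none of the page when Page.text() interprets it in page units. Although normalized coordinates seem a strange decision, we can simply scale them by the page.pageSize().width() and .height() values.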

import popplerqt4
import sys
import PyQt4


def main():

    doc = popplerqt4.Poppler.Document.load(sys.argv[1])
    total_annotations = 0
    for i in range(doc.numPages()):
        #print("========= PAGE {} =========".format(i+1))
        page = doc.page(i)
        annotations = page.annotations()
        (pwidth, pheight) = (page.pageSize().width(), page.pageSize().height())
        if len(annotations) > 0:
            for annotation in annotations:
                if  isinstance(annotation, popplerqt4.Poppler.Annotation):
                    total_annotations += 1
                    if(isinstance(annotation, popplerqt4.Poppler.HighlightAnnotation)):
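                        # A highlight can consist of several quads (regions).
                        quads = annotation.highlightQuads()
                        txt = ""
                        for quad in quads:
                            # Quad points are normalized; scale them to page units.
                            rect = (quad.points[0].x() * pwidth,
                                    quad.points[0].y() * pheight,
                                    quad.points[2].x() * pwidth,
                                    quad.points[2].y() * pheight)
                            bdy = PyQt4.QtCore.QRectF()
                            bdy.setCoords(*rect)
                            # Append this quad's text, separated by a space.
                            txt = txt + unicode(page.text(bdy)) + ' '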

                        #print("========= ANNOTATION =========")
                        print(unicode(txt))

    if total_annotations > 0:
        print str(total_annotations) + " annotation(s) found"
    else:
        print "no annotations found"

if __name__ == "__main__":
    main()
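
The scaling step in the code above can be isolated into a small helper that needs neither Poppler nor Qt, which makes it easy to check by hand. This is only a sketch: scale_normalized_rect is a hypothetical name, the corner tuples stand in for quad.points[0] and quad.points[2], and the 612 x 792 page size (US Letter, in points) is an assumed value, not one read from the test PDF. It is written for Python 3, unlike the Python 2 listing above.

```python
def scale_normalized_rect(p0, p2, page_width, page_height):
    """Scale two opposite corners from normalized (0..1) to page units.

    p0 and p2 are (x, y) tuples; the result is (x1, y1, x2, y2), the
    argument order expected by QRectF.setCoords().
    """
    x1, y1 = p0
    x2, y2 = p2
    return (x1 * page_width, y1 * page_height,
            x2 * page_width, y2 * page_height)


# Assumed example values: a quad in the middle of a US Letter page.
rect = scale_normalized_rect((0.25, 0.5), (0.75, 0.625), 612.0, 792.0)
print(rect)
```

The returned tuple can be unpacked into bdy.setCoords(*rect) exactly as in the listing above.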

Additionally, I decided to concatenate the text from each of the .highlightQuads() regions to get a better representation of what was actually highlighted.

Please be aware of the explicit <space> I have appended to the text of each quad region.
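
The separator matters because a highlight that wraps over several lines typically yields one quad per line, so without it the last word of one region would run into the first word of the next. A minimal Python 3 sketch of the same idea, using plain strings in place of the values returned by page.text(); join_quad_texts and the sample fragments are hypothetical:

```python
def join_quad_texts(fragments):
    """Join per-quad text fragments with single spaces, skipping empty ones."""
    cleaned = (fragment.strip() for fragment in fragments)
    return " ".join(fragment for fragment in cleaned if fragment)


# Hypothetical fragments, as if a highlight spanned three regions.
fragments = ["highlighted text that", "", "continues on the next line "]
joined = join_quad_texts(fragments)
print(joined)
```

Unlike the loop in the answer's code, this variant leaves no trailing space and drops regions that returned no text.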

In the example document, the returned QString could not be passed directly to print() or str(). The solution was to use unicode() instead.

I hope this helps someone as it helped me.

Note: Page rotation may affect the scaling values; I have not been able to test this.
