PdfBox - Get font information


Problem description

I'm trying to extract the text that lies inside Square annotations in a PDF. I use the code below to extract the text with PDFBox.

Code:

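// Imports needed by this snippet (PDFBox 1.x API, as used in the question).
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.PDFTextStripperByArea;

// `usr` (user name) and `testTxt` (accumulated text) are declared elsewhere.
PDDocument document = null;
try {
    document = PDDocument.load(new File("//Users//" + usr + "//Desktop//BoldTest2 2.pdf"));
    List allPages = document.getDocumentCatalog().getAllPages();
    for (int i = 0; i < allPages.size(); i++) {
        PDPage page = (PDPage) allPages.get(i);
        // Fonts declared in the page resources (not used further below).
        Map<String, PDFont> pageFonts = page.getResources().getFonts();
        List<PDAnnotation> la = page.getAnnotations();
        for (int f = 0; f < la.size(); f++) {
            PDAnnotation pdfAnnot = la.get(f);
            // Only Square annotations mark a region to extract.
            if (!"Square".equals(pdfAnnot.getSubtype())) {
                continue;
            }
            PDFTextStripperByArea stripper = new PDFTextStripperByArea();
            stripper.setSortByPosition(true);
            PDRectangle rect = pdfAnnot.getRectangle();

            float x = 0;
            float y = 0;
            float width = 0;
            float height = 0;
            int rotation = page.findRotation();

            // Only unrotated pages are handled; otherwise the region stays empty.
            if (rotation == 0) {
                x = rect.getLowerLeftX();
                y = rect.getUpperRightY() - 2;
                width = rect.getWidth();
                height = rect.getHeight();
                // Flip y: PDF rectangles are measured from the bottom of the page.
                PDRectangle pageSize = page.findMediaBox();
                y = pageSize.getHeight() - y;
            }
            Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
            stripper.addRegion(Integer.toString(f), awtRect);
            stripper.extractRegions(page);
            // The asker's own class (not shown); unused in this snippet.
            PrintTextLocation2 prt = new PrintTextLocation2();
            testTxt = testTxt + "\n " + stripper.getTextForRegion(Integer.toString(f));
        }
    }
} catch (Exception ex) {
    // Report failures instead of silently swallowing them.
    ex.printStackTrace();
} finally {
    if (document != null) {
        try {
            document.close();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}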

With this code I can only get the plain PDF text. How can I also get the font information, such as bold and italic, together with the text? Advice or references are highly appreciated.
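For reference, the y-coordinate flip in the code above can be isolated into a small helper. This is only a sketch: the class RegionFlip and the method toTopLeftRegion are hypothetical names, the page height of 792 is just a sample value, and the 2-unit adjustment from the question is left out. It assumes regions are measured from the top-left corner, which is what subtracting from the page height in the question implies.

```java
import java.awt.geom.Rectangle2D;

class RegionFlip {
    // Hypothetical helper: turn a PDF-space rectangle (origin at the bottom-left,
    // y growing upwards) into a rectangle measured from the top-left corner.
    static Rectangle2D.Float toTopLeftRegion(float lowerLeftX, float upperRightY,
                                             float width, float height,
                                             float pageHeight) {
        // The top edge of the rectangle, measured down from the top of the page.
        float top = pageHeight - upperRightY;
        return new Rectangle2D.Float(lowerLeftX, top, width, height);
    }

    public static void main(String[] args) {
        // Sample values only: a 200 x 50 box whose top edge is at y = 700
        // on a page that is 792 units high.
        Rectangle2D.Float region = toTopLeftRegion(100f, 700f, 200f, 50f, 792f);
        System.out.println(region.x + " " + region.y + " "
                + region.width + " " + region.height);
    }
}
```

Keeping the conversion in one place makes it easier to add handling for rotated pages later, which the question's code does not cover.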

Recommended answer

The PDFTextStripper class, which PDFTextStripperByArea extends, normalizes the text, i.e. removes its formatting (cf. the JavaDoc comment in http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java):

* This class will take a pdf document and strip out all of the text and ignore the
* formatting and such.

If you look at the source, you will see that the font information is available in this class, but it is normalized away before the text is written out:

protected void writePage() throws IOException
{
    [...]
        List<TextPosition> line = new ArrayList<TextPosition>();
        [...]
            if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
            {
                writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                line.clear();
                [...]
            }
    [...]
}

The TextPosition instances in the ArrayList carry all the formatting information. A solution can therefore redefine the existing methods as required. Here are a few options:


  • private List normalize(List line, boolean isRtlDominant, boolean hasRtl)

If you want your own normalize method, note that the method is private and so cannot be overridden in a subclass. Instead, copy the whole PDFTextStripper class (http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java) into your project and change the code of the copy. Call this new class MyPDFTextStripper and define the new method as required. Similarly, copy PDFTextStripperByArea (http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripperByArea.java) as MyPDFTextStripperByArea, which would extend MyPDFTextStripper.


  • protected void writePage()

If you only need a new writePage method, you can simply extend PDFTextStripper and override this method, then create MyPDFTextStripperByArea as described above.


  • writeLine(normalize(line, isRtlDominant, hasRtl), isRtlDominant)

Another solution might store the pre-normalization information in some variable and then use it in an overridden writeLine method.
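The idea behind the last option can be sketched without PDFBox. The following is only an illustration under stated assumptions: Glyph is a hypothetical stand-in for TextPosition that holds one piece of text plus its font name, and styleOf guesses the style from the font name. That guess is a heuristic, since not every font has "Bold" or "Italic" in its name. In a real subclass, the font name would come from the font of each TextPosition rather than from hand-built sample data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class FontRunSketch {
    // Hypothetical stand-in for a TextPosition: text plus the name of its font.
    static final class Glyph {
        final String text;
        final String fontName;

        Glyph(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    // Heuristic: derive a style label from the font name.
    static String styleOf(String fontName) {
        String name = fontName.toLowerCase(Locale.ROOT);
        boolean bold = name.contains("bold");
        boolean italic = name.contains("italic") || name.contains("oblique");
        if (bold && italic) return "BOLD ITALIC";
        if (bold) return "BOLD";
        if (italic) return "ITALIC";
        return "REGULAR";
    }

    // Keep the style that normalization would drop: merge consecutive glyphs
    // that share a style into runs of the form "[STYLE]text".
    static List<String> toStyledRuns(List<Glyph> line) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentStyle = null;
        for (Glyph glyph : line) {
            String style = styleOf(glyph.fontName);
            if (!style.equals(currentStyle)) {
                // Style changed: close the previous run, if there was one.
                if (currentStyle != null) {
                    runs.add("[" + currentStyle + "]" + current);
                }
                current.setLength(0);
                currentStyle = style;
            }
            current.append(glyph.text);
        }
        if (currentStyle != null) {
            runs.add("[" + currentStyle + "]" + current);
        }
        return runs;
    }

    public static void main(String[] args) {
        // Sample data only; the font names are illustrative.
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("H", "Arial-BoldMT"));
        line.add(new Glyph("i", "Arial-BoldMT"));
        line.add(new Glyph(" ", "ArialMT"));
        line.add(new Glyph("you", "Arial-BoldItalicMT"));
        System.out.println(toStyledRuns(line));
    }
}
```

Grouping by style before the text is flattened to plain strings is the point of the sketch: once normalization has run, only the characters remain and the font can no longer be recovered.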

Hope this helps.
