PDFBox: reading a PDF line by line and extracting text properties


Question

I am using PDFBox to extract text from PDF files. I read the PDF document as follows:

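    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper; // package as of PDFBox 1.8.x

    String text = "";
    PDDocument pdoc = null;
    File file = new File("path");

    try {
        // Parse the file into a low-level COS document (PDFBox 1.8.x constructor).
        PDFParser parser = new PDFParser(new FileInputStream(file));
        parser.parse();
        COSDocument cdoc = parser.getDocument();
        pdoc = new PDDocument(cdoc);

        // Extract the text of pages 1 and 2 only.
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(1);
        stripper.setEndPage(2);
        text = stripper.getText(pdoc);
        System.out.println(text);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // Release the document's resources even if extraction failed.
        if (pdoc != null) {
            try {
                pdoc.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }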

But what I want to do is read the document line by line and extract text properties such as bold and italic from each line. How can I achieve this with the PDFBox library?

Recommended answer

extract the text properties such as bold,italic, from each line. How can I achieve this with pdfbox library

Properties such as bold and italic are not first-class properties in a PDF.

Bold or italic text can be produced using

  • different fonts (which is the better way); in this case one can try to determine whether or not a font is bold or italic by

      • looking at the font name: it may contain a substring such as "bold", "italic", or "oblique";

      • looking at some optional properties of the font, e.g. the font weight;

      • inspecting the embedded font file.

    None of these methods is fool-proof; or

  • using the same font as for non-bold, non-italic text but applying special techniques to make it appear bold or italic (aka poor man's bold), e.g.

      • not only filling the glyph contours but also drawing a thicker line along them for a bold impression,

      • drawing the glyph twice, the second time slightly displaced, also for a bold impression,

      • using a text or transformation matrix to slant the letters for an italic impression.
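
Some of the tests listed above can be sketched as small predicates. The following is only an illustration, written in plain Java so that it does not depend on a particular PDFBox version; the class name `SyntheticStyleChecks`, the method names, and the slant threshold are all assumptions made for this example. In PDFBox the inputs would typically come from the font descriptor (font weight), the text state (rendering mode), and the text matrix of a text position, but the exact accessors differ between versions and should be checked against the release in use. The "glyph drawn twice" technique is not covered here, because detecting it requires comparing the positions of consecutive glyphs.

```java
/** Heuristic style tests; names and thresholds are illustrative assumptions. */
final class SyntheticStyleChecks {

    /** The PDF FontWeight entry uses the values 100..900: 400 is normal, 700 is bold. */
    static boolean weightSuggestsBold(float fontWeight) {
        return fontWeight >= 700f;
    }

    /** Text rendering modes 2 and 6 fill the glyph and then stroke its outline. */
    static boolean renderingModeSuggestsBold(int renderingMode) {
        return renderingMode == 2 || renderingMode == 6;
    }

    /**
     * Slant test on the linear part [a b c d] of a text matrix.
     * With only rotation and scaling, the base vectors (a, b) and (c, d)
     * are usually perpendicular; a slant makes them deviate from a right angle.
     */
    static boolean matrixSuggestsItalic(double a, double b, double c, double d) {
        double lenX = Math.hypot(a, b);
        double lenY = Math.hypot(c, d);
        if (lenX == 0 || lenY == 0) {
            return false; // degenerate matrix, nothing to conclude
        }
        double cosine = (a * c + b * d) / (lenX * lenY);
        return Math.abs(cosine) > 0.1; // assumed threshold, roughly 6 degrees of slant
    }
}
```

For example, `matrixSuggestsItalic(1, 0, 0.2, 1)` reports a slant, while the identity matrix `(1, 0, 0, 1)` does not. Like the tests described above, these checks can produce false positives and negatives, so they are best combined.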

By overriding the relevant PDFTextStripper methods and applying such tests, you may achieve a fairly good guess rate for styles during PDF text extraction.
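
To make the idea concrete, here is a minimal sketch of the per-line logic such an override might contain. It is plain Java and does not call PDFBox: the `Glyph` class is a hypothetical stand-in for the text positions that a `PDFTextStripper` subclass receives for each line, and `LineStyler`, `guessStyle`, and `styledRuns` are names invented for this example. Only the font-name test is used, so the result is a guess rather than a reliable classification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for one extracted glyph: its text and its font's name. */
final class Glyph {
    final String text;
    final String fontName;

    Glyph(String text, String fontName) {
        this.text = text;
        this.fontName = fontName;
    }
}

final class LineStyler {

    /** Guesses a style label from the font name alone; not fool-proof. */
    static String guessStyle(String fontName) {
        String name = fontName == null ? "" : fontName.toLowerCase(Locale.ROOT);
        boolean bold = name.contains("bold");
        boolean italic = name.contains("italic") || name.contains("oblique");
        if (bold && italic) return "bold-italic";
        if (bold) return "bold";
        if (italic) return "italic";
        return "regular";
    }

    /** Merges consecutive glyphs of one line that share a guessed style into "[style] text" runs. */
    static List<String> styledRuns(List<Glyph> line) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentStyle = null;
        for (Glyph glyph : line) {
            String style = guessStyle(glyph.fontName);
            if (!style.equals(currentStyle)) {
                // The style changed: flush the run collected so far, if any.
                if (currentStyle != null) {
                    runs.add("[" + currentStyle + "] " + current);
                }
                current.setLength(0);
                currentStyle = style;
            }
            current.append(glyph.text);
        }
        if (currentStyle != null) {
            runs.add("[" + currentStyle + "] " + current);
        }
        return runs;
    }
}
```

In a real subclass, the glyph text and font name would be read from the text positions passed to the overridden method, and the other tests (font weight, rendering mode, text matrix) could be added to `guessStyle` to improve the guess rate.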
