PDFBOX, Reading a pdf line by line and extracting text properties

Views: 412 | Published: 2021/6/15 18:30:38 | Tag: pdfbox

Problem description

I am using pdfbox to extract text from PDF files. I read the PDF document as follows:

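    // Imports assume PDFBox 1.8.x, which the PDFParser(InputStream) constructor implies.
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    File file = new File("path");
    PDDocument pdoc = null;

    // A single try block: if the parser cannot be created, nothing below runs on a null parser.
    try (InputStream in = new FileInputStream(file)) {
        // Parse the file into a low-level COS document and wrap it in a PDDocument.
        PDFParser parser = new PDFParser(in);
        parser.parse();
        COSDocument cdoc = parser.getDocument();
        pdoc = new PDDocument(cdoc);

        // Extract the text of pages 1 and 2 (page numbers are 1-based and inclusive).
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(1);
        stripper.setEndPage(2);
        String text = stripper.getText(pdoc);
        System.out.println(text);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // Release the document's resources even if extraction failed.
        if (pdoc != null) {
            try {
                pdoc.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }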

But what I want to do is read the document line by line and extract text properties such as bold and italic from each line. How can I achieve this with the pdfbox library?

Recommended answer

extract the text properties such as bold, italic, from each line. How can I achieve this with pdfbox library

Properties such as bold and italic are not first-class properties in a PDF.

Bold or italic text can be produced in either of two ways:

- using different fonts (which is the better way); in this case one can try to determine whether or not the fonts are bold or italic by
  - looking at the font name: it may contain a substring "bold", "italic", "oblique"...
  - looking at some optional properties of the font, e.g. font weight...
  - inspecting the embedded font file.

  None of these methods is fool-proof; or

- using the same font as for non-bold, non-italic text but using special techniques to make it appear bold or italic (aka poor man's bold), e.g.
  - not only filling the glyph contours but also drawing a thicker line along them for a bold impression,
  - drawing the glyph twice, the second time slightly displaced, also for a bold impression,
  - using a text or transformation matrix to slant the letters for an italic impression.
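Several of the tests listed above can be sketched in plain Java, independent of any PDF library. The following is a minimal illustration and not part of pdfbox: the class `StyleHeuristics`, its method names, and the 0.05 threshold are all assumptions made for this example. It assumes you have already obtained the font name, the text rendering mode, and the four scaling/shear components `a b c d` of the text matrix from whatever extraction API you use.

```java
import java.util.Locale;

/** Hypothetical helper: style guesses from a font name, rendering mode, and text matrix. */
public class StyleHeuristics {

    /** Removes the subset tag (six uppercase letters plus '+') that subset fonts carry. */
    static String baseName(String fontName) {
        return fontName == null ? "" : fontName.replaceFirst("^[A-Z]{6}\\+", "");
    }

    /** Guess: the font name contains the substring "bold". */
    static boolean nameSuggestsBold(String fontName) {
        return baseName(fontName).toLowerCase(Locale.ROOT).contains("bold");
    }

    /** Guess: the font name contains the substring "italic" or "oblique". */
    static boolean nameSuggestsItalic(String fontName) {
        String n = baseName(fontName).toLowerCase(Locale.ROOT);
        return n.contains("italic") || n.contains("oblique");
    }

    /** Guess at poor man's bold: text rendering mode 2 fills the glyph and then strokes its outline. */
    static boolean renderingSuggestsFauxBold(int renderingMode) {
        return renderingMode == 2;
    }

    /**
     * Guess at a synthetic italic: in a matrix [a b c d e f] the x axis maps to (a, b)
     * and the y axis to (c, d). Rotation and scaling keep the two axes perpendicular,
     * a shear does not. The 0.05 threshold (about 3 degrees) is an arbitrary assumption.
     */
    static boolean matrixSuggestsSlant(double a, double b, double c, double d) {
        double lenX = Math.hypot(a, b);
        double lenY = Math.hypot(c, d);
        if (lenX == 0 || lenY == 0) {
            return false; // degenerate matrix, no usable direction
        }
        double cosBetweenAxes = (a * c + b * d) / (lenX * lenY);
        return Math.abs(cosBetweenAxes) > 0.05;
    }
}
```

As the answer notes, none of these checks is fool-proof: a font can be bold without saying so in its name, and a sheared matrix may be used for reasons other than italics.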

By overriding the PDFTextStripper methods with such tests accordingly, you may achieve a fairly good guess rate for styles during PDF text extraction.
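To make the line-by-line part concrete, here is a rough sketch of the bookkeeping such an override might delegate to. It is plain Java with no pdfbox dependency. `StyledLineCollector`, `onText`, `onLineEnd`, and the bracketed tag format are hypothetical names invented for this illustration. The assumption is that a PDFTextStripper subclass would call `onText` from its overridden `writeString(String, List<TextPosition>)` with the font name of the run (for example via `TextPosition.getFont().getName()` in PDFBox 2.x) and `onLineEnd` from `writeLineSeparator()`. Check those hooks against the PDFBox version you actually use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical collector: groups text runs into lines and marks style changes. */
public class StyledLineCollector {
    private final List<String> lines = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private String openTag = ""; // style tag currently open on this line, "" if none

    /** Call once per text run with the run's text and the name of its font. */
    public void onText(String text, String fontName) {
        String tag = styleTag(fontName);
        if (!tag.equals(openTag)) {
            closeTag();
            if (!tag.isEmpty()) {
                current.append('[').append(tag).append(']');
            }
            openTag = tag;
        }
        current.append(text);
    }

    /** Call when the extractor reports the end of a line. */
    public void onLineEnd() {
        closeTag();
        lines.add(current.toString());
        current.setLength(0);
    }

    /** Returns the lines collected so far, with style tags embedded. */
    public List<String> lines() {
        return lines;
    }

    private void closeTag() {
        if (!openTag.isEmpty()) {
            current.append("[/").append(openTag).append(']');
        }
        openTag = "";
    }

    /** Name-based style guess only; as noted in the answer, this is not fool-proof. */
    static String styleTag(String fontName) {
        String n = fontName == null ? "" : fontName.toLowerCase(Locale.ROOT);
        boolean bold = n.contains("bold");
        boolean italic = n.contains("italic") || n.contains("oblique");
        if (bold && italic) return "bold-italic";
        if (bold) return "bold";
        if (italic) return "italic";
        return "";
    }
}
```

Feeding it a run in a font named "Helvetica-Bold" followed by a run in "Helvetica" and then a line end yields one line in which only the first run is wrapped in `[bold]...[/bold]`. Keeping this logic outside the stripper subclass makes it easy to test without a PDF at hand.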
