PDFBox: reading a PDF line by line and extracting text properties
Question
I am using PDFBox to extract text from PDF files. I read the PDF document as follows:
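// Package names below are those of PDFBox 1.8.x; in 2.x PDFTextStripper lives in org.apache.pdfbox.text.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

File file = new File("path");
PDDocument pdoc = null;
try {
    // Parse the file into a low-level COS document and wrap it in a PDDocument.
    PDFParser parser = new PDFParser(new FileInputStream(file));
    parser.parse();
    COSDocument cdoc = parser.getDocument();
    pdoc = new PDDocument(cdoc);

    // Extract the plain text of pages 1 and 2 (page numbers are 1-based).
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setStartPage(1);
    stripper.setEndPage(2);
    String text = stripper.getText(pdoc);
    System.out.println(text);
} catch (IOException e) {
    e.printStackTrace();
} finally {
    // Close the document even if parsing or extraction failed.
    if (pdoc != null) {
        try {
            pdoc.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}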
But what I want to do is read the document line by line and extract text properties such as bold and italic from each line. How can I achieve this with the PDFBox library?
Recommended answer
"extract the text properties such as bold, italic, from each line. How can I achieve this with pdfbox library"
Properties such as bold and italic are not first-class properties in a PDF.
Bold or italic text can be produced by
- using different fonts (which is the better way); in this case one can try to determine whether or not the fonts are bold or italic by
  - looking at the font name: it may contain a substring such as "bold", "italic", or "oblique";
  - looking at some optional properties of the font, e.g. the font weight;
  - inspecting the embedded font file.
  None of these methods is foolproof; or
- using the same font as for non-bold, non-italic text but applying special techniques to make it appear bold or italic (aka poor man's bold), e.g.
  - not only filling the glyph contours but also drawing a thicker line along them for a bold impression,
  - drawing the glyph twice, the second time slightly displaced, also for a bold impression,
  - using a text or transformation matrix to slant the letters for an italic impression.
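Two of the tests above can be sketched in plain Java without any PDFBox dependency. This is only an illustration under stated assumptions: the class FontStyleGuess, its method names, and the 0.1 shear threshold are hypothetical choices, not part of PDFBox. The name test looks for the substrings mentioned above after dropping a subset prefix such as "ABCDEF+"; the matrix test takes the a, b, c, d entries of a text matrix and checks whether the two axes it produces are still roughly perpendicular.

```java
import java.util.Locale;

/** Heuristic style guesses; none of them is reliable on its own. */
final class FontStyleGuess {

    /** Drops a subset prefix such as "ABCDEF+" and lower-cases the name. */
    static String normalize(String fontName) {
        if (fontName == null) {
            return "";
        }
        String name = fontName;
        // Subset fonts are tagged with six uppercase letters followed by '+'.
        if (name.matches("[A-Z]{6}\\+.*")) {
            name = name.substring(7);
        }
        return name.toLowerCase(Locale.ROOT);
    }

    static boolean nameSuggestsBold(String fontName) {
        return normalize(fontName).contains("bold");
    }

    static boolean nameSuggestsItalic(String fontName) {
        String n = normalize(fontName);
        return n.contains("italic") || n.contains("oblique");
    }

    /**
     * Guesses a synthetic slant from the matrix entries [a b c d].
     * (a, b) is the image of the x axis and (c, d) the image of the y axis;
     * pure scaling or rotation keeps them perpendicular, a shear does not.
     */
    static boolean matrixSuggestsSlant(double a, double b, double c, double d) {
        double lenX = Math.hypot(a, b);
        double lenY = Math.hypot(c, d);
        if (lenX == 0 || lenY == 0) {
            return false; // degenerate matrix, nothing to conclude
        }
        double cosine = (a * c + b * d) / (lenX * lenY);
        // Assumed threshold: treat more than about 6 degrees of shear as a slant.
        return Math.abs(cosine) > 0.1;
    }

    public static void main(String[] args) {
        System.out.println(nameSuggestsBold("ABCDEF+Arial-BoldMT"));
        System.out.println(nameSuggestsItalic("Helvetica-Oblique"));
        System.out.println(matrixSuggestsSlant(12, 0, 2.4, 12));
    }
}
```

The font weight and embedded font file tests are not covered here, and a font whose name happens to contain "bold" for other reasons will be misclassified, which is why such checks are better combined than trusted individually.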
By overriding the relevant PDFTextStripper methods and applying such tests there, you may achieve a fairly good guess rate for styles during PDF text extraction.
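What such an override might do with the glyphs of one line can be modelled without PDFBox. In the sketch below, the Glyph class is a hypothetical stand-in for the per-glyph data PDFBox hands to the stripper (a TextPosition, whose getFont() gives access to the font), the bold test is a deliberately naive name check, and the [b]...[/b] markers are an arbitrary output format. The idea is simply to compare the guessed style of consecutive glyphs and emit a marker whenever it changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class StyledLineSketch {

    /** Hypothetical stand-in for one extracted glyph: its text and font name. */
    static final class Glyph {
        final String text;
        final String fontName;

        Glyph(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    /** Naive guess: the font name contains "bold" (case-insensitive). */
    static boolean isBold(String fontName) {
        return fontName != null
                && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    /** Wraps each run of bold glyphs in [b]...[/b]. */
    static String markBoldRuns(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        boolean inBold = false;
        for (Glyph g : line) {
            boolean bold = isBold(g.fontName);
            if (bold != inBold) {
                // The style changed between two glyphs: open or close a run.
                out.append(bold ? "[b]" : "[/b]");
                inBold = bold;
            }
            out.append(g.text);
        }
        if (inBold) {
            out.append("[/b]"); // close a run that reaches the end of the line
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("Plain ", "Helvetica"));
        line.add(new Glyph("strong", "Helvetica-Bold"));
        line.add(new Glyph(" text", "Helvetica"));
        System.out.println(markBoldRuns(line)); // prints "Plain [b]strong[/b] text"
    }
}
```

In a real PDFTextStripper subclass the same loop would run over the text positions the stripper supplies for each chunk of a line, and the bold test would be replaced by whichever combination of font name, font descriptor, and rendering checks works for the documents at hand.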