iText - Get font size and family of a text segment


Question

I'm currently trying to automatically extract important keywords from a PDF file. I am able to get the text out of the PDF document, but now I need to know which font size and font family these keywords have.

I already have the following code:

Main

public static void main(String[] args) throws IOException {
    String src = "SEM_081145.pdf";

    PdfReader reader = new PdfReader(src);

    SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();

    PrintWriter out = new PrintWriter(new FileOutputStream(src + ".txt"));
    Rectangle rect = new Rectangle(70, 80, 490, 580);
    RenderFilter filter = new RegionTextRenderFilter(rect);

    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        // strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        out.println(PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy));
    }
    out.flush();
    out.close();
}

I have also implemented a TextExtractionStrategy, SemTextExtractionStrategy, which looks like this:

public class SemTextExtractionStrategy implements TextExtractionStrategy {

    private String text;

    @Override
    public void beginTextBlock() {
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        text = renderInfo.getText();

        System.out.println(renderInfo.getFont().getFontType());

        System.out.print(text);
    }

    @Override
    public void endTextBlock() {
    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
    }

    @Override
    public String getResultantText() {
        return text;
    }
}

I can get the font type, but there is no method to get the font size. Is there another way to get the font size of the current text segment?

Or are there other libraries that can retrieve the font size of text segments? I have already looked into PDFBox and PDFTextStream. The PDF shareware library from Aspose would do the job perfectly, but it is very expensive and I need to use an open source project.

Recommended answer

You can adapt the code provided in this answer, in particular this snippet:

Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Single curFontSize = rect.Height;

This answer is in C#, but the API is so similar that the conversion to Java should be straightforward.
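
As a rough illustration of that conversion, here is a minimal Java sketch of the arithmetic the C# snippet performs. It is not taken from the linked answer. The class and method names (FontSizeEstimate, estimateFontSize) are hypothetical, and the iText 5 calls shown in the comments are the presumed Java counterparts of the C# ones. Note that the value is the distance from the baseline to the ascent line, so it only approximates the nominal font size, and the sketch assumes unrotated text. For the family, the font object already used in the question (renderInfo.getFont()) should expose a name via getPostscriptFontName(), though that is worth verifying against your iText version.

```java
// Sketch only: plain-Java arithmetic behind the C# snippet, no iText dependency.
// In an iText 5 renderText(TextRenderInfo renderInfo) callback, the two inputs
// would presumably come from (assumed Java equivalents of the C# calls):
//   Vector curBaseline = renderInfo.getBaseline().getStartPoint();
//   Vector topRight    = renderInfo.getAscentLine().getEndPoint();
//   float baselineY    = curBaseline.get(Vector.I2);
//   float ascentY      = topRight.get(Vector.I2);
class FontSizeEstimate {

    /**
     * Hypothetical helper: height of the rectangle spanned by the baseline
     * start point and the ascent-line end point. Assumes unrotated text, so
     * only the y coordinates matter.
     */
    static float estimateFontSize(float baselineY, float ascentY) {
        // Mirrors rect.Height in the C# snippet: upper y minus lower y.
        return ascentY - baselineY;
    }

    public static void main(String[] args) {
        // Made-up coordinates for illustration, not read from a real PDF.
        float size = estimateFontSize(100f, 109f);
        System.out.println("Estimated font size: " + size);
    }
}
```

Inside SemTextExtractionStrategy, the same subtraction could be done directly in renderText and stored next to the extracted text, which keeps each text chunk paired with its estimated size.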
