Get font of each line using PDFBox


Problem description

Is there a way to get the font of each line of a PDF file using PDFBox? I have tried this, but it just lists all the fonts used on that page. It does not show which line or text is shown in which font.

List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
for (PDPage page : pages)
{
    Map<String, PDFont> pageFonts = page.getResources().getFonts();
    for (String key : pageFonts.keySet())
    {
        System.out.println(key + " - " + pageFonts.get(key));
        System.out.println(pageFonts.get(key).getBaseFont());
    }
}

Any input is appreciated. Thanks!

Solution

Whenever you want to extract text (plain or with styling information) from a PDF using PDFBox, you should generally start with the PDFTextStripper class or one of its relatives. This class already does all the heavy lifting involved in PDF content parsing for you.

You use the plain PDFTextStripper class like this:

PDDocument document = ...;
PDFTextStripper stripper = new PDFTextStripper();
// set stripper start and end page or bookmark attributes unless you want all the text
String text = stripper.getText(document);

This returns only the plain text, e.g. for an R40 form:

Claim for repayment of tax deducted 
from savings and investments
How to fill in this form
Please fill in this form with details of your income for the
above tax year. The enclosed Notes will help you (but there is
not a note for every box on the form). If you need more help
with anything on this form, please phone us on the number
shown above.
If you are not a UK resident, do not use this form – please 
contact us.
Please do not send us any personal records, or tax
certificates or vouchers with your form. We will contact 
you if we need these.
Please allow four weeks before contacting us about your
repayment. We will pay you as quickly as possible.
Use black ink and capital letters
Cross out any mistakes and write the
correct information below
...

You can, on the other hand, override its method writeString(String, List<TextPosition>) and process more information than the mere text. To add the name of the font in use wherever the font changes, you can use this:

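// Imports for PDFBox 1.8.x, the API generation this code is written against
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

// Note: the PDFTextStripper constructor itself is declared to throw IOException
PDFTextStripper stripper = new PDFTextStripper() {
    // Font name of the most recently written glyph
    String prevBaseFont = "";

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        StringBuilder builder = new StringBuilder();

        for (TextPosition position : textPositions)
        {
            String baseFont = position.getFont().getBaseFont();
            // Emit a [font name] marker only where the font changes
            if (baseFont != null && !baseFont.equals(prevBaseFont))
            {
                builder.append('[').append(baseFont).append(']');
                prevBaseFont = baseFont;
            }
            builder.append(position.getCharacter());
        }

        // Hand the annotated text to the regular output path
        writeString(builder.toString());
    }
};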

For the same form you get:

[DHSLTQ+IRModena-Bold]Claim for repayment of tax deducted 
from savings and investments
How to fill in this form
[OIALXD+IRModena-Regular]Please fill in this form with details of your income for the
above tax year. The enclosed Notes will help you (but there is
not a note for every box on the form). If you need more help
with anything on this form, please phone us on the number
shown above.
If you are not a UK resident, do not use this form – please 
contact us.
[DHSLTQ+IRModena-Bold]Please do not send us any personal records, or tax
certificates or vouchers with your form. We will contact 
you if we need these.
[OIALXD+IRModena-Regular]Please allow four weeks before contacting us about your
repayment. We will pay you as quickly as possible.
Use black ink and capital letters
Cross out any mistakes and write the
correct information below
...
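
The six uppercase letters and the `+` in front of names such as `DHSLTQ+IRModena-Bold` are the tag PDF uses to mark an embedded font subset; the tag is arbitrary and can differ between files for the same typeface. If you want to compare or group fonts by their real name, a small helper along these lines could strip it. This is only a sketch: `SubsetTag` and `strip` are hypothetical names, not part of PDFBox, and the input is assumed to be whatever `getBaseFont()` returned.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of PDFBox: removes the subset tag from a font name. */
final class SubsetTag {
    // Six uppercase letters followed by '+', the prefix used for embedded font subsets
    private static final Pattern TAG = Pattern.compile("^[A-Z]{6}\\+");

    /** Returns the name without a leading subset tag; null stays null. */
    static String strip(String baseFont) {
        if (baseFont == null) {
            return null;
        }
        return TAG.matcher(baseFont).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(strip("DHSLTQ+IRModena-Bold"));    // prints IRModena-Bold
        System.out.println(strip("OIALXD+IRModena-Regular")); // prints IRModena-Regular
        System.out.println(strip("Helvetica"));               // prints Helvetica (no tag to remove)
    }
}
```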

If you don't want the font information to be merged with the text, simply build separate data structures in your overriding method.
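
One possible shape for such a separate structure is a list of "runs", each holding a font name and the text drawn in it. The sketch below is plain Java with no PDFBox dependency; `FontRunCollector` and `Run` are hypothetical names introduced here for illustration. The idea is that the overridden `writeString` would call `add(position.getCharacter(), position.getFont().getBaseFont())` for every `TextPosition` instead of appending bracketed markers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: collects text into per-font runs instead of tagging the output. */
final class FontRunCollector {

    /** One stretch of consecutive text drawn in a single font. */
    static final class Run {
        final String fontName;
        final String text;

        Run(String fontName, String text) {
            this.fontName = fontName;
            this.text = text;
        }
    }

    private final List<Run> runs = new ArrayList<Run>();
    private final StringBuilder current = new StringBuilder();
    private String currentFont = null;

    /** Call once per glyph; a null font name is treated as "font unchanged". */
    void add(String glyph, String fontName) {
        if (fontName != null && !fontName.equals(currentFont)) {
            flush();
            currentFont = fontName;
        }
        current.append(glyph);
    }

    /** Closes the run in progress and returns everything collected so far. */
    List<Run> finish() {
        flush();
        return Collections.unmodifiableList(runs);
    }

    // Stores the pending text, if any, as a finished run
    private void flush() {
        if (current.length() > 0) {
            runs.add(new Run(currentFont, current.toString()));
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        FontRunCollector collector = new FontRunCollector();
        collector.add("C", "IRModena-Bold");
        collector.add("l", "IRModena-Bold");
        collector.add("P", "IRModena-Regular");
        for (Run run : collector.finish()) {
            System.out.println(run.fontName + " -> " + run.text);
        }
    }
}
```

Keeping the runs apart from the extracted text means the stripper output stays clean, and the runs can be inspected afterwards in whatever way suits the application.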

TextPosition offers a lot more information on the piece of text it represents. Inspect it!
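
For instance, TextPosition also exposes coordinates such as `getY()`, which can help with the original goal of reporting fonts per line. The following is a minimal sketch of that idea in plain Java, not a PDFBox API: `Glyph` is a hypothetical stand-in for the values you would read off each `TextPosition`, and the tolerance-based line break is an assumption that suits simple single-column text but will split lines at superscripts or similar baseline shifts.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: groups glyphs into lines by vertical position and records the fonts per line. */
final class LineFontGrouper {

    /** Stand-in for the text, font name and y coordinate taken from one TextPosition. */
    static final class Glyph {
        final String text;
        final String fontName;
        final float y;

        Glyph(String text, String fontName, float y) {
            this.text = text;
            this.fontName = fontName;
            this.y = y;
        }
    }

    /** The text of one line plus the distinct fonts used on it, in order of first use. */
    static final class Line {
        final StringBuilder text = new StringBuilder();
        final Set<String> fonts = new LinkedHashSet<String>();
        final float y;

        Line(float y) {
            this.y = y;
        }
    }

    /**
     * Starts a new line whenever y differs from the line's first glyph by more than the tolerance.
     * Assumes the glyphs arrive in reading order.
     */
    static List<Line> group(List<Glyph> glyphs, float tolerance) {
        List<Line> lines = new ArrayList<Line>();
        Line current = null;
        for (Glyph glyph : glyphs) {
            if (current == null || Math.abs(glyph.y - current.y) > tolerance) {
                current = new Line(glyph.y);
                lines.add(current);
            }
            current.text.append(glyph.text);
            if (glyph.fontName != null) {
                current.fonts.add(glyph.fontName);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<Glyph>();
        glyphs.add(new Glyph("H", "IRModena-Bold", 100.0f));
        glyphs.add(new Glyph("i", "IRModena-Bold", 100.2f));
        glyphs.add(new Glyph("o", "IRModena-Regular", 112.0f));
        glyphs.add(new Glyph("k", "IRModena-Regular", 112.0f));
        for (Line line : group(glyphs, 1.0f)) {
            System.out.println(line.text + " " + line.fonts);
        }
    }
}
```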
