iText PdfTextExtractor getTextFromPage exception "Error reading string at file pointer"


Question

I am using iText's PdfTextExtractor to extract text from a PdfReader that is created from a byte array:

    byte[] pdfbytes = outputStream.toByteArray();

    PdfReader reader = new PdfReader(pdfbytes);

    int pagenumber = reader.getNumberOfPages();
    PdfTextExtractor extractor = new PdfTextExtractor(reader);

    for(int i = 1; i<= pagenumber; i++) {
        System.out.println("============PAGE NUMBER " + i + "=============" );
        String line = extractor.getTextFromPage(i);
        System.out.println(line);
    }

The first test PDF is http://www.gnostice.com/downloads/Gnostice_PathQuest.pdf. The first page prints fine, but I get the following exception on the second page.

Exception:

Exception in thread "main" ExceptionConverter: java.io.IOException: Error reading string at file pointer 238291
at com.lowagie.text.pdf.PRTokeniser.throwError(Unknown Source)
at com.lowagie.text.pdf.PRTokeniser.nextToken(Unknown Source)
at com.lowagie.text.pdf.PdfContentParser.nextValidToken(Unknown Source)
at com.lowagie.text.pdf.PdfContentParser.readPRObject(Unknown Source)
at com.lowagie.text.pdf.PdfContentParser.parse(Unknown Source)
at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source)
at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source)
at org.xxx.services.pdfparser.xxxExtensionPdfParser.main(xxxExtensionPdfParser.java:114)

where xxxExtensionPdfParser.java:114 is the line String line = extractor.getTextFromPage(i);

With a second test PDF, http://www.irs.gov/pub/irs-pdf/fw4.pdf, I get the text content without any exception. So I think the exception must be caused by a format issue in the first PDF.

So my question is: what is this format issue, and is there any way to avoid it? Thanks.

Recommended answer

    // Assumption: iText 5.x (com.itextpdf), where getTextFromPage is a static
    // method and PdfTextExtractor is not instantiated.
    byte[] pdfbytes = outputStream.toByteArray();

    PdfReader reader = new PdfReader(pdfbytes);

    int pagenumber = reader.getNumberOfPages();

    // Pages are numbered from 1.
    for (int i = 1; i <= pagenumber; i++) {
        System.out.println("============PAGE NUMBER " + i + "=============");
        // Static call: pass the reader and the page number.
        String line = PdfTextExtractor.getTextFromPage(reader, i);
        System.out.println(line);
    }

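Replace your code with this and it should work. Note that the static call PdfTextExtractor.getTextFromPage(reader, i) belongs to the iText 5.x API (com.itextpdf packages), whereas the stack trace in the question comes from the older com.lowagie packages, so this change most likely also means moving to the newer library version.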
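
If some pages of a malformed PDF still fail to parse, one way to keep a single bad page from aborting the whole run is to catch the exception per page. The following is only a minimal sketch and not part of the original answer: the class SafePageExtraction, the method extractAll and its IntFunction parameter are hypothetical names, and the lambda in main is a stand-in for a real per-page call. It assumes the failure surfaces as an unchecked exception, as the ExceptionConverter in the stack trace above appears to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

public class SafePageExtraction {

    /**
     * Calls extractPage for pages 1..pageCount. When a page throws an
     * unchecked exception, a marker string is stored for that page and
     * the loop continues with the next page.
     */
    public static List<String> extractAll(int pageCount, IntFunction<String> extractPage) {
        List<String> pages = new ArrayList<>();
        for (int i = 1; i <= pageCount; i++) {
            try {
                pages.add(extractPage.apply(i));
            } catch (RuntimeException e) {
                pages.add("[page " + i + " could not be parsed: " + e.getMessage() + "]");
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stand-in extractor (hypothetical): page 2 fails, the others succeed.
        List<String> pages = extractAll(3, i -> {
            if (i == 2) {
                throw new RuntimeException("Error reading string at file pointer 238291");
            }
            return "text of page " + i;
        });
        pages.forEach(System.out::println);
    }
}
```

In real use the lambda would wrap the library call for one page; if that call declares a checked IOException, it has to be caught and rethrown as an unchecked exception inside the lambda. This only skips the unreadable page, it does not repair the underlying content stream.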
