PDF reading via PDFBox in Java

Problem description

I have encountered a problem while reading a PDF with PDFBox. Part of my PDF is unreadable: when I copy the unreadable part and paste it into an editor, it shows little box symbols, but when I read the same file via PDFBox, those characters are not read at all (and I don't expect them to be read correctly). What I do expect is to get at least some symbols or random characters in place of the actual characters. Is there any way to do that? The line can be selected, so it isn't an image. Has anyone found a workaround for this?

There is a PDFBox example in which the writeString method of the PDFTextStripper class is overridden to get some extra font properties. I am using that method to get my text and some font properties. So my question is: why doesn't PDFBox read every character (even if it then prints gibberish)? In my case, I counted the number of times the method was called (each method call corresponds to one character) and saw that the number of method calls matched the number of characters in the output text but not the total number of characters in the PDF. Here's a sample PDF: the word "Profit" is unreadable, and the extracted text doesn't even contain gibberish for this word, it just skips it altogether. Here's the link: https://drive.google.com/file/d/0B_Ke2amBgdpedUNwVTR3RVlRTFE/view?usp=sharing

Recommended answer

The first file "PnL_500010_0314.pdf"

Indeed, the whole line "Statement of Profit and Loss for the year ended March 31, 2014" and much more cannot be extracted. Inspecting the contents makes the reason obvious: this text is written using a composite font which has neither an Encoding nor a ToUnicode entry that would allow identifying the characters in question.

Shortly before calling processTextPosition (which PDFTextStripper implements and from which it retrieves its text information), the showGlyph method of org.apache.pdfbox.text.PDFTextStreamEngine (from which PDFTextStripper is derived) contains this code:

// use our additional glyph list for Unicode mapping
unicode = font.toUnicode(code, glyphList);

// when there is no Unicode mapping available, Acrobat simply coerces the character code
// into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
// this, which is why we leave it until this point in PDFTextStreamEngine.
if (unicode == null)
{
    if (font instanceof PDSimpleFont)
    {
        char c = (char) code;
        unicode = new String(new char[] { c });
    }
    else
    {
        // Acrobat doesn't seem to coerce composite font's character codes, instead it
        // skips them. See the "allah2.pdf" TestTextStripper file.
        return;
    }
}

The font in question does not offer any clues for text extraction. Thus, unicode here is null.

Furthermore, the font is composite, not simple. Thus, the else clause is executed and processTextPosition is not even called.

PDFTextStripper, therefore, is not informed at all that the line "Statement of Profit and Loss for the year ended March 31, 2014" even exists!
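
The effect on the callback count can be illustrated with a small stand-alone model. This is only a sketch and not PDFBox code: the class GlyphFallbackModel and its methods are hypothetical names, and the font's Unicode lookup is reduced to a value that is either a string or null.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the fallback decision quoted above; not PDFBox code. */
final class GlyphFallbackModel {

    /**
     * Returns the text that would be handed on to the text stripper,
     * or null when the glyph is skipped without any callback.
     *
     * @param mapped     result of the font's Unicode lookup, null if there is none
     * @param simpleFont true for a simple font, false for a composite font
     * @param code       the character code taken from the content stream
     */
    static String resolve(String mapped, boolean simpleFont, int code) {
        if (mapped != null) {
            return mapped;                        // a real mapping always wins
        }
        if (simpleFont) {
            return String.valueOf((char) code);   // coerce the code into a char
        }
        return null;                              // composite font: glyph is dropped
    }

    /** Collects what reaches the stripper for a run of codes that have no mapping. */
    static List<String> deliver(int[] codes, boolean simpleFont) {
        List<String> delivered = new ArrayList<>();
        for (int code : codes) {
            String text = resolve(null, simpleFont, code);
            if (text != null) {
                delivered.add(text);
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        int[] codes = {0x50, 0x72, 0x6F};
        System.out.println("simple font callbacks:    " + deliver(codes, true).size());
        System.out.println("composite font callbacks: " + deliver(codes, false).size());
    }
}
```

In this model the three unmapped codes produce three callbacks for a simple font and none for a composite font, which matches the observation in the question that the number of method calls equals the number of characters in the output but not the number of characters on the page.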

If you replace

    else
    {
        // Acrobat doesn't seem to coerce composite font's character codes, instead it
        // skips them. See the "allah2.pdf" TestTextStripper file.
        return;
    }

in PDFTextStreamEngine.showGlyph by some code setting unicode, e.g. using the Unicode replacement character

    else
    {
        // Use the Unicode replacement character to indicate an unknown character
        unicode = "\uFFFD";
    }

you get

57
THIRTY SEVENTH ANNUAL REPORT 2013-14
STANDALONE FINANCIAL STATEMENTS
�������������������������������������������������������������
As per our report attached. Directors
For Deloitte Haskins & Sells LLP Deepak S. Parekh Nasser Munjee R. S. Tarneja
Chartered Accountants �������� B. S. Mehta J. J. Irani
D. N. Ghosh Bimal Jalan
Keki M. Mistry S. A. Dave D. M. Sukthankar
Sanjiv V. Pilgaonkar ���������������
Partner �����������������������
Renu Sud Karnad V. Srinivasa Rangan Girish V. Koliyote
������, May 6, 2014 Managing Director ������������������ �����������������
Notes Previous Year
� in Crore � in Crore
INCOME
����������������������� 23  23,894.03  20,796.95 
���������������������������� 24  248.98  315.55 
������������ 25  54.66  35.12 
Total Revenue  24,197.67  21,147.62 
EXPENSES
Finance Cost 26  16,029.37  13,890.89 
�������������� 27  279.18  246.19 
���������������������� 28  86.98  75.68 
�������������� 29  230.03  193.43 
������������������������������ 11 & 12  31.87  23.59 
Provision for Contingencies  100.00  145.00 
Total Expenses  16,757.43  14,574.78 

PROFIT BEFORE TAX  7,440.24  6,572.84 
�����������
�������������  1,973.00  1,727.68 
�������������� 14  27.00  (3.18)
PROFIT FOR THE YEAR 3  5,440.24  4,848.34 
EARNINGS PER SHARE��������������� 2) 31
- Basic 34.89 31.84
- Diluted 34.62 31.45
�������������������������������������������������������������

Unfortunately, that PDFTextStreamEngine.showGlyph method uses some private class members. Thus, one cannot simply override it in one's own PDFTextStripper class using the original method code with the change indicated above. One either has to replicate nearly all functionality of PDFTextStreamEngine in one's own class, or one has to resort to Java reflection, or one has to patch the PDFBox classes themselves.
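
For the reflection route, the general pattern looks roughly like the following sketch. It deliberately uses a hypothetical stand-in class (StandInEngine) with a made-up private field instead of the real PDFBox class, so the field name and value are assumptions for illustration only; against the real library you would have to look up the actual member names in the source of the version you use.

```java
import java.lang.reflect.Field;

/** Hypothetical stand-in for a library class that keeps its state private. */
class StandInEngine {
    private final String glyphList = "additional glyph list";
}

final class PrivateFieldReader {

    /**
     * Reads a private field. getDeclaredField only sees fields declared directly
     * on the given class, so a subclass must pass the class that declares the
     * field rather than its own class.
     */
    static Object read(Object target, Class<?> declaringClass, String fieldName) {
        try {
            Field field = declaringClass.getDeclaredField(fieldName);
            field.setAccessible(true);   // lift the private access check
            return field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read field " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        Object value = read(new StandInEngine(), StandInEngine.class, "glyphList");
        System.out.println(value);
    }
}
```

Code like this breaks silently whenever the library renames or removes the member, and access may be restricted when the library is loaded as a named module, which is part of why none of the three options is attractive.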

This architecture is not exactly perfect.

The case of the second file is caused by the same piece of PDFBox code quoted above. As this time, though, the font is simple, the other code block is executed:

    if (font instanceof PDSimpleFont)
    {
        char c = (char) code;
        unicode = new String(new char[] { c });
    }

What happens here is pure guesswork: If there is no information for mapping glyph code to Unicode, let's assume the mapping is Latin-1 which embeds trivially into char. As becomes visible in the OP's second file, this assumption does not always hold.
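
To make the guesswork concrete, here is a small stand-alone sketch (plain Java, no PDFBox involved). The coerce method does the same cast as the block above, which is equivalent to decoding the codes as ISO-8859-1. The subset table is a made-up example of a font that numbers its glyphs 1, 2, 3, ... in order of first use; it is an assumption for illustration and not taken from the OP's file.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Coercion {

    /** Casts every one-byte character code straight to a char. */
    static String coerce(byte[] codes) {
        StringBuilder text = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF;        // character codes are unsigned
            text.append((char) code);
        }
        return text.toString();
    }

    /** Hypothetical font-specific table: codes 1..6 stand for P, r, o, f, i, t. */
    static char[] subsetTable() {
        char[] table = new char[256];
        String glyphs = "Profit";
        for (int i = 0; i < glyphs.length(); i++) {
            table[i + 1] = glyphs.charAt(i);
        }
        return table;
    }

    /** Decodes the codes with a table whose index is the character code. */
    static String decodeWith(char[] table, byte[] codes) {
        StringBuilder text = new StringBuilder();
        for (byte b : codes) {
            text.append(table[b & 0xFF]);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] codes = {1, 2, 3, 4, 5, 6};
        System.out.println("with the font's table: " + decodeWith(subsetTable(), codes));
        // Print the coerced result as code points, since it consists of control characters.
        coerce(codes).chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
        // The cast agrees with a Latin-1 decode of the same bytes.
        System.out.println(coerce(codes).equals(new String(codes, StandardCharsets.ISO_8859_1)));
    }
}
```

When the codes happen to follow a Latin-1 compatible layout, the cast yields the right text by luck; with a table like the hypothetical one here it yields control characters instead of the word on the page.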

If you don't want PDFBox to make assumptions like these here, also replace the if block above by

    if (font instanceof PDSimpleFont)
    {
        // Use the Unicode replacement character to indicate an unknown character
        unicode = "\uFFFD";
    }

This results in

Aries Agro Care Private Limited
1118th Annual Report 2013-14
Balance Sheet as at 31st March, 2014
Particulars Note
No.
 As at 
31 March, 2014
Rupees
 As at
31 March, 2013
Rupees
I. EQUITY AND LIABILITIES
(1) Shareholder's Funds
(a) ������������� 3  100,000  100,000
(b) Reserves and Surplus 4  (2,673,971) ������������
 (2,573,971) ������������
(2) Current Liabilities
(a) Short Term Borrowings 5  5,805,535 �����������
(b) Trade Payables 6  159,400 ���������
(c) ������������������������� 7  2,500  22,743 
 5,967,435  5,934,756 
TOTAL  3,393,464 �����������
II. ASSETS
(1) Non-Current Assets
(a) �������������������� �  - -
 - -
(2) Current Assets
(a) ����������������������� 9  39,605 �������
(b) ����������������������������� 10  3,353,859 ����������
 3,393,464 ����������
TOTAL  3,393,464 ����������
��������������������������������
The Notes to Accounts 1 to 23 form part of these Financial Statements
As per our report of even date For and on behalf of the Board
For Kirti D. Shah & Associates 
��������������������� 
�����������������������������
Dr. Jimmy Mirchandani
Director
Kirti D. Shah 
Proprietor 
Membership No 32371
Dr. Rahul Mirchandani 
Director
Place : Mumbai. 
Date :- 26th May, 2014.
