iText GetTextFromPage exception with inline image

Question

I have the same problem as was discussed here, which was not solved. My objective is to extract the text from an existing pdf file. For a certain pdf, which I cannot share as a sample, I get the error message "Could not find image data or EI". The following code works for other pdfs:

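using System.Diagnostics;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

string fileURI = "C:\\Test\\Sample.pdf";
PdfReader reader = new PdfReader(fileURI);
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
// Extract the text of page 1; this is the call that fails for the problematic pdf
string s = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
Debug.WriteLine(s);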

I am using iTextSharp 5.5.0 and tried changing found == 1 to found <= 1 as suggested in other posts. It does not help.

Would it help to remove all images in the pdf? I really just need the text. Which commands from iText could help me with this?

Recommended answer

I downloaded the trial version of Acrobat to create a version of the pdf file that I could share. After I opened the file and saved it again as "Optimized PDF" in Acrobat, the code worked and I could extract the text.

So the solution to the problem is probably to open each file in Acrobat, save it again with the right settings, using the Acrobat reference, and then extract the text.
