iTextSharp library does not extract text from my file
Problem description
The iTextSharp library (version 5.5.5) does not extract text from my file, even though I can copy and paste text from the PDF into Notepad. I uploaded the file to this link.
The source code is very simple and works for other PDF files, but for this problematic file all I get is a few meaningless characters.
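// Namespaces used: System.IO, iTextSharp.text.pdf, iTextSharp.text.pdf.parser
var text = string.Empty;
// File.OpenRead is a static method, so it is called without "new"
using (var file = File.OpenRead(path))
{
    using (var reader = new PdfReader(file))
    {
        // PDF page numbers are 1-based
        for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
        {
            text += PdfTextExtractor.GetTextFromPage(reader, pageNumber);
        }
    }
}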
Any help is highly appreciated.
Recommended answer
The PDF declarations of the Asian fonts in your sample PDF do not contain a ToUnicode map, which would allow mapping from character codes to Unicode.
Furthermore, their encoding is Identity-H, which is a kind of pseudo-encoding: it merely maps 2-byte character codes ranging from 0 to 65,535 to the same 2-byte CID value. By itself, therefore, it still does not define a fixed encoding usable for text extraction.
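To make the "pseudo-encoding" point concrete, here is a minimal, library-free Java sketch of what Identity-H does to the bytes of a shown string. The class name, method name, and sample bytes are made up for illustration and are not part of iText. Each big-endian byte pair is read as a character code, and the CID is simply that same number, so no Unicode information is gained.

```java
import java.util.ArrayList;
import java.util.List;

public class IdentityHDemo {
    // Identity-H: every two bytes form one big-endian character code,
    // and the resulting CID is numerically identical to that code.
    static List<Integer> toCids(byte[] shown) {
        if (shown.length % 2 != 0) {
            throw new IllegalArgumentException("Identity-H strings have an even number of bytes");
        }
        List<Integer> cids = new ArrayList<>();
        for (int i = 0; i < shown.length; i += 2) {
            int code = ((shown[i] & 0xFF) << 8) | (shown[i + 1] & 0xFF);
            cids.add(code); // CID == character code; this is not a Unicode code point
        }
        return cids;
    }

    public static void main(String[] args) {
        // Arbitrary sample bytes, chosen only for illustration
        byte[] shown = {0x03, 0x4B, 0x12, (byte) 0xA0};
        System.out.println(toCids(shown)); // prints [843, 4768]
    }
}
```

The numbers that come out identify glyphs within the font's character collection. Turning them into text needs an extra table, which is what a ToUnicode map or the font's ROS values would supply.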
Identity-H is, however, only meant to be used with CIDFonts, which may carry any Registry, Ordering, and Supplement (ROS) values. These ROS values convey the actual encoding information from which a mapping to Unicode can be derived. This is the case in your file.
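As a rough illustration of how ROS values can lead to Unicode, the following Java sketch looks up CIDs in a table selected by the font's Registry and Ordering. Everything in it is hypothetical: the "Example-Demo" collection and its three entries are invented placeholders, not real Adobe character collection data, and this is not how iText implements the lookup internally.

```java
import java.util.HashMap;
import java.util.Map;

public class RosLookupDemo {
    // Hypothetical tables: "Registry-Ordering" -> (CID -> Unicode code point).
    // The entries are invented for illustration; real collections are far larger.
    static final Map<String, Map<Integer, Integer>> TABLES = new HashMap<>();
    static {
        Map<Integer, Integer> demo = new HashMap<>();
        demo.put(1, 0x0020); // space
        demo.put(2, 0x4E2D); // U+4E2D
        demo.put(3, 0x6587); // U+6587
        TABLES.put("Example-Demo", demo);
    }

    // Maps each CID to text; unknown collections or CIDs become U+FFFD.
    static String toUnicode(String registryOrdering, int[] cids) {
        Map<Integer, Integer> table = TABLES.get(registryOrdering);
        StringBuilder sb = new StringBuilder();
        for (int cid : cids) {
            Integer codePoint = (table == null) ? null : table.get(cid);
            sb.appendCodePoint(codePoint != null ? codePoint : 0xFFFD);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUnicode("Example-Demo", new int[] {2, 3}));
        // Without a table for the collection, only replacement characters remain
        System.out.println(toUnicode("Unknown-Collection", new int[] {2, 3}));
    }
}
```

The second call mirrors the situation in the question: when no table for the font's collection is available, the CIDs cannot be turned into meaningful text.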
To make use of these ROS values during text extraction, iText needs a set of resource files defining the mappings for the different predefined ROS values. As these files are quite large, they are not part of the standard iText main distribution jar/dll and have to be added to the class path as a separate jar/dll file.
I tested this only with the Java version of iText, as I am more proficient with it.
The Maven coordinates for the 5.x version of this jar artifact:
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itext-asian</artifactId>
    <version>5.2.0</version>
</dependency>
(As nothing has changed in these resources in recent years, there have been no 5.x releases of this artifact since 5.2.0.)
After I added that jar to the classpath here, I could successfully extract Asian characters from your PDF. Whether they are 100% correct I cannot say, as I cannot read them.
There should be a similar iTextSharp DLL with Asian font resources. (I found the iText 7 variant of it, but I am not sure whether that works with a 5.x iTextSharp.)
Googling around, one finds a number of iTextAsian-*, iTextAsianCmaps-*, and iTextAsian-all-* files. I don't know, though, which of them work with the current iTextSharp 5.5.12.
As the OP found out, one additionally has to register the DLLs for iTextSharp (in contrast to iText / Java):
Here is how to notify iTextSharp that the Asian DLLs are in the project. You need to add a static constructor to your text extraction class:
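// Static constructor of the text extraction class (named PdfDocument here)
static PdfDocument()
{
    // Tell iTextSharp where to look for the Asian font and CMap resources
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsian.dll");
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsianCmaps.dll");
}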