iTextSharp library does not extract text from my file


Problem description

The iTextSharp library (version 5.5.5) does not extract text from my file. I can copy and paste text from the PDF into Notepad. I uploaded the file to this link.

The source code is very simple and it works for other PDF files, but for this problematic file all I get is a string of meaningless characters.

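// Requires: using System.IO; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
var text = string.Empty;
// File.OpenRead is a static method, so it is called without "new".
using (var file = File.OpenRead(path))
{
    using (var reader = new PdfReader(file))
    {
        // PDF page numbers are 1-based.
        for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
        {
            text += PdfTextExtractor.GetTextFromPage(reader, pageNumber);
        }
    }
}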

Any help is highly appreciated.

Recommended answer

The PDF declarations of the Asian fonts in your sample PDF do not contain a ToUnicode map, which would allow mapping from character codes to Unicode.

Furthermore, their encoding is Identity-H, which is something of a pseudo-encoding: it merely maps 2-byte character codes in the range 0 to 65,535 to the identical 2-byte CID value, so it still does not define a fixed encoding usable for text extraction.
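To illustrate why Identity-H alone is not enough, here is a minimal Java sketch of what the identity mapping amounts to. It does not use iText. The class name, the method name, and the sample bytes are all made up for illustration. The result is only a list of CID numbers, not text.

```java
import java.util.ArrayList;
import java.util.List;

class IdentityHDemo {
    // Splits a string's raw bytes into 2-byte big-endian character codes.
    // Under Identity-H every code is used unchanged as the CID.
    static List<Integer> decodeIdentityH(byte[] raw) {
        if (raw.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of bytes");
        }
        List<Integer> cids = new ArrayList<>();
        for (int i = 0; i < raw.length; i += 2) {
            int code = ((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF);
            cids.add(code); // identity mapping: CID == character code
        }
        return cids;
    }

    public static void main(String[] args) {
        // Hypothetical bytes, not taken from the sample PDF.
        byte[] raw = {0x03, (byte) 0xE8, 0x00, 0x41};
        System.out.println(decodeIdentityH(raw)); // prints [1000, 65]
    }
}
```

Turning such CIDs into characters requires an additional table, which is what a ToUnicode map would normally provide.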

Identity-H may only be used with CIDFonts, which can carry any Registry, Ordering, and Supplement (ROS) values, and these ROS values convey the actual encoding information from which a mapping to Unicode can be derived. This is the case in your file.

To make use of these ROS values during text extraction, iText needs a set of resource files defining the mappings for the different predefined ROS values. As these files are quite large, they are not part of the standard iText main distribution jar/dll but have to be added to the class path as a separate jar/dll file.
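Conceptually, such resource files act as lookup tables keyed by the font's ROS. The following Java sketch shows the idea only. It is not iText's implementation. The class name, the "Example-Demo1" collection, and its two entries are invented for illustration and are not real character-collection data. The sketch also ignores the Supplement number.

```java
import java.util.HashMap;
import java.util.Map;

class RosLookupDemo {
    // Hypothetical stand-in for the resource files:
    // "Registry-Ordering" -> (CID -> Unicode code point).
    static final Map<String, Map<Integer, Integer>> CID_TO_UNICODE = new HashMap<>();

    static {
        Map<Integer, Integer> demo = new HashMap<>();
        demo.put(1, 0x4E2D); // invented entry, not real data
        demo.put(2, 0x6587); // invented entry, not real data
        CID_TO_UNICODE.put("Example-Demo1", demo);
    }

    // Maps CIDs to text; anything without a known mapping becomes U+FFFD.
    static String toText(String registry, String ordering, int[] cids) {
        Map<Integer, Integer> table = CID_TO_UNICODE.get(registry + "-" + ordering);
        StringBuilder sb = new StringBuilder();
        for (int cid : cids) {
            Integer codePoint = (table == null) ? null : table.get(cid);
            sb.appendCodePoint(codePoint != null ? codePoint : 0xFFFD);
        }
        return sb.toString();
    }
}
```

When no table is available for a font's ROS, every CID falls through to the replacement character. This is analogous to the meaningless output obtained when the Asian resource files are missing.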

I have tested this only with the Java version of iText, as I am more proficient with it.

The Maven coordinates for the 5.x version of this jar artifact:

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itext-asian</artifactId>
    <version>5.2.0</version>
</dependency>

(As nothing has changed in these resources in recent years, there have been no 5.x releases of this artifact since 5.2.0.)

After I added that jar to the classpath here, I could successfully extract Asian characters from your PDF. Whether they are 100% correct I cannot say, as I cannot read them.

There should be a similar iTextSharp DLL with Asian font resources. (I found the iText 7 variant of it, but I am not sure whether that works with a 5.x iTextSharp.)

Googling around, one finds a number of iTextAsian-*, iTextAsianCmaps-*, and iTextAsian-all-* files. I don't know, though, which of them work with the current iTextSharp 5.5.12.

As the OP found out, one additionally has to register the DLLs with iTextSharp (in contrast to iText / Java):


Here is how to notify iTextSharp that the Asian DLLs are in the project. You need to add a static constructor to your text extraction class:

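// A static constructor runs once, before the class is first used.
// Its name must match the enclosing class (here the OP's class is PdfDocument).
static PdfDocument()
{
    // Register the Asian font resource DLLs so iTextSharp can find the mappings.
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsian.dll");
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsianCmaps.dll");
}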

