iTextSharp library does not extract text from my file

Views: 132 Published: 2018/11/16 16:37:01 Tags: c#, itext

Problem description

The iTextSharp library (version 5.5.5) does not extract text from my file, although I can copy text from the PDF and paste it into Notepad. I uploaded the file to this link.

The source code is very simple and works for other PDF files, but for this problematic file all I get is a string of meaningless characters.

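// Requires: using System.IO; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
var text = string.Empty;
// File.OpenRead is a static method, so it is called without "new".
using (var file = File.OpenRead(path))
{
    using (var reader = new PdfReader(file))
    {
        // PDF page numbers are 1-based.
        for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
        {
            text += PdfTextExtractor.GetTextFromPage(reader, pageNumber);
        }
    }
}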

Any help is highly appreciated.

Recommended answer

The PDF declarations of the Asian fonts in your sample PDF do not contain a ToUnicode map, which would allow mapping from character codes to Unicode.

Furthermore, their encoding is Identity-H, which is a kind of pseudo-encoding: it merely maps 2-byte character codes ranging from 0 to 65,535 to the same 2-byte CID value, so it still doesn't define a fixed encoding usable for text extraction.
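To illustrate why Identity-H alone does not help text extraction, here is a minimal Java sketch (Java because the answer was tested with the Java version of iText). It is not iText code: the class name `IdentityHDemo`, the method `toCids`, and the sample bytes are made up for this illustration. It only shows that decoding under Identity-H yields CIDs, not Unicode characters.

```java
import java.util.ArrayList;
import java.util.List;

public class IdentityHDemo {

    /**
     * Decodes a raw PDF string under Identity-H: every big-endian 2-byte
     * character code is used unchanged as the CID.
     */
    public static List<Integer> toCids(byte[] raw) {
        if (raw.length % 2 != 0) {
            throw new IllegalArgumentException("Identity-H strings consist of 2-byte codes");
        }
        List<Integer> cids = new ArrayList<>();
        for (int i = 0; i < raw.length; i += 2) {
            int code = ((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF);
            cids.add(code); // identity mapping: CID == character code
        }
        return cids;
    }

    public static void main(String[] args) {
        // Hypothetical bytes, not taken from the PDF in the question.
        byte[] raw = {0x04, (byte) 0xD2, 0x00, 0x2A};
        // Prints CIDs only; which Unicode characters they stand for is still unknown.
        System.out.println(toCids(raw));
    }
}
```

The resulting numbers are indexes into the font's character collection. Turning them into text needs an additional source of information, such as a ToUnicode map or a mapping defined for the font's character collection.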

Identity-H may actually only be used with CIDFonts using any Registry, Ordering, and Supplement (ROS) values, and these ROS values convey the actual encoding information from which a mapping to Unicode can be derived. This is the case in your file.

To make use of these ROS values during text extraction, iText needs a set of resource files defining the mappings for the different predefined ROS values. As these files are quite large, they are not part of the standard iText main distribution jar/dll but have to be added to the class path as a separate jar/dll file.
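If extraction still returns garbage after adding the extra jar, it can help to check whether the resource files are actually visible on the class path. The following is a small diagnostic sketch using only the Java standard library. The class name `AsianResourceCheck` is made up, and the resource name in `ASSUMED_RESOURCE` is an assumption about the jar layout; verify it against the contents of the jar you actually use.

```java
import java.net.URL;

public class AsianResourceCheck {

    // Assumed location of one of the CJK resource files inside the jar.
    // This path is NOT confirmed by the answer; check your jar's contents.
    static final String ASSUMED_RESOURCE =
            "com/itextpdf/text/pdf/fonts/cmaps/cjk_registry.properties";

    /** Returns true if the named resource can be found on the class path. */
    public static boolean isOnClasspath(String resourceName) {
        ClassLoader loader = AsianResourceCheck.class.getClassLoader();
        URL url = (loader != null)
                ? loader.getResource(resourceName)
                : ClassLoader.getSystemResource(resourceName);
        return url != null;
    }

    public static void main(String[] args) {
        System.out.println("CJK resources found: " + isOnClasspath(ASSUMED_RESOURCE));
    }
}
```

A result of `false` for a resource name you have confirmed exists in the jar suggests that the jar is not on the class path of the running application.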

I tested this only with the Java version of iText, as I am more proficient with it.

The Maven coordinates for the 5.x version of this jar artifact:

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itext-asian</artifactId>
    <version>5.2.0</version>
</dependency>

(As nothing has changed in these resources in recent years, there have been no 5.x releases of this artifact since 5.2.0.)

After I added that jar to the classpath here, I could successfully extract Asian characters from your PDF. Whether they are 100% correct I cannot say, as I cannot read them.

There should be a similar iTextSharp DLL with Asian font resources. (I found the iText 7 variant thereof, but I am not sure that it works with a 5.x iTextSharp.)

Googling around, one finds a number of iTextAsian-*, iTextAsianCmaps-*, and iTextAsian-all-* files. I don't know, though, which of them work with the current iTextSharp 5.5.12.

As the OP found out, with iTextSharp (in contrast to iText / Java) one additionally has to register the DLLs:

Here is how to notify iTextSharp that the Asian DLLs are in the project. You need to add a static constructor to your text extraction class:

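// Static constructor of the text extraction class (named PdfDocument here).
static PdfDocument()
{
    // Register the Asian font resource DLLs with iTextSharp's resource search.
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsian.dll");
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsianCmaps.dll");
}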
