PDF Copy Text Issue: Weird Characters


Question

I tried to copy text from a PDF file but got some weird characters. Strangely, Okular can recognize the text, but Sumatra PDF and Adobe cannot; all three applications are installed on Windows 10 64-bit. To better explain my issue, here is a video: https://streamable.com/sw1hc. The "text layer workaround file" is one solution I got. Any help is greatly appreciated. Regards

Solution

In short: The (original) PDF does not contain the information required for regular text extraction as described in the PDF specification. Depending on the exact nature of your task, you might try to add the required information to the existing text objects and fonts or you might go for OCR.

Mapping character codes to Unicode as described in the PDF specification

The PDF specification ISO 32000-1 (and likewise ISO 32000-2) describes an algorithm for mapping character codes to Unicode values using information available directly inside the PDF.

It has been quoted very often in other Stack Overflow answers, so I won't quote it here again.

Essentially, this is the algorithm used by Adobe Acrobat during copy and paste, and also by many other text extractors.

In PDFs which don't contain the information required for text extraction, you eventually get to this point in the algorithm:

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
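
To make that dead end concrete, here is a minimal Python sketch of the lookup order. It is a simplified model and not a PDF parser: fonts are plain dictionaries, the names `code_to_unicode`, `AGL_EXCERPT`, `good_font`, `named_font` and `bare_font` are made up for illustration, the glyph list is a three-entry stand-in for the Adobe Glyph List, and the branch for composite fonts with predefined CMaps is left out.

```python
from typing import Optional

# Tiny stand-in for the Adobe Glyph List (the real list has thousands of entries).
AGL_EXCERPT = {"A": "A", "space": " ", "fi": "\ufb01"}


def code_to_unicode(code: int, font: dict) -> Optional[str]:
    """Return the Unicode text for a character code, or None if the font data does not say."""
    # Step 1: a ToUnicode CMap in the font dictionary is consulted first.
    # (Falling through when the CMap lacks the code is a choice of this sketch.)
    to_unicode = font.get("ToUnicode")
    if to_unicode and code in to_unicode:
        return to_unicode[code]

    # Step 2: simple font with a usable encoding: code -> glyph name -> Unicode.
    encoding = font.get("Encoding")
    if encoding:
        name = encoding.get(code)
        if name in AGL_EXCERPT:
            return AGL_EXCERPT[name]

    # Step 3 (composite fonts with predefined CMaps) is omitted in this sketch.
    # Nothing worked: this is the point where the specification gives up.
    return None


# Hypothetical fonts, modelled as dictionaries.
good_font = {"ToUnicode": {0x24: "A"}}               # carries a ToUnicode map
named_font = {"Encoding": {0x41: "A", 0x20: "space"}}  # carries standard glyph names
bare_font = {}                                        # carries nothing usable

for font in (good_font, named_font, bare_font):
    print(code_to_unicode(0x24, font), code_to_unicode(0x20, font))
```

For `bare_font` every lookup returns `None`, which corresponds to the situation in the quoted sentence: whatever a viewer pastes for such a code is its own guess.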

What happens if the algorithm above fails to produce a Unicode value

This is where text extraction implementations differ: they try to determine the matching Unicode value by using heuristics, information from beyond the PDF, or OCR applied to the glyph in question.

That the different programs you tried returned such different results shows that

  1. your PDF does not contain the information required for the algorithm above from the PDF specification and

  2. the heuristics used by those programs differ significantly, and Okular's heuristics work best for your document.
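
As an illustration of what such a heuristic can look like, the following Python sketch guesses a Unicode value from a glyph name that follows the "uniXXXX" / "uXXXXXX" naming convention of the Adobe Glyph List Specification. This is only one plausible heuristic; whether Okular, Sumatra PDF or Acrobat use it for this document is an assumption, and the function name `guess_from_glyph_name` is made up for the example.

```python
import re
from typing import Optional

_UNI = re.compile(r"^uni((?:[0-9A-F]{4})+)$")  # e.g. uni0041 or uni00660069
_U = re.compile(r"^u([0-9A-F]{4,6})$")         # e.g. u1F600


def _is_scalar(cp: int) -> bool:
    """True for code points that are valid Unicode scalar values (no surrogates)."""
    return cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


def guess_from_glyph_name(name: str) -> Optional[str]:
    """Guess the text for a glyph name such as 'uni0041' or 'u1F600.alt'."""
    base = name.split(".", 1)[0]  # drop suffixes such as ".alt" or ".sc"

    match = _UNI.match(base)
    if match:
        digits = match.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        if all(_is_scalar(cp) for cp in cps):
            return "".join(chr(cp) for cp in cps)
        return None

    match = _U.match(base)
    if match:
        cp = int(match.group(1), 16)
        if _is_scalar(cp):
            return chr(cp)

    # Names like "g123" or ".notdef" carry no usable hint.
    return None


for glyph in ("uni0041", "uni00660069", "u1F600.alt", "g123", ".notdef"):
    print(glyph, repr(guess_from_glyph_name(glyph)))
```

A viewer that applies a fallback like this recovers text from fonts whose glyphs happen to be named this way, while a viewer without it pastes garbage. That kind of difference is enough to explain why programs disagree on the same file.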

What to do in such a case

There are multiple options, more or less feasible depending on your concrete case:

  1. Ask the source of the PDF for a version that contains proper information for text extraction.

    Unless you have a contract with that source that requires them to supply the PDFs in a machine-readable form, or the source is otherwise obligated to do so, they will usually decline, though...

  2. Apply OCR to the PDF in question.

    Depending on the quality of the OCR software and the glyphs in the PDF, the results can be of a questionable quality; e.g. in your "PDF copy text issue-Text layer workaround.pdf" the header "Chapter 1: Derivative Securities" has been recognized as "Chapter1: Deratve Securites"...

  3. You can try to interactively add manually created ToUnicode maps to the PDF, e.g. as described by Tilman Hausherr in his answer to "how to add unicode in truetype0font on pdfbox 2.0.0".

    Depending on the number of different fonts you have to create the mappings for, this approach might easily require way too much time and effort...
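
For option 3, the mapping itself is a small text stream. The Python sketch below builds such a ToUnicode CMap from a hand-made table of character codes. The function name `build_to_unicode_cmap` and the two sample codes are made up; the real codes have to be read off the content stream and glyphs of each font. Attaching the result to the font dictionary needs a PDF library (the referenced answer uses PDFBox) and is not shown here.

```python
from typing import Dict


def build_to_unicode_cmap(mapping: Dict[int, str], code_bytes: int = 2) -> str:
    """Build the text of a ToUnicode CMap from {character code: Unicode text}."""

    def hex_code(code: int) -> str:
        # Character codes are written as fixed-width hex strings.
        return f"<{code:0{code_bytes * 2}X}>"

    def hex_text(text: str) -> str:
        # Target values are written as UTF-16BE hex strings.
        return "<" + text.encode("utf-16-be").hex().upper() + ">"

    items = sorted(mapping.items())
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        f"{hex_code(0)} {hex_code(256 ** code_bytes - 1)}",
        "endcodespacerange",
    ]
    # A single bfchar block may hold at most 100 entries, so split into chunks.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        lines.extend(f"{hex_code(code)} {hex_text(text)}" for code, text in chunk)
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical codes: suppose glyph code 0x0003 draws a space and 0x0024 draws "A".
cmap = build_to_unicode_cmap({0x0024: "A", 0x0003: " "})
print(cmap)
```

One such table is needed per font, which is where the time and effort mentioned above come from: every distinct glyph code in every affected font has to be identified by eye.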
