Why is it so hard to convert PDF to plain text?

Question

I needed to convert some PDFs back to text. I tried many software packages and online tools, and the results were always mediocre.

Technically speaking, why is this so difficult?

Recommended answer

Let's assume you are not talking about PDFs which merely wrap some bitmap image, because it should be clear that in that case you can only resort to OCR, with all its restrictions.

Let's instead assume that text is drawn in the PDF at hand.

What is drawn on a PDF page is determined by a sequence of instructions in the content stream of that page. "Text is drawn" on a page means that among those instructions there are some that set the font for the instructions to come, some that set the text position and direction for the instructions to come, and some that actually draw text given as string arguments.
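To make the three kinds of instructions concrete, here is a minimal Python sketch that walks a toy, uncompressed content stream. The operators `BT`/`ET`, `Tf` (set font and size), `Td` (move the text position) and `Tj` (show a string) are real PDF operators; the sample stream, the function name `text_events` and the simplified tokenizer (no nested or escaped parentheses, no hex strings, no `TJ` arrays, no matrices) are assumptions made only for illustration.

```python
import re

# Toy content stream; real streams are usually compressed and far messier.
SAMPLE_STREAM = b"BT /F1 12 Tf 72 700 Td (Hello) Tj 40 0 Td (World) Tj ET"

# Either a literal string "(...)" without nesting or escapes, or a bare token.
TOKEN = re.compile(rb"\([^()\\]*\)|[^\s()]+")

OPERATORS = {b"BT", b"ET", b"Tf", b"Td", b"Tj"}

def text_events(stream):
    """Yield (font, x, y, raw_bytes) for each Tj, tracking Tf and Td state."""
    operands = []
    font = None
    x = y = 0.0
    for tok in TOKEN.findall(stream):
        if tok not in OPERATORS:
            operands.append(tok)          # collect operands until an operator
            continue
        if tok == b"BT":
            x = y = 0.0                   # a text object starts at the origin
        elif tok == b"Tf":
            font = operands[-2].decode("ascii")   # "/F1 12 Tf" -> "/F1"
        elif tok == b"Td":
            x += float(operands[-2])      # Td offsets are relative to the
            y += float(operands[-1])      # start of the current line
        elif tok == b"Tj":
            yield font, x, y, operands[-1][1:-1]  # strip the parentheses
        operands = []

events = list(text_events(SAMPLE_STREAM))
```

Note that the strings come out as raw bytes together with the font that was current when they were drawn; what those bytes mean is a separate question.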

Text extraction is the task of taking the sequence of instructions from a content stream and, instead of drawing the text as indicated by the font and position setting instructions, exporting it in a sensible order using a standard encoding, usually the encoding of the character type of the programming language or platform in use.

The first problem is to understand the encoding of the string arguments of those text drawing instructions:

  • each font can have its own encoding; to extract the text one cannot simply ignore everything but the instructions drawing text and concatenate their string contents; you always have to take the current font into account (some extremely simple text extractors ignore this and therefore quite often fail to return something sensible);

  • there are a large number of predefined encodings, some resembling encodings you know, e.g. WinAnsiEncoding, and many you likely don't know, e.g. Add-RKSJ-H; these encodings may use a constant number of bytes per glyph or they may be mixed-multibyte; so a text extractor must support a great many encodings to start with;

  • encodings may also be completely ad hoc and arbitrary; in particular, in the case of embedded subset fonts one often sees ad hoc encodings generated by dealing out character codes from some starting value whenever one is needed; i.e. the first glyph of a given font used on a page is given the starting value as its code, the next, different glyph is given the starting value plus one, the next, different one the starting value plus two, etc.; "Hello World" and a starting value of 48 (the ASCII value of '0') would result in "01223453627"; these fonts may contain a mapping to Unicode but they are not required to.
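The ad hoc scheme from the last item can be reproduced in a few lines of Python. This is only a sketch of the idea, not how any particular PDF producer works; `adhoc_encode` and `decode_with_map` are hypothetical helper names, and the dictionary stands in for a ToUnicode map that a real font may or may not carry.

```python
def adhoc_encode(text, start=48):
    """Deal out codes in order of first use, beginning at `start`."""
    codes = {}                      # character -> assigned code
    out = []
    for ch in text:
        if ch not in codes:
            codes[ch] = start + len(codes)
        out.append(codes[ch])
    return bytes(out), codes

def decode_with_map(data, to_unicode):
    """Decode using a code -> character map; unknown codes become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in data)

encoded, codes = adhoc_encode("Hello World")
naive = encoded.decode("latin-1")              # what an encoding-blind extractor sees
to_unicode = {code: ch for ch, code in codes.items()}
recovered = decode_with_map(encoded, to_unicode)
```

Here `naive` is the string "01223453627" from the example above, while `recovered` is the original text. Without the map there is nothing in the bytes themselves that reveals the mapping.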

The next problem is to make sense of the order of the strings:

  • the string drawing instructions may occur in an arbitrary order, e.g. "Hello" might be drawn "lo" first, then, after moving back, "el", then, after moving back again, "H"; to extract the text one cannot ignore text positioning instructions and simply concatenate text strings; you always have to take the current position into account (some simple text extractors ignore this and can therefore fail to return something sensible);

  • multi-column text may present a difficulty; text may be drawn line by line, e.g. first the text of the top line of the first column, then the top line of the second column, then the second line of the first column, then the second line of the second column, etc.; there need not be any hints in the PDF that the text is multi-columnar.
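A position-aware extractor therefore has to reorder what it collected. The following Python sketch assumes each fragment has already been reduced to a hypothetical `(x, y, text)` tuple in page coordinates (y growing upwards, as in PDF); `reading_order` and its `line_tolerance` parameter are illustrative names, not part of any library.

```python
def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by y, then sort each line left to right."""
    lines = []                                   # list of (y, [fragments])
    for frag in sorted(fragments, key=lambda f: -f[1]):   # top of page first
        if lines and abs(lines[-1][0] - frag[1]) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag[1], [frag]))
    return [
        "".join(text for _, _, text in sorted(frags, key=lambda f: f[0]))
        for _, frags in lines
    ]

# "Hello" drawn back to front, as in the first item above.
scrambled = [(18, 700, "lo"), (6, 700, "el"), (0, 700, "H")]
fixed = reading_order(scrambled)

# Two columns: the same strategy glues both columns into each line.
columns = [(0, 700, "A1"), (300, 700, "B1"), (0, 686, "A2"), (300, 686, "B2")]
interleaved = reading_order(columns)
```

The first case is repaired, but the second shows the limit of plain coordinate sorting: the result is "A1B1" followed by "A2B2", so detecting columns needs additional layout analysis on top.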

Another problem is to recognize formatting or styling artifacts:

  • spaces between words need not be created by drawing a space glyph; they may also be produced by text position changing instructions; text extractors that do not try to recognize gaps created by text positioning instructions may return a result without spaces; on the other hand, the same technique can be used to draw adjacent glyphs at an optimal distance, i.e. kerning; text extractors that do try to recognize gaps created by text positioning instructions may falsely return spaces where there should be none;

  • sometimes selected words are printed s p a c e d   o u t for extra emphasis; in the extracted text these gaps might be presented as space characters, which automatic postprocessing of the text may see as word separators;

  • usually one uses a different, bold font program for bold text; if that is not at hand, people sometimes get creative and emulate bold by printing the same text twice with a minute offset; with a slightly larger offset (or a different transformation) and a different color, a shadow effect can be emulated; if the text extractor does not try to recognize this, you end up with duplicate characters in the output.
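Extractors typically handle these artifacts with heuristics. The Python sketch below works on one line of glyphs given as hypothetical `(x, width, char)` tuples; `join_glyphs`, `space_ratio` and `dup_tolerance` are made-up names and the thresholds are arbitrary assumptions, so treat it as an illustration of the trade-off rather than a recommended setting.

```python
def join_glyphs(glyphs, space_ratio=0.3, dup_tolerance=0.5):
    """Rebuild a line of text from positioned glyphs.

    Inserts a space where the horizontal gap is large relative to the
    previous glyph's width, and drops a glyph that repeats the previous
    one at almost the same position (emulated bold or shadow).
    """
    out = []
    prev = None
    for x, width, ch in sorted(glyphs, key=lambda g: g[0]):
        if prev is not None:
            px, pw, pch = prev
            if ch == pch and abs(x - px) <= dup_tolerance:
                continue                      # same glyph drawn twice
            if x - (px + pw) > space_ratio * pw:
                out.append(" ")               # gap wide enough to be a space
        out.append(ch)
        prev = (x, width, ch)
    return "".join(out)

# "Hi yo" in emulated bold: "H" and "i" are each drawn twice, 0.2 units apart.
fake_bold = [(0, 6, "H"), (0.2, 6, "H"), (6, 3, "i"), (6.2, 3, "i"),
             (12, 6, "y"), (18, 6, "o")]
line = join_glyphs(fake_bold)

# Letter-spaced emphasis: every gap exceeds the threshold.
spaced = join_glyphs([(0, 5, "o"), (10, 5, "u"), (20, 5, "t")])
```

With these thresholds `line` comes out as "Hi yo", but `spaced` comes out as "o u t": a threshold that finds real word gaps will also split letter-spaced words, and one that tolerates letter spacing will miss real gaps.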

More problems arise due to incomplete or wrong extra information:

  • ToUnicode maps of fonts (optional maps from character code to Unicode) may be incomplete or contain errors; there are, for example, many questions here on Stack Overflow dealing with incorrect ToUnicode maps for Indic scripts; the text extraction results reflect these errors;

  • there are even PDFs with contradictory information, e.g. with an error in the ToUnicode map but the correct information in an ActualText entry; some PDF creators use this to allow correct copy and paste from some programs (those preferring an ActualText entry in such a situation) while injecting errors into the output of other programs (those preferring the ToUnicode information).

Yet another problem arises if you expect the text extractor to extract only text that is eventually visible on the page:

  • text may be drawn outside the current clipping area or outside the visible page area; text extractors need to keep these areas in mind;

  • text may be drawn using the rendering mode "invisible"; text extractors have to keep an eye on the rendering mode;

  • text may be drawn using the same color as the background; to recognize this, a text extractor cannot look only at the current instruction and a few graphics state details; it has to take into account anything drawn beforehand at the location of the text;

  • text may be drawn as a clip path; to recognize whether this text is visible in the end, a text extractor must keep track of what is drawn in the text area as long as the clip path is active;

  • text may be covered by something else later; a text extractor must drop recognized text in such a case; but depending on blend modes and transparency settings, these coverings might or might not allow the text to shine through; thus, for a correct result the text extractor must, for each glyph, keep track of the color it is drawn with, the color of the backdrop, and what all those spiffy effects do with those colors later on; and of course, both glyph color and backdrop color need not be plain, e.g. they may be shadings; and the color spaces involved may differ, requiring one to convert back and forth between color spaces; and so on.
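Only the two simplest of these checks are cheap to implement. The Python sketch below filters out fragments drawn in text rendering mode 3 (the PDF mode that neither fills nor strokes, i.e. invisible text) and fragments whose origin lies outside the page box; the `Fragment` class, `visible_fragments` and the page box tuple are hypothetical names for illustration, and the check uses only a fragment's origin rather than its full extent.

```python
from dataclasses import dataclass

INVISIBLE = 3   # PDF text rendering mode 3: neither fill nor stroke

@dataclass
class Fragment:
    text: str
    x: float
    y: float
    render_mode: int = 0    # value set by the Tr operator; 0 means fill

def visible_fragments(fragments, page_box):
    """Keep fragments that are not invisible and start inside the page box."""
    x0, y0, x1, y1 = page_box
    return [
        f for f in fragments
        if f.render_mode != INVISIBLE
        and x0 <= f.x <= x1
        and y0 <= f.y <= y1
    ]

page = (0, 0, 612, 792)     # US Letter in points, as an example
frags = [
    Fragment("Hello", 72, 700),
    Fragment("hidden", 72, 680, render_mode=INVISIBLE),
    Fragment("off-page", -100, 700),
]
kept = [f.text for f in visible_fragments(frags, page)]
```

Everything else in the list (clip paths, same-color text, later coverings, blend modes) requires tracking what else is painted on the page, which is essentially the work of a renderer.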

Furthermore, text may be drawn where text extractors usually don't look:

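  • some tools hide text from text extraction by putting the text into a pattern and filling an area of the page with that pattern;

  • Type 3 fonts are similar; each glyph in a Type 3 font is represented by its own content stream; thus, a tool can draw all the text in the content stream of a single Type 3 font glyph and then draw that glyph on the page.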

...

By now you have surely gotten an idea of why text extraction results can be less than optimal. And rest assured, the list above is not complete; there are still more complications for text extraction.
