Copy+pasting text from PDF results in garbage

Problem description

I am writing a Master's thesis on an NLP system. One of its components is an extractor.

It extracts plain text from PDF files. A few PDF files cannot be extracted correctly. For those, the extractor (PDFBox library) returns strings like these:

"┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h"

"10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"

I checked each file that causes this extraction problem, and the text of all these files cannot be copy-pasted from a PDF reader either (Adobe Reader and Foxit Reader). The files display correctly in these readers, but after selecting their content and copying it to the clipboard I get the same wrong text (as described above: strings of semantically meaningless characters, or strings of digits and letters).
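Files like these can be flagged automatically before their text reaches the rest of an NLP pipeline. The following is a minimal sketch of such a check and is not part of PDFBox. The function name `looks_garbled`, both rules, and the 0.5 digit-ratio threshold are assumptions chosen for illustration, based only on the two sample strings above.

```python
def looks_garbled(text, max_digit_ratio=0.5):
    """Heuristically guess whether extracted text is garbage.

    Hypothetical helper: both rules and the default threshold are
    assumptions, not an established method.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    # Rule 1: box-drawing characters (U+2500..U+257F) rarely occur in prose.
    if any("\u2500" <= c <= "\u257f" for c in chars):
        return True
    # Rule 2: ordinary prose is rarely dominated by digits.
    digits = sum(c.isdigit() for c in chars)
    return digits / len(chars) > max_digit_ratio
```

A check like this only catches the two symptoms shown here. Garbage made of plausible-looking letters would pass it, so it may be worth combining with a dictionary or language-model test.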

Can anybody help me?

Recommended answer

Very often in such cases, where you cannot select and copy'n'paste text from the Acrobat (Reader) window, there is another option that may still work:

  • Open 'File' menu,
  • select 'Save as...',
  • select 'Text (normal) (*.txt)',
  • browse to the target directory,
  • type the name you want to use for the text file.

You'll get all text from all pages in the file and need to locate the spot you initially wanted to copy'n'paste, so it is not as comfortable as direct copy'n'paste. But it works more reliably.

It also works with acroread on Linux (but you have to choose 'Save as text...' from the file menu).

You can use the pdffonts command line utility to get a quick-shot analysis of the fonts used by a PDF.

Here is an example output, which demonstrates where a problem for text extraction will very likely occur. It uses one of the hand-coded PDF files from a GitHub repository that was created to provide PDF sample files which are well commented and can easily be opened in a text editor:

$ pdffonts  textextract-bad2.pdf
  name                            type         encoding    emb sub uni object ID
  ------------------------------- ------------ ----------- --- --- --- ---------
  BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
  CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0

How to interpret this table?

  • The above PDF file uses two subsetted fonts (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column), Helvetica and Helvetica-Bold.
  • Both fonts are of type TrueType.
  • Both fonts use a WinAnsi encoding (a font encoding maps char identifiers used in the PDF source code to glyphs that should be drawn). However, only for font /Helvetica is there a /ToUnicode table available inside the PDF (for /Helvetica-Bold there is none), as indicated by the yes/no in the uni column.
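The same reading of the table can be automated when many files have to be checked. The following is a minimal sketch that parses output shaped like the sample above and lists the fonts whose uni column says no. The function names are hypothetical, and the parser assumes that the last six whitespace-separated fields of each row are encoding, emb, sub, uni and the two object ID numbers, and that font names and encoding names contain no spaces. Other pdffonts versions may print additional or different columns.

```python
import subprocess


def run_pdffonts(pdf_path):
    """Run the pdffonts command line tool and return its text output."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_pdffonts(output):
    """Parse pdffonts-style output into a list of dicts (assumed layout)."""
    lines = output.splitlines()
    body = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        # The separator row consists only of dashes and spaces.
        if stripped and set(stripped) <= {"-", " "}:
            body = lines[i + 1:]
            break
    fonts = []
    for line in body:
        # Split from the right so that a type such as "Type 1" stays intact.
        parts = line.rsplit(None, 6)
        if len(parts) != 7:
            continue
        head, encoding, emb, sub, uni, obj_num, obj_gen = parts
        name, _, font_type = head.strip().partition(" ")
        fonts.append({
            "name": name,
            "type": font_type.strip(),
            "encoding": encoding,
            "emb": emb,
            "sub": sub,
            "uni": uni,
            "object_id": (int(obj_num), int(obj_gen)),
        })
    return fonts


def fonts_without_tounicode(output):
    """Return the names of fonts whose 'uni' column is 'no'."""
    return [f["name"] for f in parse_pdffonts(output) if f["uni"] == "no"]
```

Calling `fonts_without_tounicode(run_pdffonts("textextract-bad2.pdf"))` would then name the fonts whose text is most likely to come out as garbage.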

The /ToUnicode table is required to provide a reverse mapping from character identifiers/codes to characters.
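To make the role of this reverse mapping concrete, here is a toy model in Python. It is only a sketch under stated assumptions: it pretends that a font assigns its character codes arbitrarily in order of first use, and it represents the /ToUnicode table as a plain dictionary. Real PDFs store this table as a CMap stream, and real extractors have further fallbacks. The function names `build_subset_font` and `extract` are hypothetical.

```python
def build_subset_font(text, first_code=0x41):
    """Assign one code per distinct character, in order of first use.

    Returns (codes, to_unicode): the code sequence a page would carry,
    and the code -> character table a /ToUnicode entry would provide.
    """
    code_for = {}
    for ch in text:
        if ch not in code_for:
            code_for[ch] = first_code + len(code_for)
    codes = [code_for[ch] for ch in text]
    to_unicode = {code: ch for ch, code in code_for.items()}
    return codes, to_unicode


def extract(codes, to_unicode=None):
    """Turn character codes back into text."""
    if to_unicode is not None:
        # With the reverse mapping, every code resolves to its character.
        return "".join(to_unicode[c] for c in codes)
    # Without it, this toy extractor can only guess; here it treats each
    # code as a Latin-1 code point, which yields unrelated characters.
    return "".join(chr(c) for c in codes)


codes, table = build_subset_font("Hello")
print(extract(codes, table))  # uses the reverse mapping
print(extract(codes))         # falls back to guessing
```

In this model the page still renders correctly, because the font draws the right glyph for each code. Only the way back from codes to characters is lost, which matches the symptom of readable pages that copy as garbage.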

A missing /ToUnicode table for a specific font is almost always a sure indicator that text strings using this font cannot be extracted or copy'n'pasted from the PDF. (Even if a /ToUnicode table is present, text extraction may still pose a problem, because the table may be damaged, incorrect or incomplete, as seen in many real-world PDF files and as demonstrated by a few companion files in the GitHub repository mentioned above.)
