How can I extract embedded fonts from a PDF as valid font files?

Question

I'm aware of the pdftk.exe utility that can indicate which fonts are used by a PDF, and whether they are embedded or not.

Now the problem: given PDF files with embedded fonts, how can I extract those fonts in a way that makes them re-usable as regular font files? Are there (preferably free) tools which can do that? Also: can this be done programmatically with, say, iText?

Answer

You have several options. All these methods work on Linux as well as on Windows or Mac OS X. However, be aware that most PDFs do not include the full, complete typeface when they have a font embedded. Mostly they include just the subset of glyphs used in the document.

One of the most frequently used methods to do this on *nix systems consists of the following steps:

  1. Convert the PDF to PostScript, for example by using XPDF's pdftops (on Windows: the pdftops.exe helper program).
  2. The fonts will now be embedded in .pfa (PostScript) format, and you can extract them using a text editor.
  3. You may need to convert the .pfa (ASCII) to a .pfb (binary) file using t1utils and pfa2pfb.
  4. PDFs never embed .pfm or .afm files (font metric files), because PDF viewers have internal knowledge about these. Without them, the font files are hardly usable in a visually pleasing way.
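
Step 2 can also be done with a small script instead of a text editor. The following is only a sketch and not part of the original answer: it assumes the PostScript file wraps each embedded font in the DSC comments %%BeginResource: font NAME ... %%EndResource (check your own pdftops output; other converters may use different markers), and the function name split_fonts is made up for illustration.

```python
import re

# Assumption: every embedded font sits between these two DSC comments.
_FONT_BLOCK = re.compile(
    r"^%%BeginResource: font (\S+)[^\n]*\n(.*?)^%%EndResource",
    re.DOTALL | re.MULTILINE,
)


def split_fonts(postscript_text):
    """Return {font_name: font_program_text} for every font resource found."""
    return {m.group(1): m.group(2) for m in _FONT_BLOCK.finditer(postscript_text)}


if __name__ == "__main__":
    sample = (
        "%!PS-Adobe-3.0\n"
        "%%BeginResource: font ABCDEF+Foo\n"
        "%!PS-AdobeFont-1.0: Foo\n"
        "...font program...\n"
        "%%EndResource\n"
        "showpage\n"
    )
    for name, program in split_fonts(sample).items():
        print(name, len(program))
```

Each returned block could then be written to its own file. Only blocks that really are Type 1 font programs make sense as .pfa files, so inspect the first line of each block before trusting it.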


Using FontForge

Another method is to use the free font editor FontForge:

  1. Use the "Open Font" dialog box that appears when opening files.
  2. Then select "Extract from PDF" in the filter section of the dialog.
  3. Select the PDF file with the font to be extracted.
  4. A "Pick a font" dialog box opens -- select here which font to open.

Check the FontForge manual. You may need to follow a few specific steps which are not necessarily straightforward in order to save the extracted font data as a file which is re-usable.

Next, MuPDF. This application comes with a utility called pdfextract (on Windows: pdfextract.exe) which can extract fonts and images from PDFs. (In case you don't know MuPDF, which is still relatively new and little known: "MuPDF is a Free lightweight PDF viewer and toolkit written in portable C.", written by Artifex Software developers, the same company that gave us Ghostscript.)
(Update: Newer versions of MuPDF have moved the former functionality of 'pdfextract' to the command 'mutool extract'. Download it here: mupdf.com/downloads)

Note: pdfextract.exe is a command-line program. To use it, do the following:

c:>  pdfextract.exe  c:\path\to\filename.pdf         # (on Windows)
$>    pdfextract  /path/to/filename.pdf                # (on Linux, Unix, Mac OS X)

This command dumps all of the extractable files from the referenced PDF file into the current directory. Generally you will see a variety of files: images as well as fonts. These include PNG, TTF, CFF, CID, etc. The image names will be like img-0412.png if the PDF object number of the image was 412. The font names will be like FGETYK+LinLibertineI-0966.ttf if the font's PDF object number was 966.

CFF (Compact Font Format) files are a recognized format that can be converted to other formats via a variety of converters for use on different operating systems.

Again: be aware that most of these font files may have only a subset of characters and may not represent the complete typeface.
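
A prefix such as FGETYK+ in an extracted file name is a strong hint that the font is a subset: PDF producers tag subset fonts with six uppercase letters followed by a plus sign. A minimal sketch for spotting that tag (the function name is hypothetical, and a missing tag is only a hint, not proof, that the font is complete):

```python
import re

# Subset tag convention: six uppercase letters, then "+", then the real name.
_SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None when there is no subset prefix."""
    match = _SUBSET_TAG.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name


if __name__ == "__main__":
    print(split_subset_tag("FGETYK+LinLibertineI"))  # ('FGETYK', 'LinLibertineI')
    print(split_subset_tag("ArialMT"))               # (None, 'ArialMT')
```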

Update: (Jul 2013) Recent versions of mupdf have seen an internal reshuffling and renaming of their binaries, not just once, but several times. The main utility used to be a 'Swiss Army knife'-like binary called mubusy (name inspired by busybox?), which more recently was renamed to mutool. Both support the sub-commands info, clean, extract, poster and show. Unfortunately, the official documentation for these tools isn't up to date (yet). If you're on a Mac using MacPorts, the utility was renamed in order to avoid name clashes with other utilities using identical names, and you may need to use mupdfextract.

To achieve (roughly) the same results with mutool as its predecessor pdfextract produced, just run mutool extract ... (or mubusy extract ... if your version still ships the older name).

So to extract fonts and images, you may need to run one of the following command lines:

c:>  mutool.exe extract filename.pdf      # (on Windows)
$>    mutool     extract filename.pdf      # (on Linux, Unix, Mac OS X)
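
Because the binary has been renamed several times, a script that has to run on different machines can probe for whichever name is installed. This is a sketch only: the candidate names are the ones mentioned in this answer, the function name and the which parameter are made up for illustration, and an unrelated tool that happens to be called pdfextract would be picked up too.

```python
import shutil

# Names mentioned in this answer, newest first; each entry is an argv prefix.
CANDIDATES = [
    ["mutool", "extract"],
    ["mubusy", "extract"],
    ["mupdfextract"],
    ["pdfextract"],
]


def build_extract_command(pdf_path, which=shutil.which):
    """Return the argv for the first extraction tool found on PATH, else None."""
    for prefix in CANDIDATES:
        if which(prefix[0]):
            return prefix + [pdf_path]
    return None


if __name__ == "__main__":
    # Prints the command that would be run, or None if no tool is installed.
    print(build_extract_command("filename.pdf"))
```

The resulting list can be passed to subprocess.run(command, check=True); the extracted files land in the current working directory.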

Downloads are here: mupdf.com/downloads

Then, Ghostscript can also extract fonts directly from PDFs. However, it needs the help of a special utility program named extractFonts.ps, written in PostScript language, which is available from the Ghostscript source code repository.

To use it, you need to pass both this file, extractFonts.ps, and your PDF file to Ghostscript. Ghostscript will then use the instructions from the PostScript program to extract the fonts from the PDF. It looks like this on Windows (yes, Ghostscript understands the 'forward slash', /, as a path separator on Windows too!):

gswin32c.exe                  ^
  -q -dNODISPLAY              ^
   c:/path/to/extractFonts.ps ^
  -c "(c:/path/to/your/PDFFile.pdf) extractFonts quit"

or on Linux, Unix or Mac OS X:

gs \
  -q -dNODISPLAY \
   /path/to/extractFonts.ps \
  -c "(/path/to/your/PDFFile.pdf) extractFonts quit"

I tested the Ghostscript method a few years ago. At the time it extracted *.ttf (TrueType) fonts just fine. I don't know whether other font types will be extracted at all, and if so, in a re-usable way. I also don't know whether the utility blocks the extraction of fonts which are marked as protected.

Finally, Didier Stevens' pdf-parser.py: this one is probably not as easy to use, because you need to have some know-how about internal PDF structures. pdf-parser.py is a Python script which can do a lot of other things too. It can also decompress and extract arbitrary streams from objects, and therefore it can extract embedded font files too.

But you need to know what to look for. Let's see it with an example. I have a file named big1.pdf. As a first step I use the -s parameter to search the PDF for any occurrence of the keyword FontFile (the search in pdf-parser.py is not case-sensitive):

pdf-parser.py -s fontfile big1.pdf

For my big1.pdf, I get this result:

obj 9 0
 Type: /FontDescriptor
 Referencing: 15 0 R
  <<   
    /Ascent 728
    /CapHeight 716
    /Descent -210 
    /Flags 32
    /FontBBox [ -665 -325 2000 1006 ]
    /FontFile2 15 0 R
    /FontName /ArialMT
    /ItalicAngle 0
    /StemV 87
    /Type /FontDescriptor
    /XHeight 519
  >>   

obj 11 0 
 Type: /FontDescriptor
 Referencing: 16 0 R
  <<   
    /Ascent 728
    /CapHeight 716
    /Descent -210 
    /Flags 262176
    /FontBBox [ -628 -376 2000 1018 ]
    /FontFile2 16 0 R
    /FontName /Arial-BoldMT
    /ItalicAngle 0
    /StemV 165
    /Type /FontDescriptor
    /XHeight 519
  >>   

It tells me that there are two instances of FontFile2 inside the PDF, and these are in PDF objects no. 15 and no. 16, respectively. Object no. 15 holds the /FontFile2 for font /ArialMT, object no. 16 holds the /FontFile2 for font /Arial-BoldMT.

To show this more clearly:

pdf-parser.py -s fontfile big1.pdf | grep -i fontfile
  /FontFile2 15 0 R
  /FontFile2 16 0 R

A quick look at the PDF specification reveals that the keyword /FontFile2 relates to a 'stream containing a TrueType font program' (/FontFile would relate to a 'stream containing a Type 1 font program', and /FontFile3 to a 'stream containing a font program whose format is specified by the Subtype entry in the stream dictionary', hence being either a Type1C or a CIDFontType0C subtype).
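
The three keys can be summarised in a small lookup table. The sketch below assumes a font descriptor that has already been parsed into a plain Python dict of strings, which is a hypothetical representation chosen for illustration and not what pdf-parser.py returns:

```python
# Key -> kind of font program, as described in the paragraph above.
FONT_FILE_KEYS = {
    "/FontFile": "Type 1 font program",
    "/FontFile2": "TrueType font program",
    "/FontFile3": "format given by the stream's /Subtype (e.g. Type1C, CIDFontType0C)",
}


def font_file_entry(descriptor):
    """Return (key, reference, description) for the font-file key present, else None."""
    for key, description in FONT_FILE_KEYS.items():
        if key in descriptor:
            return key, descriptor[key], description
    return None


if __name__ == "__main__":
    # Hypothetical, hand-written stand-in for object 9 from the output above.
    descriptor = {"/FontName": "/ArialMT", "/FontFile2": "15 0 R"}
    print(font_file_entry(descriptor))
```

A descriptor without any of the three keys describes a font that is not embedded, so there is nothing to extract for it.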

To look specifically at PDF object no. 15 (which holds the font /ArialMT), one can use the -o 15 parameter:

pdf-parser.py -o 15 big1.pdf

 obj 15 0
  Type: 
  Referencing: 
  Contains stream
   <<
     /Length1 778552
     /Length 1581435
     /Filter /ASCIIHexDecode
   >>

This pdf-parser.py output tells us that this object contains a stream (which it will not display directly) with a length of 1,581,435 bytes. The stream is encoded (loosely speaking, "compressed") with ASCIIHexEncode and needs to be decoded ("de-compressed" or "filtered") with the help of the standard /ASCIIHexDecode filter.
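
To make the filter less mysterious: ASCIIHexDecode simply turns pairs of hexadecimal digits back into bytes, ignoring whitespace and stopping at a > marker. A minimal stand-alone sketch of that rule follows (this is not pdf-parser.py's own code, and the function name is made up):

```python
PDF_WHITESPACE = b"\x00\t\n\x0c\r "
HEX_DIGITS = b"0123456789abcdefABCDEF"


def ascii_hex_decode(data):
    """Decode ASCIIHexDecode-filtered stream data (bytes in, bytes out)."""
    digits = bytearray()
    for byte in data:
        if byte == ord(">"):        # end-of-data marker
            break
        if byte in PDF_WHITESPACE:  # whitespace between digits is ignored
            continue
        if byte not in HEX_DIGITS:
            raise ValueError(f"unexpected byte {byte:#04x} in hex stream")
        digits.append(byte)
    if len(digits) % 2:             # an odd final digit counts as followed by 0
        digits.append(ord("0"))
    return bytes.fromhex(digits.decode("ascii"))


if __name__ == "__main__":
    print(ascii_hex_decode(b"48656c6c6f>"))  # b'Hello'
```

Since every byte becomes two hex digits, the encoded stream is roughly twice as large as the decoded data, which fits the /Length and /Length1 values shown above.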

To dump any stream from an object, pdf-parser.py can be called with the -d dumpname parameter. Let's do it:

pdf-parser.py -o 15 -d dumped-data.ext big1.pdf

Our extracted data dump will be in the file named dumped-data.ext. Let's see how big it is:

ls -l dumped-data.ext
  -rw-r--r--  1 kurtpfeifle  staff  1581435 Apr 11 00:29 dumped-data.ext

Oh look, it is 1,581,435 bytes. We saw this figure in the previous command's output. Opening this file with a text editor confirms that its content is ASCII hex encoded data.

Opening the file with a font reading tool like otfinfo (this is a part of the lcdf-typetools package) will lead to some disappointment at first:

otfinfo -i dumped-data.ext
  otfinfo: dumped-data.ext: not an OpenType font (bad magic number)

OK, this is because we did not (yet) let pdf-parser.py make use of its full magic: to dump a filtered, decoded stream. For this we have to add the -f parameter:

pdf-parser.py -o 15 -f -d dumped-data-decoded.ext big1.pdf

How big is this new file?

ls -l dumped-data-decoded.ext
  -rw-r--r--  1 kurtpfeifle  staff  778552 Apr 11 00:39 dumped-data-decoded.ext

Oh, look: that exact number was also already stored in the PDF object no. 15 dictionary as the value for key /Length1...

What does file think it is?

file dumped-data-decoded.ext
  dumped-data-decoded.ext: TrueType font data
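
Tools like file and otfinfo decide this by looking at the first few bytes of the data. The following sketch does the same for a handful of well-known signatures; the list is illustrative and far from complete, and the function name is made up:

```python
# Leading bytes -> description (a small, incomplete selection).
SIGNATURES = [
    (b"\x00\x01\x00\x00", "TrueType font"),
    (b"true", "TrueType font (Apple)"),
    (b"OTTO", "OpenType font with CFF outlines"),
    (b"ttcf", "TrueType/OpenType collection"),
    (b"%!", "PostScript text, possibly a Type 1 font (.pfa)"),
    (b"\x80\x01", "Type 1 font in binary form (.pfb)"),
]


def sniff_font(data):
    """Guess the font container type from the leading bytes of data."""
    for magic, description in SIGNATURES:
        if data.startswith(magic):
            return description
    return "unknown"


if __name__ == "__main__":
    print(sniff_font(b"\x00\x01\x00\x00" + b"\x00" * 8))  # TrueType font
    print(sniff_font(b"3030303130303030"))                # unknown
```

The second call mimics a dump that is still hex-encoded: it starts with ASCII digits rather than a font signature, which is why the undecoded file was rejected earlier.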

What does otfinfo tell us?

otfinfo -i dumped-data-decoded.ext
  Family:              Arial
  Subfamily:           Regular
  Full name:           Arial
  PostScript name:     ArialMT
  Version:             Version 5.10
  Unique ID:           Monotype:Arial Regular:Version 5.10 (Microsoft)
  Designer:            Monotype Type Drawing Office - Robin Nicholas, Patricia Saunders 1982
  Manufacturer:        The Monotype Corporation
  Trademark:           Arial is a trademark of The Monotype Corporation.
  Copyright:           © 2011 The Monotype Corporation. All Rights Reserved.
  License Description: You may use this font to display and print content as permitted by
                       the license terms for the product in which this font is included.
                       You may only (i) embed this font in content as permitted by the 
                       embedding restrictions included in this font; and (ii) temporarily 
                       download this font to a printer or other output device to help
                       print content.
  Vendor ID:           TMC

So Bingo!, we have a winner: pdf-parser.py did indeed extract a valid font file for us. Given the size of this file (778,552 bytes), it looks like this font was even embedded completely in the PDF...

We could rename it to arial-regular.ttf and install it as such and happily make use of it.

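  • In any case, you need to follow the license that applies to the font. Some font licenses do not allow free use and/or distribution. Pirating fonts is like pirating any software or other copyrighted material.
  • Most PDFs in the wild do not embed the complete font anyway, only subsets. Extracting a subset of a font is useful only to a very limited extent.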

Please do also read the following about Pros and (more) Cons regarding font extraction efforts:

  • http://typophile.com/node/34377 (no longer available, but can be seen on the Wayback Machine at https://web.archive.org/web/20110717120241/typophile.com/node/34377)
