How can I extract embedded fonts from a PDF as valid font files?


Problem description


I'm aware of the pdftk.exe utility that can indicate which fonts are used by a PDF, and whether they are embedded or not.

Now the problem: given I had PDF files with embedded fonts -- how can I extract those fonts in a way that they are re-usable as regular font files? Are there (preferably free) tools which can do that? Also: can this be done programmatically with, say, iText?

Solution

You have several options. All these methods work on Linux as well as on Windows or Mac OS X. However, be aware that most PDFs do not include the full, complete font face when they have a font embedded. Mostly they include just the subset of glyphs used in the document.


Using pdftops

One of the most frequently used methods to do this on *nix systems consists of the following steps:

  1. Convert the PDF to PostScript, for example by using XPDF's pdftops (on Windows: pdftops.exe) helper program.
  2. The fonts will now be embedded in .pfa (PostScript) format, and you can extract them using a text editor.
  3. You may need to convert the .pfa (ASCII) to a .pfb (binary) file using the t1utils and pfa2pfb.
  4. PDFs never embed .pfm or .afm files (font metric files), because PDF viewers have internal knowledge about these. Without them, font files are hardly usable in a visually pleasing way.
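
Step 2 can be scripted instead of done by hand in a text editor. The following is only a minimal sketch. It assumes that the converter brackets each embedded font with the DSC comments `%%BeginResource: font NAME` and `%%EndResource`, which is common in pdftops output but not guaranteed for every version or font type. The function names are hypothetical helpers, not part of any tool mentioned here.

```python
import re

# Assumption: each embedded font is bracketed by DSC comments like
#   %%BeginResource: font ABCDEF+SomeFont
#   ...font program...
#   %%EndResource
FONT_RESOURCE = re.compile(
    r"^%%BeginResource: font (\S+)[^\n]*\n(.*?)^%%EndResource",
    re.DOTALL | re.MULTILINE,
)

def split_font_resources(ps_text):
    """Return {font_name: font_program_text} for each bracketed font resource."""
    return {m.group(1): m.group(2) for m in FONT_RESOURCE.finditer(ps_text)}

def looks_like_type1(program):
    """ASCII (.pfa) Type 1 programs start with one of these header comments."""
    return program.lstrip().startswith(("%!PS-AdobeFont", "%!FontType1"))
```

To try it, read the PostScript file with `open(path, encoding="latin-1")` so that stray binary bytes do not raise decoding errors, pass the text to `split_font_resources`, and write each value that passes `looks_like_type1` to its own `.pfa` file.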


Using fontforge

Another method is to use the Free font editor FontForge:

  1. Use the "Open Font" dialog box used when opening files.
  2. Then select "Extract from PDF" in the filter section of the dialog.
  3. Select the PDF file with the font to be extracted.
  4. A "Pick a font" dialog box opens -- select here which font to open.

Check the FontForge manual. You may need to follow a few specific steps which are not necessarily straightforward in order to save the extracted font data as a file which is re-usable.


Using mupdf

Next, MuPDF. This application comes with a utility called pdfextract (on Windows: pdfextract.exe) which can extract fonts and images from PDFs. (In case you don't know about MuPDF, which still is relatively unknown and new: "MuPDF is a Free lightweight PDF viewer and toolkit written in portable C.", written by Artifex Software developers, the same company that gave us Ghostscript.)
(Update: Newer versions of MuPDF have moved the former functionality of 'pdfextract' to the command 'mutool extract'. Download it here: mupdf.com/downloads)

Note: pdfextract.exe is a command-line program. To use it, do the following:

c:\>  pdfextract.exe  c:\path\to\filename.pdf         # (on Windows)
$>    pdfextract  /path/to/filename.pdf               # (on Linux, Unix, Mac OS X)

This command will dump all of the extractable files from the pdf file referenced into the current directory. Generally you will see a variety of files: images as well as fonts. These include PNG, TTF, CFF, CID, etc. The image names will be like img-0412.png if the PDF object number of the image was 412. The fontnames will be like FGETYK+LinLibertineI-0966.ttf, if the font's PDF object number was 966.

CFF (Compact Font Format) files are a recognized format that can be converted to other formats via a variety of converters for use on different operating systems.

Again: be aware that most of these font files may have only a subset of characters and may not represent the complete typeface.
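
The file names described above already tell you whether a font is a subset: subset fonts carry a tag of six uppercase letters followed by `+` in front of the font name. As a rough sketch, assuming the `NAME-OBJECTNUMBER.ext` naming pattern shown in the examples above (other MuPDF versions may name files differently), the names can be taken apart like this. `parse_extracted_name` is a hypothetical helper.

```python
import re

# Optional six-letter subset tag, font (or image) name, object number, extension.
NAME = re.compile(r"^(?:([A-Z]{6})\+)?(.+)-(\d+)\.([A-Za-z0-9]+)$")

def parse_extracted_name(filename):
    """Split an extracted file name into its parts, or return None if it does not fit."""
    m = NAME.match(filename)
    if m is None:
        return None
    tag, name, objnum, ext = m.groups()
    return {
        "subset_tag": tag,             # e.g. "FGETYK", or None for a non-subset font
        "font_name": name,
        "object": int(objnum),         # PDF object number
        "ext": ext.lower(),
        "is_subset": tag is not None,  # a tag means only some glyphs are present
    }
```

For example, `parse_extracted_name("FGETYK+LinLibertineI-0966.ttf")` reports the tag `FGETYK`, the name `LinLibertineI` and object number 966, and flags the file as a subset.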

Update: (Jul 2013) Recent versions of mupdf have seen an internal reshuffling and renaming of their binaries, not just once, but several times. The main utility used to be a 'Swiss army knife'-style binary called mubusy (name inspired by busybox?), which more recently was renamed to mutool. These support the sub-commands info, clean, extract, poster and show. Unfortunately, the official documentation for these tools isn't up to date (yet). If you're on a Mac using MacPorts, the utility was renamed in order to avoid name clashes with other utilities using identical names, and you may need to use mupdfextract.

To achieve roughly the same results as the earlier pdfextract tool, just run mubusy extract ... (or mutool extract ... with more recent versions).

So to extract fonts and images, you may need to run one of the following commandlines:

c:\>  mutool.exe extract filename.pdf      # (on Windows)
$>    mutool     extract filename.pdf      # (on Linux, Unix, Mac OS X)

Downloads are here: mupdf.com/downloads
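
Because the binary has been renamed several times, a script that has to run on different machines can simply try the known names in turn. This is a hedged sketch: the command names are the ones mentioned above, the helper functions are hypothetical, and it assumes that the tool writes its output into the current working directory, as described earlier.

```python
import os
import shutil
import subprocess

# Command prefixes to try, newest naming first (all names taken from the text above).
CANDIDATES = [
    ["mutool", "extract"],
    ["mubusy", "extract"],
    ["mupdfextract"],
    ["pdfextract"],
]

def find_extract_command(which=shutil.which):
    """Return the first command prefix whose binary is on PATH, or None."""
    for prefix in CANDIDATES:
        if which(prefix[0]):
            return list(prefix)
    return None

def extract_all(pdf_path, out_dir="."):
    """Dump fonts and images from pdf_path into out_dir."""
    prefix = find_extract_command()
    if prefix is None:
        raise FileNotFoundError("no MuPDF extraction tool found on PATH")
    # Files land in the current directory, so run the tool inside out_dir.
    subprocess.run(prefix + [os.path.abspath(pdf_path)], cwd=out_dir, check=True)
```

The `which` parameter exists only so that the lookup can be exercised without any of the tools being installed.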


Using gs (Ghostscript)

Then, Ghostscript can also extract fonts directly from PDFs. However, it needs the help of a special utility program named extractFonts.ps, written in PostScript language, which is available from the Ghostscript source code repository.

To use it, you need to run both this file, extractFonts.ps, and your PDF file. Ghostscript will then use the instructions from the PostScript program to extract the fonts from the PDF. It looks like this on Windows (yes, Ghostscript understands the 'forward slash', /, as a path separator also on Windows!):

gswin32c.exe                  ^
  -q -dNODISPLAY              ^
   c:/path/to/extractFonts.ps ^
  -c "(c:/path/to/your/PDFFile.pdf) extractFonts quit"

or on Linux, Unix or Mac OS X:

gs                          \
  -q -dNODISPLAY            \
   /path/to/extractFonts.ps \
  -c "(/path/to/your/PDFFile.pdf) extractFonts quit"
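
The PDF path is passed to Ghostscript inside a PostScript string, `( ... )`, so a path that itself contains parentheses or backslashes has to be escaped. A small sketch of building the same command line from Python follows. The helper names are hypothetical, and the command only mirrors the invocation shown above; it has not been tested against any particular version of extractFonts.ps.

```python
def ps_string(text):
    """Wrap text as a PostScript string literal, escaping \\, ( and )."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"

def build_gs_command(extractor_ps, pdf_path, gs_binary="gs"):
    """Build the argument list for the Ghostscript call shown above."""
    return [
        gs_binary,
        "-q", "-dNODISPLAY",
        extractor_ps,
        # Everything after -c is interpreted as PostScript code.
        "-c", ps_string(pdf_path) + " extractFonts quit",
    ]
```

The resulting list can be handed to `subprocess.run(..., check=True)`. On Windows, pass `gs_binary="gswin32c.exe"`.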

I tested the Ghostscript method a few years ago. At the time it extracted *.ttf (TrueType) just fine. I don't know whether other font types will be extracted at all and, if so, in a re-usable way. Nor do I know whether the utility blocks the extraction of fonts which are marked as protected.


Using pdf-parser.py

Finally, Didier Stevens' pdf-parser.py: this one is probably not as easy to use, because you need to have some know-how about internal PDF structures. pdf-parser.py is a Python script which can do a lot of other things too. It can also decompress and extract arbitrary streams from objects, and therefore it can extract embedded font files too.

But you need to know what to look for. Let's see it with an example. I have a file named big1.pdf. As a first step I use the -s parameter to search the PDF for any occurrence of the keyword FontFile (the search in pdf-parser.py is not case-sensitive):

pdf-parser.py -s fontfile big1.pdf

In my case, for my big1.pdf, I get this result:

obj 9 0
 Type: /FontDescriptor
 Referencing: 15 0 R
  <<   
    /Ascent 728
    /CapHeight 716
    /Descent -210 
    /Flags 32
    /FontBBox [ -665 -325 2000 1006 ]
    /FontFile2 15 0 R
    /FontName /ArialMT
    /ItalicAngle 0
    /StemV 87
    /Type /FontDescriptor
    /XHeight 519
  >>   

obj 11 0 
 Type: /FontDescriptor
 Referencing: 16 0 R
  <<   
    /Ascent 728
    /CapHeight 716
    /Descent -210 
    /Flags 262176
    /FontBBox [ -628 -376 2000 1018 ]
    /FontFile2 16 0 R
    /FontName /Arial-BoldMT
    /ItalicAngle 0
    /StemV 165
    /Type /FontDescriptor
    /XHeight 519
  >>   

It tells me that there are two instances of FontFile2 inside the PDF, and these are in PDF objects no. 15 and no. 16, respectively. Object no. 15 holds the /FontFile2 for font /ArialMT, object no. 16 holds the /FontFile2 for font /Arial-BoldMT.

To show this more clearly:

pdf-parser.py -s fontfile big1.pdf | grep -i fontfile
  /FontFile2 15 0 R
  /FontFile2 16 0 R

A quick peek into the PDF specification reveals that the keyword /FontFile2 relates to a 'stream containing a TrueType font program' (/FontFile would relate to a 'stream containing a Type 1 font program' and /FontFile3 would relate to a 'stream containing a font program whose format is specified by the Subtype entry in the stream dictionary' {hence being either a Type1C or a CIDFontType0C subtype}.)
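
The three keys can be summarised as a small lookup table. This is an illustrative sketch: the descriptions follow the specification wording quoted above (the specification also allows the /FontFile3 subtype OpenType), while the file extensions are only a naming suggestion and an assumption of this example. `describe_font_stream` is a hypothetical helper.

```python
def describe_font_stream(key, subtype=None):
    """Map a font descriptor key (and /FontFile3 subtype) to (description, extension)."""
    key = key.lstrip("/")
    subtype = subtype.lstrip("/") if subtype else None
    if key == "FontFile":
        return ("Type 1 font program", ".pfa")
    if key == "FontFile2":
        return ("TrueType font program", ".ttf")
    if key == "FontFile3":
        # The format depends on the Subtype entry of the stream dictionary.
        by_subtype = {
            "Type1C": ("Type 1 font in Compact Font Format (CFF)", ".cff"),
            "CIDFontType0C": ("CID-keyed font in Compact Font Format (CFF)", ".cid"),
            "OpenType": ("OpenType font program", ".otf"),
        }
        return by_subtype.get(subtype, ("unknown /FontFile3 subtype", None))
    return ("not a font file key", None)
```

For the example file, `describe_font_stream("/FontFile2")` gives a TrueType font program with the suggested extension `.ttf`.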

To look specifically at PDF object no. 15 (which holds the font /ArialMT), one can use the -o 15 parameter:

pdf-parser.py -o 15 big1.pdf

 obj 15 0
  Type: 
  Referencing: 
  Contains stream
   <<
     /Length1 778552
     /Length 1581435
     /Filter /ASCIIHexDecode
   >>

This pdf-parser.py output tells us that this object contains a stream (which it will not display directly) with a length of 1,581,435 bytes. The stream is encoded with ASCIIHexEncode and needs to be decoded ("filtered") with the help of the standard /ASCIIHexDecode filter. Strictly speaking this is an encoding rather than compression: every byte is written as two hexadecimal digits, so the encoded stream is roughly twice as large as the decoded data.
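
To make the filter less mysterious, here is a minimal sketch of what /ASCIIHexDecode does, written from the rules in the PDF specification: whitespace is ignored, `>` marks the end of the data, and a trailing odd digit is treated as if it were followed by `0`. It is an illustration only, not the code that pdf-parser.py uses.

```python
import binascii

HEX_DIGITS = b"0123456789abcdefABCDEF"

def ascii_hex_decode(data):
    """Decode ASCIIHex-encoded bytes, e.g. b"00 01 00 00>"."""
    # '>' is the end-of-data marker; anything after it is not part of the stream.
    end = data.find(b">")
    if end != -1:
        data = data[:end]
    # Keep only hex digits. Whitespace is skipped; this sketch also silently
    # drops any other stray byte instead of reporting an error.
    digits = bytes(b for b in data if b in HEX_DIGITS)
    # An odd number of digits is padded with a final '0'.
    if len(digits) % 2:
        digits += b"0"
    return binascii.unhexlify(digits)
```

Two hex digits become one byte, which is why the decoded font is about half the size of the encoded stream.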

To dump any stream from an object, pdf-parser.py can be called with the -d dumpname parameter. Let's do it:

pdf-parser.py -o 15 -d dumped-data.ext big1.pdf

Our extracted data dump will be in the file named dumped-data.ext. Let's see how big it is:

ls -l dumped-data.ext
  -rw-r--r--  1 kurtpfeifle  staff  1581435 Apr 11 00:29 dumped-data.ext

Oh look, it is 1,581,435 bytes. We saw this figure in the previous command's output. Opening this file with a text editor confirms that its content is ASCII hex encoded data.

Opening the file with a font reading tool like otfinfo (this is a part of the lcdf-typetools package) will lead to some disappointment at first:

otfinfo -i dumped-data.ext
  otfinfo: dumped-data.ext: not an OpenType font (bad magic number)

OK, this is because we did not (yet) let pdf-parser.py make use of its full magic: to dump a filtered, decoded stream. For this we have to add the -f parameter:

pdf-parser.py -o 15 -f -d dumped-data-decoded.ext big1.pdf

What is the size of this new file?

ls -l dumped-data-decoded.ext
  -rw-r--r--  1 kurtpfeifle  staff  778552 Apr 11 00:39 dumped-data-decoded.ext

Oh, look: that exact number was also already stored in the PDF object no. 15 dictionary as the value for key /Length1...

What does file think it is?

file dumped-data-decoded.ext
  dumped-data-decoded.ext: TrueType font data
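
Tools like file and otfinfo recognise a font by its first few bytes (the "magic number"), which is also why the undecoded dump was rejected. The sketch below checks a handful of well-known signatures. It is a simplified illustration, not a replacement for file, and it does not cover every font format (bare CFF data, for instance, is not detected).

```python
def sniff_font_type(data):
    """Guess the font container type from the first bytes of the data."""
    head = data[:4]
    if head in (b"\x00\x01\x00\x00", b"true"):
        return "TrueType"
    if head == b"OTTO":
        return "OpenType (CFF outlines)"
    if head == b"ttcf":
        return "TrueType/OpenType collection"
    if head[:2] == b"%!":
        return "PostScript (possibly Type 1 ASCII)"
    if head[:2] == b"\x80\x01":
        return "Type 1 binary (PFB)"
    return "unknown"
```

Reading the first four bytes of a dumped stream, for example with `open(path, "rb").read(4)`, and passing them to `sniff_font_type` gives a quick check before renaming the file. ASCII hex text such as `0001...` is reported as unknown, because the signature only appears after decoding.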

What does otfinfo tell us about it?

otfinfo -i dumped-data-decoded.ext
  Family:              Arial
  Subfamily:           Regular
  Full name:           Arial
  PostScript name:     ArialMT
  Version:             Version 5.10
  Unique ID:           Monotype:Arial Regular:Version 5.10 (Microsoft)
  Designer:            Monotype Type Drawing Office - Robin Nicholas, Patricia Saunders 1982
  Manufacturer:        The Monotype Corporation
  Trademark:           Arial is a trademark of The Monotype Corporation.
  Copyright:           © 2011 The Monotype Corporation. All Rights Reserved.
  License Description: You may use this font to display and print content as permitted by
                       the license terms for the product in which this font is included.
                       You may only (i) embed this font in content as permitted by the 
                       embedding restrictions included in this font; and (ii) temporarily 
                       download this font to a printer or other output device to help
                       print content.
  Vendor ID:           TMC

So Bingo!, we have a winner: pdf-parser.py did indeed extract a valid font file for us. Given the size of this file (778,552 bytes), it looks like this font was even embedded completely in the PDF...

We could rename it to arial-regular.ttf and install it as such and happily make use of it.


Caveats:

  • In any case you need to follow the license that applies to the font. Some font licences do not allow free use and/or distribution. Pirating fonts is like pirating any software or other copyrighted material.

  • Most PDFs which are in the wild out there do not embed the full font anyway, but only subsets. Extracting a subset of a font is only useful in a very limited scope, if at all.

Please do also read the following about Pros and (more) Cons regarding font extraction efforts:
