PDF text extraction issue - font/capitalization inconsistencies


Question

I am trying to extract text from a PDF book and keep running into an issue where sections of copied text fail to retain their proper capitalization when pasted into a text document. I have the rights to reproduce the book and also have a license to use all necessary fonts. At first I thought the issue was caused by the fonts not being embedded, but I checked and all fonts appear to be embedded as subsets. Within the PDF there are over 100 fonts, each of which has one of the following properties:

TrueType Encoding: Ansi
TrueType (CID) Encoding: Identity-H
Type 1 (CID) Encoding: Identity-H
Type 1 Encoding: Custom

The languages within the book include English, German, Spanish and Italian. In German, capitalization is absolutely critical. The extraction tends to lose uppercase more often than lowercase.

An example of the error would be: WELD -> weld

I am really at a loss as to what to do here. I asked the owner of the book to embed the fonts, which he has done as subsets, but the problem continues. I have tried saving the PDF file as PostScript and then running it through Distiller, which corrected much of the problem, but in some cases resulted in text being replaced with different characters or numbers showing up as skulls. I understand that CID fonts might be contributing to the issue, but I have come across an instance where a non-CID font had the same result.

What could be causing this issue? Is it that the fonts are subset rather than fully embedded? Is there a better way to save the native file (InDesign) to a PDF that will allow for better font extraction? Does it have to do with non-Unicode fonts, and if so, is there an alternative that does not require the owner to select different fonts?

Any help would be greatly appreciated.

Recommended answer

That's indeed funny. The sample PDF provided by the OP visibly contains upper case characters, some in upper-case-only lines and some in mixed-case lines, which Adobe Reader extracts as lower case characters.

You wanted to know

What could be causing this issue?

As an example of how that happens, let's look at the phrase that is extracted as Pelle Più bella.

In the page content, that phrase actually looks like its visual representation, in capital letters:

/T1_0 1 Tf
-0.025 Tc 12 0 0 12 379.5354 554.8809 Tm
(PELLE PI\331 BELLA)Tj

Looking at the font used, T1_0 (a DIN-Bold subset), we see that it claims to use WinAnsiEncoding, which would also indicate that those character codes in the page stream are to be interpreted as capital letters.

But the font also has a ToUnicode mapping, and this mapping maps:

<41> <0061> — 'A' → a
<42> <0062> — 'B' → b
<43> <0043> — 'C' → C
<44> <0044> — 'D' → D
<45> <0065> — 'E' → e
<49> <0069> — 'I' → i
<4C> <006C> — 'L' → l
<4D> <004D> — 'M' → M
<4E> <006E> — 'N' → n
<50> <0050> — 'P' → P
<52> <0072> — 'R' → r
<53> <0053> — 'S' → S
<54> <0074> — 'T' → t
<D9> <00F9> — 'Ù' → ù

(I only extracted the mappings from character codes which in WinAnsiEncoding represent capital letters.)
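To make the effect concrete, here is a minimal sketch in plain Java (no PDF library involved) that applies the ToUnicode entries listed above to the character codes of the string in the page content. The class name `ToUnicodeDemo` and the method `extract` are made up for illustration. The fallback for codes without an entry (here only the space) is an assumption, not something read from the sample file.

```java
import java.util.HashMap;
import java.util.Map;

class ToUnicodeDemo {
    // ToUnicode entries copied from the listing above:
    // character code -> Unicode code point.
    static final Map<Integer, Integer> TO_UNICODE = new HashMap<>();
    static {
        int[][] entries = {
            {0x41, 0x0061}, {0x42, 0x0062}, {0x43, 0x0043}, {0x44, 0x0044},
            {0x45, 0x0065}, {0x49, 0x0069}, {0x4C, 0x006C}, {0x4D, 0x004D},
            {0x4E, 0x006E}, {0x50, 0x0050}, {0x52, 0x0072}, {0x53, 0x0053},
            {0x54, 0x0074}, {0xD9, 0x00F9}
        };
        for (int[] e : entries) {
            TO_UNICODE.put(e[0], e[1]);
        }
    }

    // Decode character codes the way an extractor that trusts ToUnicode would.
    // Assumption: a code without an entry falls back to its own value, which
    // agrees with WinAnsiEncoding for the only such code used here (space).
    static String extract(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(TO_UNICODE.getOrDefault(code, code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Codes of (PELLE PI\331 BELLA); octal 331 is 0xD9.
        int[] codes = {0x50, 0x45, 0x4C, 0x4C, 0x45, 0x20, 0x50, 0x49, 0xD9,
                       0x20, 0x42, 0x45, 0x4C, 0x4C, 0x41};
        System.out.println(extract(codes)); // Pelle Pi\u00F9 bella
    }
}
```

Because P has an identity entry while E, L, I, Ù, B and A map to their lowercase forms, the all-caps string comes out in mixed case, which matches what Adobe Reader copies.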

Is there a better way to save the native file (InDesign) to a pdf that will allow for better font extraction?

Sorry, I'm not really into InDesign. But with that software being from Adobe, I would be surprised if this were a bug in InDesign or its PDF export. Could it instead be that there is some information in the InDesign file which tags PELLE PIÙ BELLA as Pelle Più bella, and which InDesign then translates into this ToUnicode mapping during PDF export?

Does it have to do with non-unicode fonts and if so is there an alternative that does not require the owner to select different fonts?

In the case of your sample document there are three fonts, all with an Encoding entry of WinAnsiEncoding and all embedded as subsets, but only two of them, DIN-Medium and DIN-Bold, have such funny ToUnicode mappings, while Helvetica has no ToUnicode mapping at all. So it is somehow font related; how exactly, I cannot say.

A workaround in the case of your sample document would be to remove the ToUnicode mapping from the font dictionaries.

For example, using Java and the iText library, you can do that like this:

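// Assumes iText 5.x (package com.itextpdf.text.pdf); INPUT and OUTPUT are file paths.
import java.io.FileOutputStream;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;

PdfReader reader = new PdfReader(INPUT);
// Walk all indirect objects and strip /ToUnicode from every font dictionary.
for (int i = 1; i <= reader.getXrefSize(); i++)
{
    PdfObject obj = reader.getPdfObject(i);
    if (obj != null && obj.isDictionary())
    {
        PdfDictionary dic = (PdfDictionary) obj;
        if (PdfName.FONT.equals(dic.getAsName(PdfName.TYPE)))
        {
            dic.remove(PdfName.TOUNICODE);
        }
    }
}
// Write the manipulated document to OUTPUT.
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(OUTPUT));
stamper.close();
reader.close();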

After this manipulation, text extraction in Adobe Reader results in

PELLE PIÙ BELLA

This obviously only works in situations like the one in your sample document.

If your other documents contain a mixture of fonts, some of which require their ToUnicode map for text extraction while others are like the troublesome fonts above, you might want to add some extra conditions to the Java code so that it only removes the map from the buggy font definitions.
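One possible extra condition, sketched here in plain Java as a hypothetical heuristic rather than anything iText provides: treat a font's ToUnicode map as buggy if a code that denotes an uppercase letter is mapped to the lowercase form of the same letter. The class `CaseFoldingCheck` and the method `looksCaseFolded` are made-up names. The sketch assumes the map has already been parsed into code/string pairs (parsing the ToUnicode stream is not shown) and that the font uses WinAnsiEncoding, so that the code value equals the expected code point for the letters concerned.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class CaseFoldingCheck {
    // Hypothetical heuristic: returns true if at least one code that is an
    // uppercase letter maps to the lowercase form of that same letter.
    // Assumption: the code value is the expected code point, which holds for
    // WinAnsiEncoding letters outside the 0x80-0x9F range; codes in that
    // range are simply skipped by the isUpperCase test.
    static boolean looksCaseFolded(Map<Integer, String> toUnicode) {
        for (Map.Entry<Integer, String> e : toUnicode.entrySet()) {
            int code = e.getKey();
            if (!Character.isUpperCase(code)) {
                continue;
            }
            String expected = new String(Character.toChars(code));
            String mapped = e.getValue();
            if (!mapped.equals(expected)
                    && mapped.equals(expected.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Integer, String> dinBold = new LinkedHashMap<>();
        dinBold.put(0x41, "a");      // 'A' -> a, as in the sample font
        dinBold.put(0x43, "C");      // 'C' -> C
        dinBold.put(0xD9, "\u00F9"); // 'Ù' -> ù
        System.out.println(looksCaseFolded(dinBold)); // true

        Map<Integer, String> sane = new LinkedHashMap<>();
        sane.put(0x41, "A");
        sane.put(0x61, "a");
        System.out.println(looksCaseFolded(sane)); // false
    }
}
```

In the removal loop, such a check would guard the call that removes the ToUnicode entry, so that fonts whose map is actually needed for extraction (for example the Identity-H CID fonts mentioned in the question) keep it. Whether this heuristic is strict enough for the full book is something only a test against the real files can tell.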
