PDFBox 2.0: Overcoming dictionary key encoding


Question

I am using Apache PDFBox 2.0.1 to extract text from PDF forms, including the details of AcroForm fields. For a radio button field I dig out the appearance dictionary; I am interested in its /N and /D entries (normal and "down" appearance). Like this (interactive BeanShell):

field = form.getField(fieldName);
widgets = field.getWidgets();
print("Field Name: " + field.getPartialName() + " (" + widgets.size() + ")");
for (annot : widgets) {
  ap = annot.getAppearance();
  keys = ap.getCOSObject().getDictionaryObject("N").keySet();
  keyList = new ArrayList(keys.size());
  for (cosKey : keys) {keyList.add(cosKey.getName());}
  print(String.join("|", keyList));
}

The output is:

Field Name: Krematorier (6)
Off|Skogskrem
Off|R�cksta
Off|Silverdal
Off|Stork�llan
Off|St Botvid
Nyn�shamn|Off

The question mark blotches should be the Swedish characters "ä" or "å". Using iText RUPS I can see that the dictionary keys are encoded with ISO-8859-1 while PDFBox assumes they are Unicode, I guess.

Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?

This sample PDF form can be downloaded here: http://www.stockholm.se/PageFiles/85478/KYF%20211%20Best%C3%A4llning%202014.pdf

Answer


Using iText RUPS I can see that the dictionary keys are encoded with ISO-8859-1 while PDFBox assumes they are Unicode, I guess.

Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?


Changing the assumed encoding


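PDFBox's interpretation of the encoding of the bytes in names (only names can be used as dictionary keys in PDFs) takes place in BaseParser.parseCOSName() when the name is read from the source PDF: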

/**
 * This will parse a PDF name from the stream.
 *
 * @return The parsed PDF name.
 * @throws IOException If there is an error reading from the stream.
 */
protected COSName parseCOSName() throws IOException
{
    readExpectedChar('/');
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    int c = seqSource.read();
    while (c != -1)
    {
        int ch = c;
        if (ch == '#')
        {
            int ch1 = seqSource.read();
            int ch2 = seqSource.read();
            if (isHexDigit((char)ch1) && isHexDigit((char)ch2))
            {
                String hex = "" + (char)ch1 + (char)ch2;
                try
                {
                    buffer.write(Integer.parseInt(hex, 16));
                }
                catch (NumberFormatException e)
                {
                    throw new IOException("Error: expected hex digit, actual='" + hex + "'", e);
                }
                c = seqSource.read();
            }
            else
            {
                // check for premature EOF
                if (ch2 == -1 || ch1 == -1)
                {
                    LOG.error("Premature EOF in BaseParser#parseCOSName");
                    c = -1;
                    break;
                }
                seqSource.unread(ch2);
                c = ch1;
                buffer.write(ch);
            }
        }
        else if (isEndOfName(ch))
        {
            break;
        }
        else
        {
            buffer.write(ch);
            c = seqSource.read();
        }
    }
    if (c != -1)
    {
        seqSource.unread(c);
    }
    String string = new String(buffer.toByteArray(), Charsets.UTF_8);
    return COSName.getPDFName(string);
}

As you can see, after reading the name bytes and interpreting the # escape sequences, PDFBox unconditionally interprets the resulting bytes as UTF-8. To change this, you therefore have to patch this PDFBox class and replace the charset named at the bottom.

According to the specification, when treating a name object as text


the sequence of bytes (after expansion of NUMBER SIGN sequences, if any) should be interpreted according to UTF-8, a variable-length byte-encoded representation of Unicode in which the printable ASCII characters have the same representations as in ASCII.

(section 7.3.5 Name Objects, ISO 32000-1)

BaseParser.parseCOSName() implements just that.

PDFBox's implementation is not completely correct, though, because interpreting the name as a string when there is no need to do so is already wrong:


name objects shall be treated as atomic within a PDF file. Ordinarily, the bytes making up the name are never treated as text to be presented to a human user or to an application external to a conforming reader. However, occasionally the need arises to treat a name object as text

Thus, PDF libraries should handle names as byte arrays for as long as possible and derive a string representation only when one is explicitly required; only then should the recommendation above (to assume UTF-8) play a role. The specification even indicates where this may cause trouble:


PDF does not prescribe what UTF-8 sequence to choose for representing any given piece of externally specified text as a name object. In some cases, multiple UTF-8 sequences may represent the same logical text. Name objects defined by different sequences of bytes constitute distinct name objects in PDF, even though the UTF-8 sequences may have identical external interpretations.

Another situation becomes apparent in the document at hand: if the sequence of bytes does not constitute valid UTF-8, it is still a valid name. But such names are changed by the method above; any unparsable byte or subsequence is replaced by the Unicode Replacement Character '�' (U+FFFD). Thus, different names may collapse into a single one.
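The collapse can be reproduced outside PDFBox with plain Java. The following is only a minimal sketch: the class name, the method name, and the two three-byte names are made up for illustration, and the method merely mirrors the final `new String(..., UTF_8)` step of parseCOSName() shown above.

```java
import java.nio.charset.StandardCharsets;

public class NameCollapseDemo {

    // Mirrors only the last step of parseCOSName(): lenient UTF-8 decoding.
    // (Hypothetical helper, not part of PDFBox.)
    public static String decodeLikeParser(byte[] nameBytes) {
        return new String(nameBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two distinct byte sequences: 0xE5 and 0xE4 are a-ring and a-umlaut
        // in ISO-8859-1, but neither forms valid UTF-8 when followed by 'c'.
        byte[] nameA = {'R', (byte) 0xE5, 'c'};
        byte[] nameB = {'R', (byte) 0xE4, 'c'};

        String a = decodeLikeParser(nameA);
        String b = decodeLikeParser(nameB);

        // Both invalid bytes become U+FFFD, so the two strings are equal.
        System.out.println(a.equals(b));
        System.out.println(a.indexOf('\uFFFD'));
    }
}
```

Once the names have been turned into strings this way, the original bytes can no longer be recovered from the string alone.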

Another issue is that when writing a PDF back, PDFBox does not act symmetrically: it encodes the String representation of the name (which was obtained by a UTF-8 interpretation if the name was read from a PDF) using pure US_ASCII, cf. COSName.writePDF(OutputStream):

public void writePDF(OutputStream output) throws IOException
{
    output.write('/');
    byte[] bytes = getName().getBytes(Charsets.US_ASCII);
    for (byte b : bytes)
    {
        int current = (b + 256) % 256;

        // be more restrictive than the PDF spec, "Name Objects", see PDFBOX-2073
        if (current >= 'A' && current <= 'Z' ||
                current >= 'a' && current <= 'z' ||
                current >= '0' && current <= '9' ||
                current == '+' ||
                current == '-' ||
                current == '_' ||
                current == '@' ||
                current == '*' ||
                current == '$' ||
                current == ';' ||
                current == '.')
        {
            output.write(current);
        }
        else
        {
            output.write('#');
            output.write(String.format("%02X", current).getBytes(Charsets.US_ASCII));
        }
    }
}

Thus, any interesting Unicode character is replaced with the US_ASCII default replacement character, which I assume to be '?'.

So it is quite fortunate that PDF names most often contain only ASCII characters... ;)
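The effect of the US_ASCII step can be checked in isolation. The sketch below is not PDFBox code: the class and method names are hypothetical, and the method only imitates the escaping logic of writePDF() quoted above, returning a String instead of writing to a stream. It relies on the documented behaviour of String.getBytes(Charset), which substitutes the charset's default replacement bytes for unmappable characters.

```java
import java.nio.charset.StandardCharsets;

public class AsciiNameWriteDemo {

    // Imitates COSName.writePDF(): encode as US_ASCII, then escape every
    // byte outside the restricted whitelist as #XX. (Hypothetical helper.)
    public static String toPdfSyntax(String name) {
        StringBuilder out = new StringBuilder("/");
        for (byte b : name.getBytes(StandardCharsets.US_ASCII)) {
            int current = b & 0xFF;
            boolean plain = (current >= 'A' && current <= 'Z')
                    || (current >= 'a' && current <= 'z')
                    || (current >= '0' && current <= '9')
                    || "+-_@*$;.".indexOf(current) >= 0;
            if (plain) {
                out.append((char) current);
            } else {
                out.append('#').append(String.format("%02X", current));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The a-ring (U+00E5) cannot be mapped to US_ASCII, so it is encoded
        // as '?' (0x3F), which the whitelist then escapes.
        System.out.println(toPdfSyntax("R\u00E5cksta")); // prints /R#3Fcksta
        System.out.println(toPdfSyntax("Off"));          // prints /Off
    }
}
```

So under these assumptions the non-ASCII character does not survive a round trip: it ends up in the file as the escaped question mark #3F.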

According to the implementation notes from the PDF 1.4 reference,


In Acrobat 4.0 and earlier versions, a name object being treated as text will typically be interpreted in a host platform encoding, which depends on the operating system and the local language. For Asian languages, this encoding may be something like Shift-JIS or Big Five. Consequently, it will be necessary to distinguish between names encoded this way and ones encoded as UTF-8. Fortunately, UTF-8 encoding is very stylized and its use can usually be recognized. A name that is found not to conform to UTF-8 encoding rules can instead be interpreted according to host platform encoding.

Thus, the sample document at hand seems to follow the conventions of Acrobat 4, i.e. of the last century.
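The fallback described in that implementation note could be sketched roughly as follows. This is an illustration, not PDFBox API: the class and method names are hypothetical, and ISO-8859-1 is simply assumed as the "host platform encoding" because that is what the sample document appears to use. In PDFBox 2.0 such logic would have to operate on the raw name bytes, for example in a patched parseCOSName() in place of the unconditional UTF-8 conversion, because the bytes are no longer available once the name has become a String.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class NameTextDecoder {

    // Tries strict UTF-8 first; if the bytes are not valid UTF-8, falls back
    // to ISO-8859-1 (an assumption standing in for the host platform encoding).
    public static String decode(byte[] nameBytes) {
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(nameBytes)).toString();
        } catch (CharacterCodingException e) {
            return new String(nameBytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        // The same text once as ISO-8859-1 bytes and once as UTF-8 bytes.
        byte[] latin1 = {'R', (byte) 0xE5, 'c', 'k', 's', 't', 'a'};
        byte[] utf8 = "R\u00E5cksta".getBytes(StandardCharsets.UTF_8);

        System.out.println(decode(latin1).equals("R\u00E5cksta"));
        System.out.println(decode(utf8).equals("R\u00E5cksta"));
    }
}
```

A strict decoder is used on purpose: the lenient `new String(bytes, UTF_8)` never fails and would silently substitute U+FFFD instead of signalling that a fallback is needed.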

The source code excerpts are from PDFBox 2.0.0, but at first glance they do not seem to have changed in 2.0.1 or the development trunk.
