Some glyph IDs missing while trying to extract glyph IDs from a PDF

Problem description

Because the Devanagari glyphs in my PDF are not mapped to the correct Unicode characters, I used the following code to extract the glyph IDs and built my own map from those IDs to the proper Unicode characters.

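import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class ExtractCharacterCodes {
    public static void testExtractFromSingNepChar() throws IOException {
        // try-with-resources closes the document when extraction is done
        try (PDDocument document = PDDocument.load(new File("C:/PageSeparator/pattern3.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper() {
                @Override
                protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
                    // Write each glyph's Unicode text followed by its character codes, e.g. "x[123]"
                    for (TextPosition textPosition : textPositions) {
                        writeString(String.format("%s%s", textPosition.getUnicode(), Arrays.toString(textPosition.getCharacterCodes())));
                    }
                }
            };
            //stripper.setSortByPosition(true);
            String text = stripper.getText(document);

            System.out.printf("\n*\n* singNepChar.pdf\n*\n%s\n", text);
        }
    }

    public static void main(String[] args) throws IOException {
        ExtractCharacterCodes.testExtractFromSingNepChar();
    }
}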

When applying this code to a Nepali PDF,

I got the following:स[1434]नु[1418] [3]त[1414]स्[7021]क[1399]र[1426]ी[1440]क[1399]ा[1438] [3]म[1424]खु्[6990]य[1425] [3]अ[1383]ा[4285]ा[1438]र[1426]ो[1451]प[1420]ी[1440] [3]'[39]ग[1401]ो[1451]रे[1426]'[39] [32] क[1399]ा[1438]ठ[1410]म[1424]ा[1438]ड[1411]ौं[7301]क[1399]ो[1451] [3]ग[1401]ौ[1452]र[1426]ी[1440]घ[1402]ा[1438]ट[1409]ब[1422]ा[1438]ट[1409] [3]प[1420]क्र[7059]ा[1438]उ[1387] [32] ज[1406]न[1418]क[1399]र[1426]ा[1438]ज[1406] [3]स[1434]ा[1438]प[1420]क[1399]ो[1451]ट[1409]ा[1438]त[1414]स्[1439]स्[7021]ब[1422]र[1426] [3]:[29] [3]क[1399]स्[1439]ि[1431]न[1418] [3]अ[1383]स्[1439]ध[1417]क[1399]ा[1438]र[1426]ी[1440] [32]|[124] [32]ज[1406]े[1447]ष्ठ[7399] [3] ८[1481],[44] [32]२[1475]०[1473]७[1480]५[1478] [32] and so on

As you can see, the string "सुन" is split into स[1434] and नु[1418]. I started building my own map from glyph ID to character, but in this case a glyph ID is missing. It should be स[1434], न[1441], ु[1418]. How do I get this?

Solution

The cause is that the PDFTextStripper does not merely organize the TextPosition objects it retrieves from the underlying parser into lines and add implied spaces, it also does some additional preprocessing on them before forwarding to writeString. In particular it

  • suppresses duplicate overlapping glyphs: one way to create a poor man's bold effect is to draw glyphs twice with a tiny offset, and these duplicates are suppressed; and it
  • merges TextPosition objects containing a diacritic with the TextPosition containing the corresponding base glyph to a TextPosition representing the combined Unicode code point.

The former processing step can be disabled using PDFTextStripper.setSuppressDuplicateOverlappingText(false) but the latter cannot.

The effect you observe is due to the latter processing step.

If you want to get the glyphs without any preprocessing, i.e. without the duplicate suppression and the diacritic merge but also without organizing them into lines and adding of implied spaces, you can override processTextPosition instead of writeString:

// 'resource' stands for the PDF input, e.g. a File or an InputStream
PDDocument document = PDDocument.load(resource);
PDFTextStripper stripper = new PDFTextStripper() {
    @Override
    protected void processTextPosition(TextPosition textPosition) {
        // Called once per glyph, before any duplicate suppression or diacritic merging
        try {
            writeString(String.format("%s%s", textPosition.getUnicode(), Arrays.toString(textPosition.getCharacterCodes())));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
};

String text = stripper.getText(document);

(ExtractCharacterCodes test testExtractFromPattern3)

The result for your example document now is

स[1434]ु[1441]न[1418] [3]त[1414]स्[7021]क[1399]र[1426]ी[1440]क[1399]ा[1438] [3]...
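With one code per glyph now available, the hand-built map mentioned in the question could look roughly like the following. This is only a minimal sketch: the class name GlyphMapSketch, the method decode, and the U+FFFD fallback are assumptions made for illustration, and only the four code-to-character pairs visible in the output above are filled in. The codes are specific to the font in this document and will not carry over to other PDFs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
final class GlyphMapSketch {
    // Hand-built table: character code -> Unicode text.
    // Entries are taken from the extraction output shown above;
    // a real table needs one entry per glyph used by the font.
    private static final Map<Integer, String> GLYPH_TO_UNICODE = new HashMap<>();
    static {
        GLYPH_TO_UNICODE.put(1434, "\u0938"); // letter sa
        GLYPH_TO_UNICODE.put(1441, "\u0941"); // vowel sign u
        GLYPH_TO_UNICODE.put(1418, "\u0928"); // letter na
        GLYPH_TO_UNICODE.put(3, " ");         // space
    }

    // Translates a sequence of character codes into text.
    // Codes missing from the table become U+FFFD so gaps stay visible.
    static String decode(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String mapped = GLYPH_TO_UNICODE.get(code);
            sb.append(mapped != null ? mapped : "\uFFFD");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Codes in the order processTextPosition reported them
        System.out.println(decode(new int[] {1434, 1441, 1418}));
    }
}
```

Because the vowel sign arrives as its own code directly after the letter sa, concatenating the mapped strings in extraction order already yields the intended sequence sa, vowel sign u, na.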

If you still want the PDFTextStripper to organize the glyphs into lines and add implied spaces, you have to patch that class (or your own copy of it) and, at the end of its processTextPosition implementation, disable the diacritics merging by replacing

// test if we overlap the previous entry.
// Note that we are making an assumption that we need to only look back
// one TextPosition to find what we are overlapping.
// This may not always be true. */
TextPosition previousTextPosition = textList.get(textList.size() - 1);
if (text.isDiacritic() && previousTextPosition.contains(text))
{
    previousTextPosition.mergeDiacritic(text);
}
// If the previous TextPosition was the diacritic, merge it into this
// one and remove it from the list.
else if (previousTextPosition.isDiacritic() && text.contains(previousTextPosition))
{
    text.mergeDiacritic(previousTextPosition);
    textList.remove(textList.size() - 1);
    textList.add(text);
}
else
{
    textList.add(text);
}

by a simple

textList.add(text);


By the way, your test file exposes an error in the PDFBox determination of the base glyph to merge a diacritic with: The "स[1434]ु[1441]न[1418]" is meant to be rendered as "सुन", i.e. the vowel sign u "ु" is combined with the letter sa "स", but PDFBox combines it with the subsequent letter na "न" as "सनु".

The cause is that PDFBox determines the letter to combine the diacritic with by the diacritic's origin, which here indeed lies in the range of the following letter na "न". But the vowel sign glyph is rendered before its origin (it is drawn in an area with a negative x coordinate), so PDFBox determines the wrong association.
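The difference between the two ways of associating a mark with a base letter can be illustrated with a deliberately simplified model. This is not PDFBox's code: the Glyph class, both rule methods, and all coordinates are hypothetical, chosen only so that the mark's origin coincides with the start of the following letter while its ink lies to the left of that origin.

```java
// Simplified, hypothetical model of the two association rules.
final class DiacriticAssociationSketch {
    // A glyph with its origin, advance width, and the x-range that is actually painted.
    static final class Glyph {
        final String name;
        final double originX, advance, inkMinX, inkMaxX;

        Glyph(String name, double originX, double advance, double inkMinX, double inkMaxX) {
            this.name = name;
            this.originX = originX;
            this.advance = advance;
            this.inkMinX = inkMinX;
            this.inkMaxX = inkMaxX;
        }
    }

    // Origin-based rule: pick the base whose [origin, origin + advance) contains the mark's origin.
    static Glyph byOrigin(Glyph mark, Glyph... bases) {
        for (Glyph base : bases) {
            if (mark.originX >= base.originX && mark.originX < base.originX + base.advance) {
                return base;
            }
        }
        return null;
    }

    // Ink-based rule: pick the base whose advance range overlaps the mark's painted range the most.
    static Glyph byInk(Glyph mark, Glyph... bases) {
        Glyph best = null;
        double bestOverlap = 0;
        for (Glyph base : bases) {
            double overlap = Math.min(mark.inkMaxX, base.originX + base.advance)
                    - Math.max(mark.inkMinX, base.originX);
            if (overlap > bestOverlap) {
                best = base;
                bestOverlap = overlap;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up coordinates: sa occupies 0..10, na starts at 10.
        Glyph sa = new Glyph("sa", 0, 10, 0, 10);
        Glyph na = new Glyph("na", 10, 8, 10, 18);
        // Zero-advance mark whose origin is at x = 10 but whose ink spans 4..9.
        Glyph vowelSignU = new Glyph("vowel sign u", 10, 0, 4, 9);

        System.out.println("by origin: " + byOrigin(vowelSignU, sa, na).name);
        System.out.println("by ink:    " + byInk(vowelSignU, sa, na).name);
    }
}
```

With these made-up numbers the origin-based rule selects na, while the ink-based rule selects sa, which mirrors the mismatch described above.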
